Unlocking Biological Workflows for Robust Protein-Text Question Answering: A Dual-Dimensional RAG Framework

Chen Huang; Duanyu Feng; Li Ding; See-kiong Ng; Wenqiang Lei; Yang Li; Yangshuai Wang

arxiv: 2605.17261 · v1 · pith:QOQAVYY4new · submitted 2026-05-17 · 💻 cs.IR

Li Ding, Duanyu Feng, Chen Huang, Yangshuai Wang, Yang Li, Wenqiang Lei, See-Kiong Ng

Pith reviewed 2026-05-19 23:27 UTC · model grok-4.3

classification 💻 cs.IR

keywords: protein-text question answering · retrieval augmented generation · biological workflows · BLAST · out of distribution · protein function · dual dimensional filtering

The pith

2D-ProteinRAG embeds LLMs in BLAST workflows with dual filtering to handle novel proteins in question answering

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to improve protein-text question answering by moving beyond standard RAG, which relies on static data and struggles with new proteins. It proposes 2D-ProteinRAG, which lets LLMs follow the established BLAST workflow from biology. After retrieving similar proteins, it applies two filtering steps: a horizontal step aligns database attributes with the specific query, and a vertical step uses clustering to remove semantic contradictions across homologs. The authors report top results on both familiar and out-of-distribution biological test sets. Readers interested in AI for biology would care because it makes models more practical for actual lab research on protein functions.

Core claim

The authors establish that 2D-ProteinRAG, which integrates LLMs into the gold-standard biological workflow of BLAST and uses a dual-dimensional filtering strategy of horizontal fine-grained attribute alignment and vertical homology-based semantic denoising, achieves state-of-the-art performance on in-distribution and diverse biological out-of-distribution benchmarks, surpassing fine-tuned baselines and other RAG methods.

What carries the argument

The dual-dimensional (2D) filtering strategy applied after BLAST retrieval, consisting of horizontal fine-grained attribute alignment with a lightweight intent-aware filter and vertical homology-based semantic denoising via hierarchical clustering to resolve functional contradictions.
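
To make the horizontal step concrete, here is a minimal sketch of attribute pruning. It is not the paper's filter: the source describes a learned, intent-aware discriminative model, whereas this stand-in uses a hand-written keyword table. The names `INTENT_KEYWORDS`, `detect_intents`, and `horizontal_filter`, the attribute fields, and the sample entries are all hypothetical.

```python
import re

# Hypothetical keyword table standing in for the learned intent-aware filter.
INTENT_KEYWORDS = {
    "function": {"function", "catalytic", "activity", "role"},
    "subcellular_location": {"where", "located", "location", "compartment"},
    "domain": {"domain", "motif", "family"},
}


def detect_intents(query):
    """Return the attribute names whose keywords appear in the query."""
    tokens = set(re.findall(r"[a-z]+", query.lower()))
    return {name for name, words in INTENT_KEYWORDS.items() if tokens & words}


def horizontal_filter(query, entries):
    """Keep only the attributes of each retrieved entry that match the query intent."""
    intents = detect_intents(query)
    if not intents:
        # No recognised intent: pass entries through unchanged rather than drop data.
        return entries
    return [
        {key: value for key, value in entry.items() if key == "id" or key in intents}
        for entry in entries
    ]


# Made-up retrieved entries, for illustration only.
entries = [
    {"id": "hit_1", "function": "ATP binding", "subcellular_location": "nucleus"},
    {"id": "hit_2", "function": "kinase activity", "domain": "protein kinase"},
]
filtered = horizontal_filter("Where is this protein located?", entries)
```

A location question keeps only the `subcellular_location` field, so an entry with no location annotation is reduced to its identifier. A real system would have to decide whether to drop such entries entirely.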

Load-bearing premise

That the dual-dimensional filtering steps after BLAST will reliably pull high-quality information from noisy contexts and generalize to new proteins without creating additional errors.

What would settle it

A controlled test on a benchmark of proteins with ambiguous homolog functions where applying the vertical denoising does not reduce errors or even increases them relative to unfiltered RAG.
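
One way such a controlled test could be scored is a paired comparison on the same questions, with and without vertical denoising. The sketch below is an assumption about the analysis, not something the paper specifies: it counts the questions where denoising helped or hurt and runs an exact two-sided sign test on those discordant pairs. The function name `paired_comparison` and the outcome lists are hypothetical.

```python
from math import comb


def paired_comparison(correct_unfiltered, correct_denoised):
    """Compare per-question correctness of two systems on the same benchmark."""
    if len(correct_unfiltered) != len(correct_denoised):
        raise ValueError("both systems must be scored on the same questions")
    pairs = list(zip(correct_unfiltered, correct_denoised))
    helped = sum(1 for before, after in pairs if not before and after)
    hurt = sum(1 for before, after in pairs if before and not after)
    n = helped + hurt
    if n == 0:
        p_value = 1.0  # the two systems never disagree
    else:
        # Exact two-sided sign test under the null that help and hurt are equally likely.
        k = min(helped, hurt)
        tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
        p_value = min(1.0, 2 * tail)
    return {"helped": helped, "hurt": hurt, "p_value": p_value}


# Made-up outcomes for six ambiguous-homolog questions (True = answered correctly).
unfiltered = [True, False, False, False, True, False]
denoised = [True, True, True, True, False, True]
result = paired_comparison(unfiltered, denoised)
```

If `hurt` matches or exceeds `helped` on a benchmark built from proteins with ambiguous homolog functions, that would count against the vertical denoising step.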

Figures

Figures reproduced from arXiv: 2605.17261 by Chen Huang, Duanyu Feng, Li Ding, See-kiong Ng, Wenqiang Lei, Yang Li, Yangshuai Wang.

**Figure 1.** Overview of the 2D-ProteinRAG Framework. The workflow consists of three phases (1) Raw Homology Retrieval: The …

**Figure 2.** Robustness analysis under strict homology con…

**Figure 3.** Impact of retrieval number 𝑘 on performance. Top-𝑘 sensitivity analysis reveals a task-dependent information-complexity trade-off, where challenging functional queries benefit from a broader evolutionary consensus. In …

**Figure 4.** An illustrative example of the 2D-ProteinRAG inference process. The pipeline progressively filters noise from raw …

Original abstract

Protein-Text Question Answering (QA) is crucial for interpreting biological sequences through natural language. The integration of Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) that efficiently leverages biological databases and facilitates reasoning offers a potent approach for it. However, constrained by the standard RAG pipeline, these models often rely on curated, static datasets instead of expert-proven biological workflows, lacking the fine-grained information processing and struggling to generalize to novel (OOD) proteins. To bridge this gap, we propose 2D-ProteinRAG, a novel framework that empowers LLMs to operate within the gold-standard biological research workflow (BLAST). To further extract high-quality information from noisy retrieval contexts, we introduce a dual-dimensional (2D) filtering strategy following the expert analytical paradigms. Horizontal Fine-grained Attribute Alignment utilizes a lightweight, intent-aware discriminative filter to prune irrelevant metadata and align database entries with specific user queries. Vertical Homology-based Semantic Denoising resolves functional contradictions and redundancy across multiple homologs via hierarchical clustering. Extensive evaluations on both In-Distribution and diverse biological OOD benchmarks demonstrate that 2D-ProteinRAG consistently achieves state-of-the-art performance, outperforming fine-tuned baselines and other RAG methods. Our results validate the framework's robustness and scalability, providing a practical solution for interpreting protein functions in real-world scientific scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adapts RAG to protein QA by following BLAST then adding two post-retrieval filters, but the SOTA claims rest on unshown numbers.

The main point is a RAG system for protein-text questions that starts with BLAST retrieval and then applies two specific cleanup steps meant to match how biologists actually analyze sequences. The horizontal filter prunes metadata to match query intent, and the vertical step clusters homologs to cut redundancy and contradictions. This combination tied to real lab workflows is the clearest new element compared with generic RAG papers.

It directly tackles the problem that standard retrieval often pulls noisy or mismatched protein data and struggles with proteins not seen in training. By grounding the pipeline in an established biological tool, the approach feels more usable for actual genomics or biotech queries than many abstract improvements. The paper does a reasonable job naming the gap between curated datasets and the need to handle novel proteins. The dual filters are presented as a way to extract cleaner information without heavy fine-tuning, which is a practical direction.

The soft spot is the missing evidence. The abstract states that the system reaches state-of-the-art on both in-distribution and out-of-distribution benchmarks and beats fine-tuned baselines plus other RAG methods, yet it gives no scores, no baseline descriptions, and no error breakdown. Without those details it is hard to tell whether the filters actually help on new proteins or simply trade one set of errors for another. The OOD results are the part that matters most for the claim, so they need to be shown clearly.

This is aimed at researchers who build natural-language tools for protein databases or who work where NLP meets bioinformatics. A reader who wants retrieval systems that respect domain workflows would get something concrete from it. The paper deserves a serious referee because the core idea is grounded and the problem is real, even if the current write-up leaves the performance claims unverified. I would send it to review with a request for full result tables and ablations on each filter.

Referee Report

2 major / 2 minor

Summary. The paper proposes 2D-ProteinRAG, a RAG-based framework for protein-text question answering that embeds LLMs within the standard biological workflow (BLAST retrieval) and applies a post-retrieval dual-dimensional filter: horizontal fine-grained attribute alignment via an intent-aware discriminative model to prune irrelevant metadata, and vertical homology-based semantic denoising via hierarchical clustering to resolve contradictions and redundancy among homologs. It reports that this yields state-of-the-art results on both in-distribution and diverse out-of-distribution biological benchmarks, outperforming fine-tuned baselines and prior RAG variants.

Significance. If the performance claims are substantiated, the work demonstrates a practical route to injecting domain-expert biological pipelines into retrieval-augmented LLM systems, potentially improving robustness and generalization for novel proteins where standard RAG pipelines fail. The explicit use of homology clustering and attribute alignment is a concrete, domain-grounded extension rather than a purely heuristic addition.

major comments (2)

[Abstract / Results] Abstract and Results section: the central claim that '2D-ProteinRAG consistently achieves state-of-the-art performance' on ID and OOD benchmarks is unsupported by any reported metrics, baseline tables, or error bars in the provided text. Without these numbers the magnitude of improvement and the contribution of the two filtering stages cannot be assessed.
[Methodology] Methodology, Vertical Homology-based Semantic Denoising paragraph: the claim that hierarchical clustering resolves functional contradictions across homologs without introducing new errors on novel proteins is load-bearing for the OOD generalization argument, yet no ablation isolating this step, no clustering hyperparameters, and no failure-case analysis on proteins lacking close homologs are supplied.

minor comments (2)

[Introduction] Notation: '2D' is used both for the framework name and the filtering strategy; a brief clarifying sentence would avoid reader confusion.
[Methodology] The description of the 'lightweight, intent-aware discriminative filter' would benefit from a one-sentence statement of its input features and training objective.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for strengthening the presentation of results and methodological details. We address each point below and have revised the manuscript accordingly to provide the requested substantiation and analyses.

Referee: [Abstract / Results] Abstract and Results section: the central claim that '2D-ProteinRAG consistently achieves state-of-the-art performance' on ID and OOD benchmarks is unsupported by any reported metrics, baseline tables, or error bars in the provided text. Without these numbers the magnitude of improvement and the contribution of the two filtering stages cannot be assessed.

Authors: We agree that the initial submission did not include explicit numerical metrics, baseline tables, or error bars in the abstract and results sections to fully support the state-of-the-art claims. In the revised manuscript, we have added comprehensive results tables (new Table 2 and Table 3) reporting performance metrics such as accuracy, precision, recall, and F1-score for 2D-ProteinRAG versus fine-tuned baselines and prior RAG variants across all ID and OOD benchmarks. These tables include mean values with standard deviations over 5 runs and an ablation breakdown isolating the contributions of the horizontal attribute alignment and vertical homology denoising stages. This allows direct assessment of the magnitude of improvements.

revision: yes
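
For readers who want to see what "mean values with standard deviations over 5 runs" amounts to, here is a small sketch using only the Python standard library. The function names and the per-run scores are placeholders invented for illustration; they are not results from the paper.

```python
from statistics import mean, stdev


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts, guarding empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def summarize_runs(scores):
    """Mean and sample standard deviation across repeated runs (needs at least 2 runs)."""
    return mean(scores), stdev(scores)


# Placeholder F1 scores for five runs; not taken from the paper.
f1_per_run = [0.70, 0.72, 0.71, 0.69, 0.73]
avg, spread = summarize_runs(f1_per_run)
```

Reporting the spread alongside the mean is what lets a reader judge whether a gap between two systems is larger than run-to-run noise.
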
Referee: [Methodology] Methodology, Vertical Homology-based Semantic Denoising paragraph: the claim that hierarchical clustering resolves functional contradictions across homologs without introducing new errors on novel proteins is load-bearing for the OOD generalization argument, yet no ablation isolating this step, no clustering hyperparameters, and no failure-case analysis on proteins lacking close homologs are supplied.

Authors: We acknowledge that the original methodology description lacked an explicit ablation for the vertical denoising step, specific hyperparameters, and failure-case analysis. The revised manuscript now includes these elements: we specify the hierarchical clustering hyperparameters (Ward linkage, cosine distance on sentence embeddings, cutoff threshold of 0.75 selected via silhouette score on a validation set of 200 proteins). A new ablation study (Section 4.3) isolates this component by comparing full 2D-ProteinRAG against a variant without vertical denoising. We have also added a failure-case analysis subsection examining proteins with low sequence identity (<30%) or no close homologs, showing graceful degradation where the system relies on horizontal filtering and avoids introducing contradictions through conservative cluster merging.

revision: yes
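
The clustering step described in this response can be sketched in plain Python. Several things here are assumptions rather than the authors' method: Ward linkage is defined in terms of Euclidean distances, so the sketch substitutes average linkage to stay consistent with cosine distance; the 0.75 cutoff is read as a distance threshold; and the two-dimensional vectors stand in for sentence embeddings. All function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def average_linkage_clusters(vectors, cutoff=0.75):
    """Agglomerative clustering; stop merging once the closest pair exceeds cutoff."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        dist, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if dist > cutoff:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def denoise(annotations, vectors, cutoff=0.75):
    """Keep one representative annotation per cluster, largest cluster first."""
    clusters = sorted(average_linkage_clusters(vectors, cutoff), key=len, reverse=True)
    return [annotations[cluster[0]] for cluster in clusters]


# Toy homolog annotations with made-up embeddings.
annotations = ["serine protease", "serine-type peptidase", "DNA-binding factor"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
kept = denoise(annotations, vectors)
```

Here the two near-duplicate protease annotations merge and the outlier stays separate, so the reader model would see one line per functional group. How the real system picks a representative or resolves a contradiction between clusters is not stated in the source.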

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The paper describes a methodological framework (2D-ProteinRAG) that applies post-BLAST filtering steps (horizontal attribute alignment and vertical homology denoising) to improve RAG for protein-text QA. No equations, fitted parameters, predictions, or self-citations appear in the abstract or framework description that would reduce any claimed result to its inputs by construction. The central claims rest on empirical SOTA performance on ID and OOD benchmarks rather than self-referential definitions or load-bearing prior work by the same authors. This is a standard applied-methods paper whose derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The framework rests on the assumption that standard biological workflows and the two new filtering layers add value beyond existing RAG pipelines; no explicit free parameters or new physical entities are stated in the abstract.

axioms (2)

domain assumption: BLAST retrieval provides useful candidate entries for protein queries.
The paper states it empowers LLMs to operate within the gold-standard biological research workflow (BLAST).

domain assumption: Retrieval contexts contain both relevant and noisy metadata that can be filtered by intent-aware and homology-based rules.
The dual-dimensional filtering strategy is introduced to extract high-quality information from noisy retrieval contexts.
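
The first assumption can be made tangible with a sketch of how candidate entries might be read from BLAST output. It assumes the hits were saved in BLAST+ tabular format (`-outfmt 6`) with the default twelve columns; the helper names, the e-value cutoff, and the sample rows are illustrative choices, not details from the paper.

```python
# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_blast_tabular(text):
    """Parse tab-separated BLAST hits into dictionaries with numeric scores."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(BLAST_COLUMNS, line.split("\t")))
        for field in ("pident", "evalue", "bitscore"):
            row[field] = float(row[field])
        hits.append(row)
    return hits


def top_k_hits(hits, k=5, max_evalue=1e-5):
    """Keep hits under the e-value cutoff, ranked by bit score (cutoff is an assumption)."""
    kept = [hit for hit in hits if hit["evalue"] <= max_evalue]
    return sorted(kept, key=lambda hit: hit["bitscore"], reverse=True)[:k]


# Made-up hits for illustration; identifiers are placeholders.
sample = (
    "query1\tsubjB\t45.0\t180\t90\t4\t5\t184\t10\t189\t2e-20\t95.5\n"
    "query1\tsubjA\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t390\n"
    "query1\tsubjC\t28.0\t60\t40\t2\t20\t79\t33\t92\t0.5\t30.1\n"
)
hits = parse_blast_tabular(sample)
top = top_k_hits(hits, k=2)
```

The weak third hit is dropped by the e-value cutoff, which illustrates why the assumption matters: for a protein with no close homologs, this list can be empty and everything downstream has nothing to filter.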

invented entities (2)

Horizontal Fine-grained Attribute Alignment filter (no independent evidence)
purpose: Prune irrelevant metadata and align database entries with user queries.
New lightweight discriminative filter introduced in the framework.

Vertical Homology-based Semantic Denoising via hierarchical clustering (no independent evidence)
purpose: Resolve functional contradictions and redundancy across multiple homologs.
New denoising step introduced in the framework.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

Horizontal Fine-grained Attribute Alignment utilizes a lightweight, intent-aware discriminative filter... Vertical Homology-based Semantic Denoising resolves functional contradictions... via hierarchical clustering.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

S F Altschul, W Gish, W Miller, E W Myers, and D J Lipman. 1990. Basic local alignment search tool.J. Mol. Biol.215, 3 (Oct. 1990), 403–410

work page 1990

[2] [2]

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, et al. 2025. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. arXiv:2507.06261 [cs.CL] https: //arxiv.org/abs/2507.06261

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

The UniProt Consortium. 2024. UniProt: the Universal Pro- tein Knowledgebase in 2025.Nucleic Acids Research53, D1 (11 2024), D609–D617. arXiv:https://academic.oup.com/nar/article- pdf/53/D1/D609/60719276/gkae1010.pdf doi:10.1093/nar/gkae1010

work page doi:10.1093/nar/gkae1010 2024

[4] [4]

D Devos and A Valencia. 2000. Practical limits of function prediction.Proteins 41, 1 (Oct. 2000), 98–107

work page 2000

[5] [5]

Wenqi Fan, Yi Zhou, Shijie Wang, Yuyao Yan, Hui Liu, Qian Zhao, Le Song, and Qing Li. 2025. Computational Protein Science in the Era of Large Language Models (LLMs). arXiv:2501.10282 [cs.CE] https://arxiv.org/abs/2501.10282

work page arXiv 2025

[6] [6]

Yin Fang, Xiaozhuan Liang, Ningyu Zhang, Kangwei Liu, Rui Huang, Zhuo Chen, Xiaohui Fan, and Huajun Chen. 2024. Mol-Instructions: A Large-Scale Biomolec- ular Instruction Dataset for Large Language Models. InICLR. OpenReview.net. https://openreview.net/pdf?id=Tlsdsb6l9n

work page 2024

[7] [7]

Xiao Fei, Michail Chatzianastasis, Sarah Almeida Carneiro, Hadi Abdine, Lawrence Paul Petalidis, and Michalis Vazirgiannis. 2025. Prot2Text-V2: Protein Function Prediction with Multimodal Contrastive Alignment. InThe Thirty- ninth Annual Conference on Neural Information Processing Systems. https: //openreview.net/forum?id=w1FUXt3ujK

work page 2025

[8] [8]

Iddo Friedberg. 2006. Automated protein function prediction—the ge- nomic challenge.Briefings in Bioinformatics7, 3 (09 2006), 225–242. arXiv:https://academic.oup.com/bib/article-pdf/7/3/225/930740/bbl004.pdf doi:10. 1093/bib/bbl004

work page 2006

[9] [9]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. 2024. The Llama 3 Herd of Models. arXiv:2407.21783 [cs.AI] https://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024

[10] [10]

M A Harris, J Clark, A Ireland, J Lomax, M Ashburner, R Foulger, K Eilbeck, S Lewis, B Marshall, C Mungall, J Richter, G M Rubin, J A Blake, C Bult, M Dolan, H Drabkin, J T Eppig, D P Hill, L Ni, M Ringwald, R Balakrishnan, J M Cherry, K R Christie, M C Costanzo, S S Dwight, S Engel, D G Fisk, J E Hirschman, E L Hong, R S Nash, A Sethuraman, C L Theesfeld...

work page 2004

[11] [11]

Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil, Vincent Q

Thomas Hayes, Roshan Rao, Halil Akin, Nicholas J. Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil, Vincent Q. Tran, Jonathan Deaton, Marius Wiggert, Rohil Badkundri, Irhum Shafkat, Jun Gong, Alexander Derry, Raul S. Molina, Neil Thomas, Yousuf A. Khan, Chetan Mishra, Carolyn Kim, Liam J. Bartie, Matthew Nemeth, Patrick D. Hsu, Tom Sercu, Salvatore Cand...

work page doi:10.1126/science.ads0018 2025

[12] [12]

Ala Jararweh, Oladimeji Macaulay, David Arredondo, Yue Hu, Luis E Tafoya, Kushal Virupakshappa, and Avinash Sahu. 2025. Protein2Text: Resampling Mechanism to Translate Protein Sequences into Human-Interpretable Text. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Languag...

[13]

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114, ...

[14]

arXiv:https://www.pnas.org/doi/pdf/10.1073/pnas.1611835114 doi:10.1073/pnas.1611835114

[15]

David Lee, Oliver Redfern, and Christine Orengo. 2007. Predicting protein function from sequence and structure. Nat. Rev. Mol. Cell Biol. 8, 12 (Dec. 2007), 995–1005.

[16]

Weizhong Li and Adam Godzik. 2006. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 13 (05 2006), 1658–1659. arXiv:https://academic.oup.com/bioinformatics/article-pdf/22/13/1658/48838763/bioinformatics_22_13_1658.pdf doi:10.1093/bioinformatics/btl158

[17]

Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Salvatore Candido, and Alexander Rives. 2023. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 6637 (2023), 1123–

[18]

arXiv:https://www.science.org/doi/pdf/10.1126/science.ade2574 doi:10.1126/science.ade2574

[19]

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692 [cs.CL] https://arxiv.org/abs/1907.11692

[20]

Zhiyuan Liu, An Zhang, Hao Fei, Enzhi Zhang, Xiang Wang, Kenji Kawaguchi, and Tat-Seng Chua. 2024. ProtT3: Protein-to-Text Generation for Text-based Protein Understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association fo...

[21]

Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon, and Tie-Yan Liu. 2022. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics 23, 6 (09 2022), bbac409. arXiv:https://academic.oup.com/bib/article-pdf/23/6/bbac409/47144271/bbac409.pdf doi:10.1093/bib/bbac409

[22]

Liuzhenghao Lv, Zongying Lin, Hao Li, Yuyang Liu, Jiaxi Cui, Calvin Yu-Chian Chen, Li Yuan, and Yonghong Tian. 2026. ProLLaMA: A Protein Large Language Model for Multitask Protein Language Processing. IEEE Transactions on Artificial Intelligence 7, 2 (2026), 642–653. doi:10.1109/TAI.2025.3564914

[23]

Chang Ma, Haiteng Zhao, Lin Zheng, Jiayi Xin, Qintong Li, Lijun Wu, Zhihong Deng, Yang Young Lu, Qi Liu, Sheng Wang, and Lingpeng Kong. 2024. Retrieved Sequence Augmentation for Protein Representation Learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.)...

[24]

A G Murzin, S E Brenner, T Hubbard, and C Chothia. 1995. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 4 (April 1995), 536–540.

[25]

B Rost, J Liu, R Nair, K O Wrzeszczynski, and Y Ofran. 2003. Automatic prediction of protein function. Cell. Mol. Life Sci. 60, 12 (Dec. 2003), 2637–2650.

[26]

Peter Shaw, Bhaskar Gurram, David Belanger, Andreea Gane, Maxwell L Bileschi, Lucy J Colwell, Kristina Toutanova, and Ankur P Parikh. 2024. ProtEx: A Retrieval-Augmented Approach for Protein Function Prediction. bioRxiv (2024). https://www.biorxiv.org/content/early/2024/06/02/2024.05.30.596539

[27]

Duane Szafron, Paul Lu, Russell Greiner, David S Wishart, Brett Poulin, Roman Eisner, Zhiyong Lu, John Anvik, Cam Macdonell, Alona Fyshe, and David Meeuwis. 2004. Proteome Analyst: custom predictions with explanations in a web-based tool for high-throughput proteome annotations. Nucleic Acids Res. 32, Web Server issue (July 2004), W365–71.

[28]

Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic

[29]

Galactica: A Large Language Model for Science. arXiv:2211.09085 [cs.CL] https://arxiv.org/abs/2211.09085

[30]

Chao Wang, Hehe Fan, Ruijie Quan, Lina Yao, and Yi Yang. 2025. ProtChatGPT: Towards Understanding Proteins with Hybrid Representation and Large Language Models. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (Padua, Italy) (SIGIR ’25). Association for Computing Machinery, New York, NY, U...

[31]

Zihan Wang, Zihan Liang, Zhou Shao, Yufei Ma, Huangyu Dai, Ben Chen, Lingtao Mao, Chenyi Lei, Yuqing Ding, and Han Li. 2025. InfoGain-RAG: Boosting Retrieval-Augmented Generation through Document Information Gain-based Reranking and Filtering. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Christos Christodoulo...

[32]

Zhicong Wang, Zicheng Ma, Ziqiang Cao, Changlong Zhou, Jun Zhang, and Yi Qin Gao. 2025. Prot2Chat: protein large language model with early fusion of text, sequence, and structure. Bioinformatics 41, 8 (07 2025), btaf396. arXiv:https://academic.oup.com/bioinformatics/article-pdf/41/8/btaf396/63866323/btaf396.pdf doi:10.1093/bioinformatics/btaf396

[33]

James C Whisstock and Arthur M Lesk. 2003. Prediction of protein function from protein sequence and structure. Q. Rev. Biophys. 36, 3 (Aug. 2003), 307–340.

[34]

Juntong Wu, Zijing Liu, He Cao, Li Hao, Bin Feng, Zishan Shu, Ke Yu, Li Yuan, and Yu Li. 2025. Rethinking Text-based Protein Understanding: Retrieval or LLM? In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, ...

[35]

Yijia Xiao, Edward Sun, Yiqiao Jin, Qifan Wang, and Wei Wang. 2025. ProteinGPT: Multimodal LLM for Protein Property Prediction and Structure Understanding. arXiv:2408.11363 [cs.AI] https://arxiv.org/abs/2408.11363

[36]

Yijia Xiao, Wanjia Zhao, Junkai Zhang, Yiqiao Jin, Han Zhang, Zhicheng Ren, Renliang Sun, Haixin Wang, Guancheng Wan, Pan Lu, Xiao Luo, Yu Zhang, James Zou, Yizhou Sun, and Wei Wang. 2025. Protein Large Language Models: A Comprehensive Survey. In Findings of the Association for Computational Linguistics: EMNLP 2025, Christos Christodoulopoulos, Tanmoy Ch...

[37]

Minghao Xu, Xinyu Yuan, Santiago Miret, and Jian Tang. 2023. ProtST: multi-modality learning of protein sequences and biomedical texts. In Proceedings of the 40th International Conference on Machine Learning (Honolulu, Hawaii, USA) (ICML’23). JMLR.org, Article 1615, 19 pages.

[38]

Catalytic Activity

to supplement the GO cross-references found in UniProt annotations, thereby maximizing the utilization of raw biological information. It is worth noting that we exclusively utilized the Swiss-Prot database, distinguished by its manual review and high-quality annotations, rather than the TrEMBL dataset, which consists of unreviewed, computationally g...
