arxiv: 2606.22823 · v1 · pith:BXN3AJEFnew · submitted 2026-06-22 · 💻 cs.LG · q-bio.QM

Retrieval-Augmented Multimodal Learning for Enzyme-Substrate Interaction Prediction Under Low-Homology Shift

Pith reviewed 2026-06-26 09:17 UTC · model grok-4.3

classification 💻 cs.LG q-bio.QM
keywords enzyme-substrate interaction · retrieval-augmented learning · multimodal representation · low-homology distribution shift · biocatalyst discovery · sequence-identity split · pairwise prediction

The pith

Retrieval of neighboring enzymes improves substrate interaction predictions when test sequences share little identity with training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RAMMESI to handle enzyme-substrate interaction prediction when test enzymes have low sequence identity to training examples and positive labels are sparse. It builds explicit pairwise representations by modeling directional cross-modal interactions between enzymes and substrates, then fuses them adaptively. At inference the model retrieves similar enzymes, recombines each with the query substrate, and aggregates the resulting predictions as extra evidence. An imbalance-aware weighted binary cross-entropy loss further stabilizes learning on rare positives. Experiments on two benchmarks with sequence-identity-aware splits show consistent gains that are largest in the hardest low-identity regimes, and the retrieval step improves other backbones in a plug-and-play way.
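To make the idea of a sequence-identity-aware evaluation concrete, here is a toy sketch that keeps only test enzymes whose best match in the training set falls below an identity threshold. It is an illustration under stated assumptions, not the paper's protocol: `identity` uses `difflib.SequenceMatcher` as a crude stand-in for alignment-based sequence identity, and the function names, sequences, and the 0.4 threshold are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def identity(a: str, b: str) -> float:
    # Assumption: a string-similarity ratio stands in for alignment-based
    # sequence identity; a real pipeline would use an alignment tool.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def filter_test_by_identity(train: List[str], test: List[str],
                            max_identity: float) -> List[str]:
    """Keep test sequences whose closest training sequence is at or
    below max_identity."""
    return [t for t in test
            if max(identity(t, s) for s in train) <= max_identity]


# Hypothetical toy sequences, not taken from any benchmark.
train = ["MKTAYIAKQR", "MKTAYIAKQL"]
test = ["MKTAYIAKQK", "GSHWPLLVNE"]
low_identity_test = filter_test_by_identity(train, test, 0.4)
```

Here the first test sequence nearly duplicates a training sequence and is excluded from the low-identity bin, while the second shares almost nothing with the training set and is kept.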

Core claim

RAMMESI learns explicit pairwise enzyme-substrate representations through directional cross-modal interaction modeling and adaptive fusion, trained with an imbalance-aware weighted-BCE objective. At inference it retrieves neighboring enzymes, recombines each with the query substrate, and aggregates the pairwise predictions as contextual evidence. This yields stronger performance on ESI benchmarks under sequence-identity-aware splits, especially in low-identity regimes, and the retrieval step improves multiple ESI backbones in a plug-and-play manner.

What carries the argument

The retrieval module that retrieves neighboring enzymes at inference time, recombines them with the query substrate, and aggregates the resulting pairwise predictions as contextual evidence.
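The retrieve, recombine, and aggregate pattern described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes cosine similarity over precomputed enzyme embeddings, a plain mean over neighbor predictions, and a blending weight `alpha`, none of which are specified in the abstract. The names `retrieve_neighbors`, `retrieval_augmented_score`, and `toy_predict` are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_neighbors(query: Vector, pool: Dict[str, Vector], k: int) -> List[str]:
    # Rank pool enzymes by similarity to the query enzyme, most similar first.
    ranked = sorted(pool, key=lambda name: cosine(query, pool[name]), reverse=True)
    return ranked[:k]


def retrieval_augmented_score(query: Vector, substrate: Vector,
                              pool: Dict[str, Vector],
                              predict: Callable[[Vector, Vector], float],
                              k: int = 2, alpha: float = 0.5) -> float:
    base = predict(query, substrate)
    neighbors = retrieve_neighbors(query, pool, k)
    if not neighbors:
        return base
    # Recombine each retrieved enzyme with the *query* substrate, then average.
    context = sum(predict(pool[n], substrate) for n in neighbors) / len(neighbors)
    # Assumption: a fixed convex blend of the base and contextual scores.
    return alpha * base + (1 - alpha) * context


def toy_predict(enzyme: Vector, substrate: Vector) -> float:
    # Hypothetical stand-in for a trained pairwise model: sigmoid of a dot product.
    z = sum(e * s for e, s in zip(enzyme, substrate))
    return 1.0 / (1.0 + math.exp(-z))


pool = {"enzA": [1.0, 0.0], "enzB": [0.9, 0.1], "enzC": [-1.0, 0.0]}
query = [0.8, 0.2]
substrate = [2.0, 0.0]
score = retrieval_augmented_score(query, substrate, pool, toy_predict, k=2)
```

Because `predict` is passed in as an argument, the same wrapper can sit on top of any pairwise scorer, which is the sense in which such a step is plug-and-play.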

If this is right

  • Retrieval supplies a general mechanism that can be added to existing ESI backbones to increase robustness under homology shift.
  • Explicit pairwise representations combined with adaptive fusion capture directional enzyme-substrate interactions more effectively than unimodal baselines.
  • The imbalance-aware weighted-BCE objective mitigates the effect of sparse positive supervision on learning.
  • Performance advantages grow as sequence identity between train and test enzymes decreases, enabling more reliable screening in large biochemical spaces.
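For readers unfamiliar with the loss named in the third point, here is a minimal sketch of a weighted binary cross-entropy. It assumes a single positive-class weight set to the negative-to-positive ratio, which is a common choice but not confirmed as the paper's weighting function (the Figure 3 caption indicates the paper compares several). The function name and toy values are hypothetical.

```python
import math
from typing import Sequence


def weighted_bce(probs: Sequence[float], labels: Sequence[int],
                 pos_weight: float, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy with the positive term scaled by pos_weight."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# Hypothetical toy batch with one positive among four pairs.
labels = [1, 0, 0, 0]
probs = [0.3, 0.2, 0.1, 0.4]
pos_weight = labels.count(0) / labels.count(1)  # assumption: n_neg / n_pos
plain = weighted_bce(probs, labels, 1.0)
weighted = weighted_bce(probs, labels, pos_weight)
```

With `pos_weight` above 1, an under-confident prediction on the rare positive contributes more to the loss than it would under plain BCE, so the gradient is less dominated by the many negatives.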

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same retrieval-at-inference pattern could be tested on other protein-ligand tasks that suffer from homology shift, such as predicting binding for novel protein families.
  • If the retrieved neighbors are drawn from public enzyme databases, the approach may reduce the labeled data needed to screen new biocatalysts.
  • Extending the retrieval step to also pull in substrate analogs might further improve generalization when both enzyme and substrate novelty are high.

Load-bearing premise

Retrieving neighboring enzymes at inference and aggregating their predictions supplies useful contextual evidence that improves robustness under low-homology shift rather than merely averaging correlated errors.

What would settle it

An experiment showing that replacing retrieved neighbors with random enzymes produces the same accuracy gain on low-identity test sets, or that removing the retrieval step leaves low-identity performance unchanged.
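One way the random-neighbor control could be set up is sketched below. Everything here is illustrative: the similarity table and the per-neighbor scores are made-up toy values, not results, and the function names are hypothetical. The point is only the shape of the comparison, nearest neighbors versus uniformly sampled neighbors with the same `k` and the same aggregation.

```python
import random
from statistics import mean
from typing import Dict, List, Tuple


def nearest_neighbors(query_id: str, similarity: Dict[Tuple[str, str], float],
                      pool: List[str], k: int) -> List[str]:
    return sorted(pool, key=lambda e: similarity[(query_id, e)], reverse=True)[:k]


def random_neighbors(pool: List[str], k: int, rng: random.Random) -> List[str]:
    # Control arm: same k, but neighbors are drawn uniformly from the pool.
    return rng.sample(pool, k)


def aggregated_score(neighbors: List[str], neighbor_scores: Dict[str, float]) -> float:
    return mean(neighbor_scores[e] for e in neighbors)


pool = ["e1", "e2", "e3", "e4"]
# Toy similarities between a query enzyme "q" and each pool enzyme (made up).
similarity = {("q", "e1"): 0.9, ("q", "e2"): 0.8, ("q", "e3"): 0.2, ("q", "e4"): 0.1}
# Toy model outputs for (pool enzyme, query substrate) pairs (made up).
neighbor_scores = {"e1": 0.85, "e2": 0.80, "e3": 0.30, "e4": 0.20}

rng = random.Random(0)  # seeded so the control is reproducible
nearest = aggregated_score(nearest_neighbors("q", similarity, pool, 2), neighbor_scores)
control = mean(aggregated_score(random_neighbors(pool, 2, rng), neighbor_scores)
               for _ in range(1000))
```

In a real study the two arms would be scored against held-out labels on the low-identity split; if the random arm matched the nearest-neighbor arm there, the gain would look like averaging rather than contextual evidence.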

Figures

Figures reproduced from arXiv: 2606.22823 by Bingxin Zhou, Chen Liu, Guisheng Fan, Liang Hong, Ming Li, Xinyuan Wang.

Figure 1: Schematic illustration of substrate binding, complex …
Figure 2: Overview of the proposed RAMMESI framework. (a) RAMMESI encodes enzyme and substrate inputs, models their …
Figure 3: Comparison of weighting functions under the weighted-BCE framework. Here, …
Figure 4: Recall and AUPRC performance on ESP-DB and …
Figure 5: Inference-time retrieval performance on two datasets with different numbers of neighbors. Base is the model without …
Figure 6: Docking visualization for two case-study pairs: (a) verified positive …
Original abstract

Enzyme substrate interaction (ESI) prediction is a fundamental computational task for biocatalyst discovery and reaction screening in large biochemical spaces. In practical settings, ESI prediction is challenged by sparse positive supervision and low-homology distribution shift, where test enzymes share limited sequence identity with those observed during training. To address these challenges, we propose RAMMESI, a retrieval-augmented multimodal framework for robust ESI prediction. RAMMESI learns explicit pairwise enzyme-substrate representations through directional cross-modal interaction modeling and adaptive fusion. To enhance robustness, RAMMESI retrieves neighboring enzymes at inference time, recombines them with the query substrate, and aggregates the resulting pairwise predictions as contextual evidence. To improve learning under sparse positive supervision, we further adopt an imbalance-aware weighted-BCE objective. Experiments on two ESI benchmarks under sequence-identity-aware splits demonstrate that RAMMESI achieves consistently strong performance, with particular advantages in more challenging low-identity regimes. In addition, the retrieval module improves multiple ESI backbones in a plug-and-play manner, suggesting that retrieval provides a general mechanism for improving robustness under homology shift.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes RAMMESI, a retrieval-augmented multimodal framework for enzyme-substrate interaction (ESI) prediction. It learns explicit pairwise enzyme-substrate representations via directional cross-modal interaction modeling and adaptive fusion, employs an imbalance-aware weighted-BCE loss for sparse supervision, and at inference retrieves neighboring enzymes, recombines them with the query substrate, and aggregates the resulting pairwise predictions to improve robustness under low-homology distribution shift. Experiments on two ESI benchmarks using sequence-identity-aware splits report consistently strong performance with particular gains in low-identity regimes; the retrieval module is shown to improve multiple ESI backbones in a plug-and-play manner.

Significance. If the central claims hold after addressing mechanism-isolation concerns, the work would provide a practical, general mechanism for robustness under homology shift in ESI prediction, which is relevant for biocatalyst discovery. The use of sequence-identity-aware splits and the plug-and-play demonstration on multiple backbones are concrete strengths that would support broader applicability if the gains are shown to arise from contextual evidence rather than variance reduction.

major comments (2)
  1. [§3.3 and §5] The central claim that retrieval supplies useful contextual evidence improving robustness specifically under low-homology shift (abstract; §3.3 retrieval module; §5 experiments) is load-bearing yet lacks controls to distinguish this from averaging of correlated errors. The manuscript should specify the retrieval criterion (sequence identity, embedding similarity, or other), the source pool for neighbors, and include ablations such as random retrieval, retrieval from functionally unrelated enzymes, or comparison against simple ensemble averaging; without these, reported advantages on sequence-identity-aware splits could be explained by statistical effects rather than shift robustness.
  2. [§5] Table 2 (or equivalent results table) and the plug-and-play experiments: the reported gains on multiple backbones must be accompanied by statistical tests (e.g., paired t-tests or Wilcoxon across runs) and ablation of the aggregation function; current presentation leaves open whether improvements are consistent or driven by particular backbones or identity bins.
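The aggregation-function ablation requested in the second comment could compare variants along these lines. This is a sketch under assumptions: the three aggregators, the softmax `temperature`, and the toy scores and similarities are all hypothetical and are not taken from the manuscript.

```python
import math
from typing import Sequence


def mean_agg(scores: Sequence[float], sims: Sequence[float]) -> float:
    # Uniform average; ignores how similar each neighbor is to the query.
    return sum(scores) / len(scores)


def softmax_weighted_agg(scores: Sequence[float], sims: Sequence[float],
                         temperature: float = 0.1) -> float:
    # Weight each neighbor's prediction by a softmax over its similarity.
    top = max(sims)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / temperature) for s in sims]
    norm = sum(weights)
    return sum(w * x for w, x in zip(weights, scores)) / norm


def max_agg(scores: Sequence[float], sims: Sequence[float]) -> float:
    # Most optimistic neighbor wins.
    return max(scores)


# Toy neighbor predictions and their similarities to the query enzyme (made up).
scores = [0.9, 0.6, 0.2]
sims = [0.8, 0.5, 0.1]
results = {
    "mean": mean_agg(scores, sims),
    "softmax": softmax_weighted_agg(scores, sims),
    "max": max_agg(scores, sims),
}
```

When all similarities are equal the softmax-weighted variant reduces to the plain mean, so any gap between the two in an ablation is attributable to the similarity weighting itself.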
minor comments (2)
  1. [Abstract] While the high-level claims are clear, a single sentence summarizing the key quantitative improvements (e.g., average AUC or top-k recall deltas versus baselines in the <30% identity regime) would improve readability without lengthening the abstract.
  2. [§3.2] Notation in §3.2: the directional cross-modal interaction and adaptive fusion modules would benefit from an explicit equation for the final fused representation before the prediction head, to make the multimodal component self-contained.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will incorporate the requested controls, specifications, and statistical analyses in the revised manuscript to strengthen the evidence for the retrieval module.

point-by-point responses
  1. Referee: [§3.3 and §5] The central claim that retrieval supplies useful contextual evidence improving robustness specifically under low-homology shift (abstract; §3.3 retrieval module; §5 experiments) is load-bearing yet lacks controls to distinguish this from averaging of correlated errors. The manuscript should specify the retrieval criterion (sequence identity, embedding similarity, or other), the source pool for neighbors, and include ablations such as random retrieval, retrieval from functionally unrelated enzymes, or comparison against simple ensemble averaging; without these, reported advantages on sequence-identity-aware splits could be explained by statistical effects rather than shift robustness.

    Authors: We agree that isolating the contribution of contextual evidence from statistical averaging is important. The retrieval criterion in RAMMESI is sequence identity (to match the sequence-identity-aware splits), with neighbors drawn from the training-set enzyme pool. In the revision we will explicitly state this in §3.3 and add the requested ablations in §5: (i) random retrieval, (ii) retrieval restricted to functionally unrelated enzymes (EC-number mismatch), and (iii) comparison against simple ensemble averaging of multiple independent predictions. These controls will demonstrate that performance gains arise from relevant neighbors rather than variance reduction alone. revision: yes

  2. Referee: [§5] Table 2 (or equivalent results table) and the plug-and-play experiments: the reported gains on multiple backbones must be accompanied by statistical tests (e.g., paired t-tests or Wilcoxon across runs) and ablation of the aggregation function; current presentation leaves open whether improvements are consistent or driven by particular backbones or identity bins.

    Authors: We acknowledge the need for statistical rigor and further dissection of the aggregation step. In the revised §5 we will report paired t-tests (or Wilcoxon signed-rank tests) across independent runs for all plug-and-play results on multiple backbones. We will also add an ablation of the aggregation function (mean vs. weighted or attention-based aggregation) and present results broken down by sequence-identity bins to confirm that gains are consistent rather than driven by specific backbones or bins. revision: yes
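A paired comparison across runs, of the kind promised in the second response, might look like the sketch below. Two caveats apply. First, the per-seed numbers are invented purely to exercise the code and are not results from the paper. Second, instead of a Wilcoxon test this uses an exact sign-flip permutation test, a related nonparametric paired test that is easy to write with the standard library; the paired t statistic is computed alongside it without a p-value. Function names are hypothetical.

```python
import math
from itertools import product
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


def sign_flip_p_value(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact two-sided paired permutation test: flip the sign of each paired
    difference in every possible way and count how often the absolute sum
    is at least as large as the observed one."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            hits += 1
    return hits / total


# Invented per-seed scores for one backbone, with and without retrieval.
with_retrieval = [0.71, 0.69, 0.73, 0.70, 0.72]
without_retrieval = [0.68, 0.67, 0.70, 0.69, 0.69]

t_stat = paired_t_statistic(with_retrieval, without_retrieval)
p_value = sign_flip_p_value(with_retrieval, without_retrieval)
```

A design point worth noting: with five paired runs, the smallest two-sided p-value this exact test can produce is 2/32 = 0.0625, so a study relying on it needs at least six runs per configuration to clear a 0.05 threshold.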

Circularity Check

0 steps flagged

No significant circularity; empirical framework validated on benchmarks

full rationale

The paper introduces RAMMESI as a retrieval-augmented multimodal model for ESI prediction and reports its performance on two benchmarks under sequence-identity-aware splits. No derivation chain, equations, or first-principles results are presented that reduce to fitted inputs or self-citations. The retrieval mechanism is described as a plug-and-play addition whose benefit is shown experimentally rather than derived by construction from the inputs. The central claims rest on observed performance improvements rather than any self-referential loop, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The central claim rests on the unstated premise that the retrieval database contains sufficiently informative neighbors for the low-identity test cases.

pith-pipeline@v0.9.1-grok · 5737 in / 1222 out tokens · 18911 ms · 2026-06-26T09:17:01.189377+00:00 · methodology


The extracted manuscript carries the running header "Submitted to IEEE Transactions on Knowledge and Data Engineering".