
arxiv: 2503.17656 · v4 · submitted 2025-03-22 · 🧬 q-bio.QM · cs.AI · cs.LG

Pretraining a Foundation Model for Small-Molecule Natural Products

Pith reviewed 2026-05-22 23:30 UTC · model grok-4.3

classification 🧬 q-bio.QM · cs.AI · cs.LG
keywords natural products · foundation model · pretraining · contrastive learning · masked graph learning · drug discovery · molecular representations · taxonomy classification

The pith

Pretraining with scaffold-focused contrastive and masked graph learning produces representations that reach state-of-the-art on natural product mining and drug discovery tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that supervised, task-specific models for natural products lack generalizability and that existing molecular methods fail to handle their unique evolutionary and structural traits. The authors therefore pretrain a foundation model using contrastive learning and masked graph objectives that deliberately highlight evolutionary information in molecular scaffolds while still encoding side-chain details. If this strategy succeeds, the resulting representations should transfer across taxonomy classification, gene-level and microbe-level evolutionary analysis, and virtual screening without retraining from scratch for each new task. A reader would care because natural products supply many drug leads, and a single reusable model could reduce the need for separate labeled datasets on every downstream problem.

Core claim

The authors pretrain a foundation model for small-molecule natural products. Their pretraining strategy combines contrastive learning and masked graph learning objectives that emphasize evolutionary information from molecular scaffolds while capturing side-chain information. The resulting model achieves state-of-the-art performance on taxonomy classification, fine-grained evolutionary analysis at gene and microbial levels, and virtual screening for drug candidates, outperforming both synthesized-molecule baselines and standard supervised approaches.

What carries the argument

The novel pretraining strategy that combines contrastive learning and masked graph learning to emphasize evolutionary scaffold information.
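
To make the combined objective concrete, here is a minimal sketch of how a contrastive term and a masked-prediction term are commonly summed into one pretraining loss. It is not NaFM's implementation: the InfoNCE form, the temperature, the weight `lam`, and every function name are assumptions chosen for illustration, and the abstract does not say how the scaffold emphasis enters the contrastive term.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: small when the anchor is closer to the positive
    than to any negative. The temperature value is an assumption."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]

def masked_cross_entropy(pred_probs, true_labels):
    """Mean negative log-likelihood of the true label at each masked node.
    pred_probs[i] is a probability vector; true_labels[i] is an index."""
    losses = [-math.log(p[y]) for p, y in zip(pred_probs, true_labels)]
    return sum(losses) / len(losses)

def pretraining_loss(anchor, positive, negatives, pred_probs, true_labels, lam=1.0):
    """Weighted sum of the two objectives; lam is a hypothetical weight."""
    contrastive = info_nce(anchor, positive, negatives)
    masked = masked_cross_entropy(pred_probs, true_labels)
    return contrastive + lam * masked
```

In a scaffold-focused setup, one plausible choice is to build the positive view from a molecule that shares the anchor's scaffold, while the masked term is computed over masked atoms or substructures. Both choices are guesses here, not details taken from the paper.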

If this is right

  • Taxonomy classification comparisons show that models built for synthesized molecules are inadequate for understanding natural synthesis.
  • The model captures evolutionary information at both gene and microbial levels.
  • Virtual screening experiments show the representations help identify potential drug candidates more effectively.
  • The approach moves beyond one-model-for-one-task paradigms to a more generalizable foundation model.
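
One common way to reuse fixed representations for ligand-based virtual screening is to rank a candidate library by embedding similarity to a known active. The sketch below illustrates that idea only. The paper's actual screening protocol is not described on this page, and `rank_candidates`, its parameters, and the dictionary layout are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_candidates(query_embedding, candidate_embeddings, top_k=5):
    """Return up to top_k (name, score) pairs, most similar first.

    candidate_embeddings maps a candidate name to its embedding, assumed
    to come from the same pretrained encoder as the query embedding.
    """
    scored = [
        (name, cosine(query_embedding, emb))
        for name, emb in candidate_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Because the encoder is reused unchanged, the same embeddings could in principle feed the taxonomy and evolutionary-analysis tasks as well, which is the practical meaning of moving beyond one model per task.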

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same scaffold-focused pretraining might improve prediction of natural-product interactions with human targets not examined in the reported experiments.
  • Combining the learned representations with genomic sequence data could enable more accurate microbial source attribution for newly discovered metabolites.
  • Evaluating the model on larger or chemically more diverse natural-product collections would test whether the reported gains persist outside the current evaluation sets.

Load-bearing premise

That a pretraining strategy focused on evolutionary scaffold information will produce more generalizable representations than standard supervised or general-molecule models for natural product tasks.

What would settle it

A competing model trained without the scaffold-emphasizing pretraining objectives that matches or exceeds the reported performance on the same taxonomy classification, evolutionary analysis, and virtual screening benchmarks would falsify the central claim.

Original abstract

Natural products, as metabolites from microorganisms, animals, or plants, exhibit diverse biological activities, making them crucial for drug discovery. Nowadays, existing deep learning methods for natural products research primarily rely on supervised learning approaches designed for specific downstream tasks. However, such one-model-for-a-task paradigm often lacks generalizability and leaves significant room for performance improvement. Additionally, existing molecular characterization methods are not well-suited for the unique tasks associated with natural products. To address these limitations, we have pre-trained a foundation model for natural products based on their unique properties. Our approach employs a novel pretraining strategy that is especially tailored to natural products. By incorporating contrastive learning and masked graph learning objectives, we emphasize evolutional information from molecular scaffolds while capturing side-chain information. Our framework achieves state-of-the-art (SOTA) results in various downstream tasks related to natural product mining and drug discovery. We first compare taxonomy classification with synthesized molecule-focused baselines to demonstrate that current models are inadequate for understanding natural synthesis. Furthermore, by diving into a fine-grained analysis at both the gene and microbial levels, NaFM demonstrates the ability to capture evolutionary information. Eventually, our method is experimented with virtual screening, illustrating informative natural product representations that can lead to more effective identification of potential drug candidates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces NaFM, a foundation model for small-molecule natural products pretrained via a novel combination of contrastive learning (emphasizing evolutionary information from molecular scaffolds) and masked graph learning (capturing side-chain information). It claims this addresses limitations of supervised task-specific models and general molecular characterization methods, demonstrating SOTA results on taxonomy classification (vs. synthesized-molecule baselines), fine-grained gene- and microbial-level evolutionary analysis, and virtual screening for drug discovery.

Significance. If the performance gains are shown to arise specifically from the scaffold-evolutionary contrastive objective rather than domain-specific training data alone, the work could provide more generalizable representations for natural-product tasks and improve downstream applications in drug discovery.

major comments (1)
  1. [Abstract (downstream evaluation description) and Results (taxonomy classification comparison)] The central SOTA claim on taxonomy classification, gene-level analysis, and virtual screening requires evidence that the evolutionary contrastive term contributes beyond standard pretraining on the same natural-product structures; the manuscript provides no ablations or controls that isolate this objective from simply training a GNN on natural-product data with conventional objectives.
minor comments (1)
  1. [Abstract] The abstract asserts SOTA performance but supplies no quantitative metrics, baseline descriptions, dataset sizes, or statistical details, which should be added to the main text or a dedicated results table for reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed and constructive review. The primary concern regarding the need to isolate the contribution of the scaffold-focused contrastive objective is well-taken, and we address it directly below.

Point-by-point responses
  1. Referee: [Abstract (downstream evaluation description) and Results (taxonomy classification comparison)] The central SOTA claim on taxonomy classification, gene-level analysis, and virtual screening requires evidence that the evolutionary contrastive term contributes beyond standard pretraining on the same natural-product structures; the manuscript provides no ablations or controls that isolate this objective from simply training a GNN on natural-product data with conventional objectives.

    Authors: We agree that explicit ablations are required to demonstrate that performance gains arise specifically from the evolutionary contrastive term rather than from domain-specific natural-product data alone. Our existing comparisons to synthesized-molecule baselines establish that general molecular models are inadequate for natural-product tasks, but they do not isolate the effect of the contrastive objective within the natural-product domain. In the revised manuscript we will add the requested controls: (i) a GNN pretrained on the identical natural-product corpus using only the masked-graph objective, and (ii) direct head-to-head comparisons of this baseline against the full NaFM objective on taxonomy classification, gene/microbial analysis, and virtual screening. These results will be reported in a new subsection of the Results and discussed in the context of the referee’s point.

    revision: yes
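
The head-to-head control described in the rebuttal amounts to a paired comparison: the same seeds and data splits, evaluated once with the full objective and once with the masked-graph objective alone. A minimal sketch of how such results might be summarized follows. The function names, the `"full"` and `"masked_only"` keys, and the per-seed score layout are assumptions for illustration, and no numbers from the paper are used.

```python
from statistics import mean, stdev

def paired_gain(full_scores, ablated_scores):
    """Per-seed difference (full minus ablated), summarized by mean and
    sample standard deviation. Lists must be aligned by seed."""
    if len(full_scores) != len(ablated_scores):
        raise ValueError("score lists must be paired by seed")
    diffs = [f - a for f, a in zip(full_scores, ablated_scores)]
    spread = stdev(diffs) if len(diffs) > 1 else 0.0
    return {"mean_gain": mean(diffs), "std_gain": spread, "n": len(diffs)}

def summarize_ablation(results):
    """results maps a task name to {"full": [...], "masked_only": [...]},
    each a list of per-seed scores on the same metric."""
    return {
        task: paired_gain(scores["full"], scores["masked_only"])
        for task, scores in results.items()
    }
```

A mean gain that stays clearly positive across taxonomy classification, gene- and microbial-level analysis, and virtual screening would support the claim that the contrastive term matters. A gain near zero would point to the natural-product training data as the main driver.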

Circularity Check

0 steps flagged

No circularity: purely empirical pretraining and downstream evaluation

Full rationale

The paper presents an empirical framework that pretrains a graph neural network on natural-product structures using contrastive and masked objectives, then evaluates on taxonomy classification, gene-level analysis, and virtual screening tasks. No equations, derivations, or first-principles claims appear; performance is measured by standard supervised fine-tuning metrics on held-out data. The central premise that the tailored objectives capture evolutionary scaffold information is tested empirically through comparisons to baselines, not asserted by construction or reduced to fitted parameters. Self-citations, if present, are not load-bearing for any mathematical step. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the model presumably inherits standard graph-neural-network hyperparameters and contrastive-loss formulations from prior work, but none are enumerated here.


