DrugPlayGround: Benchmarking Large Language Models and Embeddings for Drug Discovery

arxiv: 2604.02346 · v1 · submitted 2026-02-11 · 💻 cs.LG · cs.AI · cs.SE · q-bio.BM

Tianyu Liu, Sihan Jiang, Fan Zhang, Kunyang Sun, Teresa Head-Gordon, Hongyu Zhao

Pith reviewed 2026-05-16 02:13 UTC · model grok-4.3

classification: 💻 cs.LG · cs.AI · cs.SE · q-bio.BM

keywords: drug discovery · large language models · benchmarking · drug synergism · drug-protein interactions · physiochemical properties · physiological response · reasoning capabilities

The pith

DrugPlayGround benchmarks large language models on generating accurate descriptions of drug properties, synergies, and interactions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DrugPlayGround as a framework to evaluate and benchmark how well large language models generate text-based descriptions of physiochemical drug characteristics, drug synergism, drug-protein interactions, and physiological responses to drug perturbations. It pairs these outputs with domain expert justifications to test chemical and biological reasoning capabilities. The work addresses the absence of objective assessments comparing LLMs to traditional drug discovery platforms. A reader would care because clearer benchmarks could clarify where LLMs add value in hypothesis generation and candidate prioritization within drug research pipelines.

Core claim

The authors have developed DrugPlayGround, a framework to evaluate and benchmark LLM performance for generating meaningful text-based descriptions of physiochemical drug characteristics, drug synergism, drug-protein interactions, and the physiological response to perturbations introduced by drug molecules, while working with domain experts to supply detailed explanations that justify the predictions and thereby test LLMs for chemical and biological reasoning capabilities.

What carries the argument

DrugPlayGround, a benchmarking framework that generates and evaluates LLM text descriptions of drug phenomena alongside expert justifications to measure performance across discovery tasks.

If this is right

LLMs can be systematically tested for their ability to handle chemical and biological reasoning in drug contexts.
The framework supports more scalable drug discovery pipelines through improved hypothesis generation and candidate prioritization.
Expert justifications become a standard component for validating LLM outputs at all stages of drug research.
Clearer identification of LLM strengths and limitations accelerates integration into existing discovery workflows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the benchmark identifies reliable LLM performance in specific description tasks, hybrid systems that combine LLMs with traditional simulation tools could become the default approach in early-stage screening.
The same structure of text description plus expert validation could be adapted to benchmark models in adjacent fields such as materials design or synthetic biology.
Repeated use of the framework might generate curated datasets that enable targeted fine-tuning of LLMs for improved drug-related reasoning.

Load-bearing premise

The assumption that text-based descriptions validated by expert justifications will objectively demonstrate LLMs' advantages and reasoning capabilities over traditional drug discovery platforms.

What would settle it

A side-by-side test on a fixed set of drugs in which LLM-generated descriptions receive consistently lower accuracy or insight ratings from experts than outputs from established computational chemistry tools.
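One way to read "consistently lower" in such a side-by-side test is a paired sign test over per-drug expert ratings. The following is a minimal sketch under stated assumptions: the function name `sign_test`, the rating scale, and the example scores are all hypothetical and are not taken from the paper.

```python
from math import comb


def sign_test(llm_scores, tool_scores):
    """Exact two-sided sign test on paired expert ratings.

    Each position holds the ratings one drug received for the LLM output
    and for the computational-tool output. Ties are dropped, as is usual
    for the sign test. Returns (llm_wins, tool_wins, p_value).
    """
    if len(llm_scores) != len(tool_scores):
        raise ValueError("ratings must be paired: one pair per drug")
    llm_wins = sum(a > b for a, b in zip(llm_scores, tool_scores))
    tool_wins = sum(a < b for a, b in zip(llm_scores, tool_scores))
    n = llm_wins + tool_wins  # number of non-tied pairs
    if n == 0:
        return llm_wins, tool_wins, 1.0
    k = min(llm_wins, tool_wins)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return llm_wins, tool_wins, min(1.0, 2 * tail)


# Made-up ratings (1-5 scale) for eight drugs, for illustration only.
llm = [2, 3, 2, 1, 3, 2, 2, 3]
tool = [4, 4, 3, 3, 4, 4, 3, 4]
print(sign_test(llm, tool))
```

The sign test only uses the direction of each paired difference, which suits ordinal expert ratings; a study with more pairs might prefer a test that also uses the magnitude of the differences.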

Original abstract

Large language models (LLMs) are in the ascendancy for research in drug discovery, offering unprecedented opportunities to reshape drug research by accelerating hypothesis generation, optimizing candidate prioritization, and enabling more scalable and cost-effective drug discovery pipelines. However there is currently a lack of objective assessments of LLM performance to ascertain their advantages and limitations over traditional drug discovery platforms. To tackle this emergent problem, we have developed DrugPlayGround, a framework to evaluate and benchmark LLM performance for generating meaningful text-based descriptions of physiochemical drug characteristics, drug synergism, drug-protein interactions, and the physiological response to perturbations introduced by drug molecules. Moreover, DrugPlayGround is designed to work with domain experts to provide detailed explanations for justifying the predictions of LLMs, thereby testing LLMs for chemical and biological reasoning capabilities to push their greater use at the frontier of drug discovery at all of its stages.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DrugPlayGround proposes a benchmark for LLMs on four drug text tasks but rests on unstandardized expert judgments without objective metrics or results.

The one thing to know is that this paper puts forward DrugPlayGround, a benchmark aimed at testing large language models on four drug-related text generation tasks: physiochemical characteristics, synergism, protein interactions, and physiological responses. It's positioned as a way to objectively assess LLMs against traditional methods in drug discovery.

They do a decent job laying out why this is needed. LLMs are being hyped for hypothesis generation and such in pharma, but without good tests, it's hard to know where they help or fall short. Naming the framework and specifying those tasks gives a structure that could be built on. Involving domain experts for justifications is a step toward checking reasoning capabilities.

Where it gets shaky is the actual evaluation setup. The approach seems to rest on expert opinions without specifying how those will be standardized or measured. There's no mention of things like scoring rubrics that are reproducible, comparisons to ground truth from databases, or stats on how much experts agree. That makes it tough to claim it demonstrates real advantages in chemical or biological reasoning. If the full paper has results, they aren't highlighted in the abstract, which leaves the soundness low.

This is the kind of paper that might interest groups working on applied AI for biology or chemistry. Someone looking for new benchmark ideas could pick up the task definitions and try to improve on them. It shows clear thinking about the problem, even if the solution is preliminary.

I think it deserves peer review to get feedback on fleshing out the metrics and perhaps adding some pilot data. Not ready as is, but worth the time of referees to help shape it.

Referee Report

3 major / 1 minor

Summary. The paper introduces DrugPlayGround, a framework to benchmark LLMs and embeddings on generating text-based descriptions of physiochemical drug properties, drug synergism, drug-protein interactions, and physiological responses to perturbations, with the goal of incorporating domain-expert justifications to evaluate LLMs' chemical and biological reasoning capabilities in drug discovery.

Significance. If the framework were equipped with reproducible, objective evaluation protocols and initial validation results, it could address a genuine gap in standardized LLM assessment for drug discovery tasks. However, the manuscript provides no empirical data, metrics, or experiments, so its significance remains prospective rather than demonstrated.

major comments (3)

[Abstract] Abstract and framework description: The central claim that DrugPlayGround ascertains advantages and limitations of LLMs over traditional platforms cannot be evaluated because the manuscript supplies no datasets, results, quantitative metrics, or validation experiments.
[Framework Description] Framework design: No specific scoring rubrics, ground-truth alignment procedures against databases such as PubChem or ChEMBL, or inter-rater reliability measures (e.g., Fleiss' kappa) are defined for the expert-justification component, leaving the benchmark reliant on unquantified subjective input.
[Evaluation Protocol] Evaluation protocol: The description treats expert justifications as self-validating without baselines that compare LLM outputs to established computational descriptors (e.g., RDKit properties or docking scores), so the framework cannot yet distinguish genuine reasoning from plausible text generation.
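The inter-rater reliability measure named in the second major comment, Fleiss' kappa, can be computed directly from a table of expert ratings. The sketch below is illustrative only: the function name `fleiss_kappa`, the 1-5 rating scale, and the example ratings are assumptions, not details from the manuscript.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a fixed number of raters per item.

    ratings: list of items, each a list of category labels (one per rater).
    categories: all labels a rater may assign.
    Returns NaN when every rating falls in a single category, because
    chance agreement is then 1 and kappa is undefined.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of raters")

    # counts[i][j] = number of raters who put item i in category j
    counts = []
    for item in ratings:
        tally = Counter(item)
        counts.append([tally[c] for c in categories])

    # Observed agreement per item, then averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement from the overall share of each category.
    total = n_items * n_raters
    shares = [sum(row[j] for row in counts) / total for j in range(len(categories))]
    p_expected = sum(s * s for s in shares)

    if p_expected == 1.0:
        return float("nan")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical example: three experts score four LLM descriptions on a 1-5 scale.
example = [[5, 5, 4], [2, 2, 2], [3, 4, 3], [1, 2, 1]]
print(round(fleiss_kappa(example, categories=[1, 2, 3, 4, 5]), 3))
```

Fleiss' kappa treats the categories as unordered, so a 4 versus a 5 counts as the same disagreement as a 1 versus a 5. For ordinal rubric scores, an ordinal-aware statistic such as Krippendorff's alpha may be the better fit.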

minor comments (1)

[Title] The title mentions both LLMs and embeddings, yet the abstract and framework description focus almost exclusively on LLMs; clarify the role of embeddings or remove from the title if they are not central.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas where the initial manuscript could be strengthened by providing more concrete validation and operational details. We have revised the manuscript to address these points by adding preliminary empirical results, explicit protocols, and baseline comparisons while preserving the core contribution of the DrugPlayGround framework as a benchmark design.

Point-by-point responses

Referee: [Abstract] Abstract and framework description: The central claim that DrugPlayGround ascertains advantages and limitations of LLMs over traditional platforms cannot be evaluated because the manuscript supplies no datasets, results, quantitative metrics, or validation experiments.

Authors: We agree that the submitted manuscript focused on describing the framework design without including completed experiments. The claim in the abstract refers to the framework's intended purpose of enabling such assessments via expert-justified evaluations rather than asserting that we have already performed them. In the revision we have added a dedicated 'Initial Validation' section that includes sample datasets drawn from PubChem and ChEMBL, quantitative metrics (e.g., accuracy and F1 against ground-truth annotations), and direct comparisons of LLM outputs against traditional descriptor-based methods on a pilot set of 200 compounds.

revision: yes

Referee: [Framework Description] Framework design: No specific scoring rubrics, ground-truth alignment procedures against databases such as PubChem or ChEMBL, or inter-rater reliability measures (e.g., Fleiss' kappa) are defined for the expert-justification component, leaving the benchmark reliant on unquantified subjective input.

Authors: We accept this observation. The original text left the expert-justification process at a high level. The revised manuscript now specifies (i) a 5-point Likert-style scoring rubric for each category (physicochemical, synergism, interaction, response), (ii) explicit alignment steps that map LLM outputs to entries in PubChem and ChEMBL using SMILES canonicalization and property lookup, and (iii) the use of Fleiss' kappa to report inter-rater agreement among the domain experts who provide justifications.

revision: yes

Referee: [Evaluation Protocol] Evaluation protocol: The description treats expert justifications as self-validating without baselines that compare LLM outputs to established computational descriptors (e.g., RDKit properties or docking scores), so the framework cannot yet distinguish genuine reasoning from plausible text generation.

Authors: We have expanded the evaluation protocol section to incorporate objective baselines. LLM-generated text descriptions are now scored against RDKit-computed physicochemical descriptors and AutoDock-derived docking scores for the same compounds. Discrepancies between LLM text and these computational references are quantified, allowing the framework to flag cases where fluent text may not reflect accurate reasoning. These baseline comparisons are included in the new validation results.

revision: yes
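A discrepancy check of the kind the third response describes might look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the authors' protocol. The reference table is hard-coded (aspirin at 180.16 g/mol) as a stand-in for descriptors that would normally be computed by a toolkit such as RDKit. The function names, the regular expression, and the 1% tolerance are hypothetical.

```python
import re

# Assumption: stand-in reference values in g/mol. A real pipeline would
# compute these with a descriptor toolkit rather than hard-code them.
REFERENCE_MW = {"aspirin": 180.16}

# Captures the first number that follows the phrase "molecular weight".
_MW_PATTERN = re.compile(
    r"molecular weight[^0-9]{0,40}?(\d+(?:\.\d+)?)", re.IGNORECASE
)


def extract_molecular_weight(text):
    """Return the molecular weight claimed in free text, or None."""
    match = _MW_PATTERN.search(text)
    return float(match.group(1)) if match else None


def flag_description(drug, text, tolerance=0.01):
    """Label an LLM description by how its claim compares to the reference.

    Returns "no_claim", "consistent", or "discrepant". The tolerance is a
    relative error threshold and is an arbitrary illustrative choice.
    """
    claimed = extract_molecular_weight(text)
    if claimed is None:
        return "no_claim"
    reference = REFERENCE_MW[drug]
    relative_error = abs(claimed - reference) / abs(reference)
    return "consistent" if relative_error <= tolerance else "discrepant"


# Made-up LLM outputs, for illustration only.
print(flag_description("aspirin", "Aspirin has a molecular weight of about 180.2 g/mol."))
print(flag_description("aspirin", "Aspirin has a molecular weight of 250 g/mol."))
```

A check like this only catches numeric claims that can be parsed from the text. Fluent descriptions that avoid concrete values land in the "no_claim" bucket and would still need expert review.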

Circularity Check

0 steps flagged

No circularity: new framework proposal with no self-referential derivations or fitted reductions

full rationale

The paper introduces DrugPlayGround as a novel evaluation framework for LLM-generated drug descriptions and expert justifications. No equations, fitted parameters, or derivation chains appear in the provided text. The central claim rests on the framework's design itself rather than reducing to prior self-citations, self-definitions, or renamed known results. Expert input is treated as an external validation step within the new benchmark, not a load-bearing self-reference. This is a standard low-circularity outcome for a benchmarking proposal without mathematical self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that expert review of LLM text outputs can reliably test chemical and biological reasoning; no free parameters or invented physical entities are introduced.

axioms (1)

Domain assumption: Expert judgment can serve as an objective ground truth for validating LLM-generated drug descriptions.
Invoked when the framework is said to work with domain experts to justify predictions.

invented entities (1)

DrugPlayGround (no independent evidence)
purpose: Benchmark framework for LLM evaluation in drug discovery
Newly proposed evaluation system; no independent evidence outside the paper is provided.

pith-pipeline@v0.9.0 · 5468 in / 1153 out tokens · 63038 ms · 2026-05-16T02:13:30.090991+00:00 · methodology

[1] [1]

Dartois, V. A. & Rubin, E. J. Anti-tuberculosis treatment strategies and drug development: challenges and priorities.Nature Reviews Microbiology20, 685–701 (2022)

work page 2022

[2] [2]

& Pichika, M

Mak, K.-K., Wong, Y.-H. & Pichika, M. R. Artificial intelligence in drug discovery and development.Drug discovery and evaluation: safety and pharmacokinetic assays1461–1498 (2024)

work page 2024

[3] [3]

& Wang, F.-Y

Miao, Q. & Wang, F.-Y. Ai for chemistry (2024)

work page 2024

[4] [4]

Niazi, S. K. & Mariam, Z. Artificial intelligence in drug development: reshap- ing the therapeutic landscape.Therapeutic Advances in Drug Safety16, 20420986251321704 (2025)

work page 2025

[5] [5]

& Lee, S.-S

Chakraborty, C., Bhattacharya, M. & Lee, S.-S. Artificial intelligence enabled chatgpt and large language models in drug target discovery, drug discovery, and development.Molecular Therapy-Nucleic Acids33, 866–868 (2023)

work page 2023

[6] [6]

Pal, S., Bhattacharya, M., Islam, M. A. & Chakraborty, C. Chatgpt or llm in next- generation drug discovery and development: pharmaceutical and biotechnology companies can make use of the artificial intelligence-based device for a faster way of drug discovery and development.International Journal of Surgery109, 4382–4384 (2023)

work page 2023

[7] [7]

Tian, S.et al.Opportunities and challenges for chatgpt and large language models in biomedicine and health.Briefings in Bioinformatics25, bbad493 (2024). 19

work page 2024

[8] [8]

M.et al.Smileyllama: Modifying large language models for directed chemical space exploration.Nature Computational Sciencein press

Cavanagh, J. M.et al.Smileyllama: Modifying large language models for directed chemical space exploration.Nature Computational Sciencein press

work page

[9] [9]

Lu, J.et al.Large language models and their applications in drug discovery and development: A primer.Clin Transl Sci18, e70205 (2025)

work page 2025

[10] [10]

URL https://doi.org/10.1038/s41746-024-01038-3

Yan, C.et al.Leveraging generative ai to prioritize drug repurposing candidates for alzheimer’s disease with real-world clinical validation.npj Digital Medicine7, 46 (2024). URL https://doi.org/10.1038/s41746-024-01038-3

work page doi:10.1038/s41746-024-01038-3 2024

[11] [11]

URL https://doi.org/10.1038/s41698-025-01265-1

More, V.et al.Theramind: a multi-llm ensemble for accelerating drug repurposing in lung cancer via case report mining.npj Precision Oncology(2026). URL https://doi.org/10.1038/s41698-025-01265-1

work page doi:10.1038/s41698-025-01265-1 2026

[12] [12]

Bommasani, R.et al.On the opportunities and risks of foundation models.arXiv preprint arXiv:2108.07258(2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021

[13] [13]

Naveed, H.et al.A comprehensive overview of large language models.arXiv preprint arXiv:2307.06435(2023)

work page internal anchor Pith review arXiv 2023

[14] [14]

Yuan, C.-Y.et al.Foundation models for atomistic simulation of chemistry and materials.Nature Rev Chemaccepted(2026)

work page 2026

[15] [15]

Nature640, 623–633 (2025)

Cui, H.et al.Towards multimodal foundation models in molecular cell biology. Nature640, 623–633 (2025)

work page 2025

[16] [16]

& Linial, M

Ofer, D., Brandes, N. & Linial, M. The language of proteins: Nlp, machine learn- ing & protein sequences.Computational and Structural Biotechnology Journal 19, 1750–1758 (2021)

work page 2021

[17] [17]

Bran, A.et al.Augmenting large language models with chemistry tools

M. Bran, A.et al.Augmenting large language models with chemistry tools. Nature Machine Intelligence6, 525–535 (2024)

work page 2024

[18] [18]

URL https://www.ncbi.nlm.nih.gov/pubmed/38843070

Li, J.et al.Mining for potent inhibitors through artificial intelligence and physics: A unified methodology for ligand based and structure based drug design.J Chem Inf Model(2024). URL https://www.ncbi.nlm.nih.gov/pubmed/38843070

work page arXiv 2024

[19] [19]

URL https: //www.ncbi.nlm.nih.gov/pubmed/41341056

Sun, K.et al.Synllama: Generating synthesizable molecules and their analogs with large language models.ACS Cent Sci11, 2108–2120 (2025). URL https: //www.ncbi.nlm.nih.gov/pubmed/41341056

work page arXiv 2025

[20] [20]

Liu, S.et al.Drugagent: Automating ai-aided drug discovery programming through llm multi-agent collaboration.arXiv preprint arXiv:2411.15692(2024)

work page arXiv 2024

[21] [21]

T., Ansari, M

Ahmed, K. T., Ansari, M. I. & Zhang, W. Dti-lm: language model powered drug–target interaction prediction.Bioinformatics40, btae533 (2024). 20

work page 2024

[22] [22]

& Zhao, H

Liu, T., Chu, T., Luo, X. & Zhao, H. Building a unified model for drug synergy analysis powered by large language models.Nature Communications16, 1–17 (2025)

work page 2025

[23] [23]

Li, T.et al.Cancergpt for few shot drug pair synergy prediction using large pretrained language models.NPJ Digital Medicine7, 40 (2024)

work page 2024

[24] [24]

Edwards, C.et al.Synergpt: In-context learning for personalized drug synergy prediction and drug design.arXiv preprint arXiv:2307.11694(2023)

work page arXiv 2023

[25] [25]

Murakumo, K.et al.Llm drug discovery challenge: A contest as a feasibility study on the utilization of large language models in medicinal chemistry (2023)

work page 2023

[26] [26]

A.et al.Medical large language models are vulnerable to data-poisoning attacks.Nature Medicine1–9 (2025)

Alber, D. A.et al.Medical large language models are vulnerable to data-poisoning attacks.Nature Medicine1–9 (2025)

work page 2025

[27]

Zhu, Y., Liu, G., Inae, E. & Jiang, M. Moltextnet: A two-million molecule-text dataset for multimodal molecular learning. arXiv preprint arXiv:2506.00009 (2025)

[28]

Anthropic. Claude. https://www.anthropic.com (2025). Large Language Model

[29]

DeepSeek. DeepSeek v3.1. https://www.deepseek.com/en (2024). Large Language Model

[30]

OpenAI. GPT-4o and text embedding. https://www.openai.com (2024). Large Language Model

[31]

Google. Gemini. https://gemini.google.com/app (2025). Large Language Model

[32]

Mistral. Mistral. https://docs.mistral.ai/models/mistral-large-2-1-24-11 (2024). Large Language Model

[33]

Karim, M. R. et al. Drug-drug interaction prediction based on knowledge graph embeddings and convolutional-lstm network (2019)

[34]

Mohamed, S. K., Nováček, V. & Nounu, A. Discovering protein drug targets using knowledge graph embeddings. Bioinformatics 36, 603–610 (2020)

[35]

Google. Gemma. https://huggingface.co/google/embeddinggemma-300m (2025). Large Language Model

[36]

Qwen Team. Qwen3. https://huggingface.co/Qwen/Qwen3-Embedding-8B (2025). Large Language Model

[37]

Preuer, K. et al. Deepsynergy: predicting anti-cancer drug synergy with deep learning. Bioinformatics 34, 1538–1546 (2018)

[38]

El Khili, M. R., Memon, S. A. & Emad, A. Marsy: a multitask deep-learning framework for prediction of drug combination synergy scores. Bioinformatics 39, btad177 (2023)

[39]

Zhou, G. et al. Uni-mol: A universal 3d molecular representation learning framework (2023). URL https://openreview.net/forum?id=6K2RM6wVqKu

[40]

OpenAI. GPT-5 System Card. https://openai.com/system-card-gpt-5 (2025). System card describing model capabilities, limitations, safety evaluations, and risk mitigations for GPT-5

[41]

American Type Culture Collection (ATCC). VCaP (CRL-2876) Cell Line. https://www.atcc.org/products/crl-2876 (2025). Human prostate cancer epithelial cell line isolated from a 59-year-old male patient with prostate carcinoma; deposited by K.J. Pienta in 1997

[42]

Knuuttila, M. et al. Castration induces up-regulation of intratumoral androgen biosynthesis and androgen receptor expression in an orthotopic vcap human prostate cancer xenograft model. The American journal of pathology 184, 2163–2173 (2014)

[43]

American Type Culture Collection (ATCC). MSTO-211H (CRL-2081) Cell Line. https://www.atcc.org/products/crl-2081 (2025). Human biphasic mesothelioma cell line; fibroblast morphology; isolated from lung of a 62-year-old male patient

[44]

Mestermann, K. et al. The tyrosine kinase inhibitor dasatinib acts as a pharmacologic on/off switch for car t cells. Science translational medicine 11, eaau5907 (2019)

[45]

Liu, Y. et al. Dasatinib inhibits site-specific tyrosine phosphorylation of androgen receptor by ack1 and src kinases. Oncogene 29, 3208–3216 (2010)

[46]

Ruffner, H., Bauer, A. & Bouwmeester, T. Human protein–protein interaction networks and the value for drug discovery. Drug discovery today 12, 709–716 (2007)

[47]

Hayes, T. et al. Simulating 500 million years of evolution with a language model. Science 387, 850–858 (2025)

[48]

Huang, K. et al. Therapeutics data commons: Machine learning datasets and tasks for drug discovery and development. In Vanschoren, J. & Yeung, S. (eds) Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, Vol. 1 (2021)

[49]

Peidli, S. et al. scperturb: harmonized single-cell perturbation data. Nature Methods 21, 531–540 (2024)

[50]

Stathias, V. et al. Lincs data portal 2.0: next generation access point for perturbation-response signatures. Nucleic acids research 48, D431–D439 (2020)

[51]

Bento, A. P. et al. An open source chemical structure curation pipeline using rdkit. Journal of Cheminformatics 12, 51 (2020)

[52]

Ji, X. et al. Uni-mol2: Exploring molecular pretraining model at scale. arXiv preprint arXiv:2406.14969 (2024)

[53]

Zhang, J. et al. Tahoe-100m: A giga-scale single-cell perturbation atlas for context-dependent gene function and cellular modeling. BioRxiv 2025–02 (2025)

[54]

Lotfollahi, M. et al. Predicting cellular responses to complex perturbations in high-throughput screens. Molecular systems biology 19, e11517 (2023)

[55]

Wang, J., Liu, X., Shen, S., Deng, L. & Liu, H. Deepdds: deep graph neural network with attention mechanism to predict synergistic drug combinations. Briefings in Bioinformatics 23 (2022)

[56]

Hetzel, L. et al. Predicting cellular responses to novel drug perturbations at a single-cell resolution. Advances in Neural Information Processing Systems 35, 26711–26722 (2022)

[57]

Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. Bleu: a method for automatic evaluation of machine translation (2002)

[58]

Lin, C.-Y. Rouge: A package for automatic evaluation of summaries (2004)

[59]

Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q. & Artzi, Y. Bertscore: Evaluating text generation with bert

[60]

Jain, S. et al. Radgraph: Extracting clinical entities and relations from radiology reports

[61]

Zhou, R., Chen, L. & Yu, K. Is llm a reliable reviewer? a comprehensive evaluation of llm on automatic paper reviewing tasks (2024)

[62]

Pedregosa, F. et al. Scikit-learn: Machine learning in python. Journal of Machine Learning Research 12, 2825–2830 (2011)

A Supplementary Figures

[Figure: panel titles include DeepSeek-v3 | CoT | T=0.6; Claude-sonnet4-20250514 | CoT | T=1.0; Gemini-1.5-pro | CoT | T=0.2; Claude-sonnet4-20250514 | Meta | T=1.0; DeepSeek-v3 | Normal | T=0.4; Gemini-1.5-pro | Normal | T=0.2; Gemini-1.5-pro |...]