arxiv: 2604.26498 · v2 · pith:3JNQYK2Snew · submitted 2026-04-29 · 💻 cs.LG · q-bio.QM

Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction

Pith reviewed 2026-05-19 17:37 UTC · model grok-4.3

classification 💻 cs.LG q-bio.QM
keywords molecular property prediction · drug discovery · model scaling · graph neural networks · large language models · cheminformatics · benchmark · AI for chemistry

The pith

Classical ML models outperform larger pretrained and LLM approaches in most molecular prediction tasks for drug discovery

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper challenges the idea that bigger models will automatically win in AI-driven drug discovery by running a large-scale benchmark. It evaluates classical machine learning, graph neural networks, pretrained sequence models, and LLM-based baselines on 26 endpoints spanning ADME properties, toxicity, and bioactivity. The tests use 78 endpoint-split combinations with random, Murcko scaffold, and structure-separated 5-fold cross-validation to simulate easy retrospective checks through to hard novel-chemotype scenarios. Across 156 comparisons, compact classical models win the large majority of cases, showing that performance depends on matching model family to task and split difficulty rather than on increasing scale.

Core claim

Across 78 endpoint and split entries for molecular properties, toxicity, safety liabilities and biological activity, classical ML such as RF(ECFP4) and ExtraTrees(RDKit) win 116 out of 156 fold-mean comparisons, GNNs such as GIN and Ligandformer win 25, pretrained sequence models such as MoLFormer and ChemBERTa2 win 12, and LLM-based SAR baselines win three. ML dominates random-split interpolation but loses part of this advantage under harder splits; GNN and sequence models also decline but gain relative ground, whereas LLM-based SAR is weaker in absolute terms yet less sensitive to the split axis. SAR knowledge derived from training folds improves many GPT5.5-SAR and Opus4.7-SAR metrics but does not make rule-based reasoning a universal substitute for supervised predictors.

What carries the argument

The tiered cross-validation benchmark using random, Murcko scaffold and structure-separated 5-fold splits on 26 endpoints grouped into ADME, toxicity and bioactivity classes to compare four model families under increasing generalization demands
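
A scaffold-grouped split of the kind described above can be sketched as follows. This is a minimal illustration, not the paper's splitting code: the function name `grouped_kfold`, the greedy size-balancing rule, and the example scaffold strings are all assumptions, and in practice the group keys would come from a cheminformatics toolkit such as RDKit.

```python
from collections import defaultdict

def grouped_kfold(group_keys, n_folds=5):
    """Assign each item to a fold so that items sharing a group key stay together.

    group_keys: one key per molecule (for example a precomputed Murcko
    scaffold string). Returns a list of fold indices, one per molecule.
    """
    groups = defaultdict(list)
    for idx, key in enumerate(group_keys):
        groups[key].append(idx)

    fold_sizes = [0] * n_folds
    assignment = [None] * len(group_keys)
    # Assumption: place the largest groups first, each into the currently
    # smallest fold, to keep fold sizes roughly balanced.
    for key in sorted(groups, key=lambda k: (-len(groups[k]), k)):
        target = fold_sizes.index(min(fold_sizes))
        for idx in groups[key]:
            assignment[idx] = target
        fold_sizes[target] += len(groups[key])
    return assignment

# Hypothetical scaffold keys for ten molecules (illustrative only).
scaffolds = ["c1ccccc1", "c1ccccc1", "c1ccncc1", "C1CCCCC1", "c1ccncc1",
             "c1ccc2ccccc2c1", "C1CCNCC1", "c1ccccc1", "C1CCOC1", "c1ccsc1"]
folds = grouped_kfold(scaffolds, n_folds=5)
```

Because whole scaffold groups move together, no scaffold appears in both the training and test portion of any fold. A random split drops that constraint, while a structure-separated split would presumably group by a similarity-based cluster key instead of the scaffold.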

If this is right

  • Classical ML models achieve highest accuracy on easier random splits but their lead narrows on scaffold and structure-separated splits
  • GNNs and pretrained sequence models lose ground in absolute terms on harder splits yet improve their relative ranking against classical ML
  • LLM-based SAR baselines deliver lower absolute performance but remain more stable when split difficulty increases
  • Incorporating SAR knowledge from the training folds raises LLM metrics without turning rule-based reasoning into a universal replacement for supervised predictors
  • Overall success depends on the fit between model family, task type and validation scenario rather than on model scale

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams running routine high-throughput screens could default to fast classical models to reduce compute costs without sacrificing accuracy
  • LLMs may still add value in very low-data regimes for generating SAR hypotheses even if they trail on raw prediction metrics
  • Future work should test these patterns on time-based or truly prospective splits to check whether the observed family rankings hold in live discovery campaigns

Load-bearing premise

The 78 endpoint and split entries using random, Murcko scaffold and structure-separated 5-fold CV adequately represent the spectrum of real-world drug discovery challenges from closed-library retrospective evaluation to novel chemotype library expansion

What would settle it

A follow-up study on a fresh set of endpoints or on prospectively collected compounds where larger pretrained or LLM models consistently beat classical ML across all three split types would show the scaling assumption holds after all

Figures

Figures reproduced from arXiv: 2604.26498 by Jinjiang Guo.

Figure 1: Model taxonomy for the benchmark. The figure summarizes small ML, GNN, pretrained…
Figure 2: Molecular representation pathways compared in the benchmark. Fingerprints and de…
Figure 3: Structure-similarity-separated five-fold cross-validation workflow. Molecules are stan…
Figure 4: Proportional summary of model-family wins across ADMET, Tox21 and anti-infective…
Figure 5: Effect of train-fold-derived SAR knowledge on LLM-SAR performance across task groups…
Original abstract

The rapid growth of molecular foundation models and large language models has encouraged a scale centred view of AI in drug discovery, in which larger pretrained models are expected to supersede compact cheminformatics models and graph neural networks (GNNs) trained for individual tasks. We test this assumption across 26 endpoints for molecular properties, toxicity, safety liabilities and biological activity, grouped into ADME, toxicity and bioactivity classes. The benchmark contains 78 endpoint and split entries spanning random, Murcko scaffold and structure separated 5-fold CV. Ordered from easiest to hardest, these splits approximate retrospective evaluation on a closed library, scaffold expansion in hit to lead, and library expansion on novel chemotypes. Each entry includes ML, GNN, pretrained molecular sequence and LLM based SAR families. Across 156 fold mean comparisons, classical ML such as RF(ECFP4) and ExtraTrees(RDKit) win 116, GNNs such as GIN and Ligandformer win 25, pretrained sequence models such as MoLFormer and ChemBERTa2 win 12, and LLM based SAR baselines win three. ML dominates random split interpolation but loses part of this advantage under harder splits; GNN and sequence models also decline but gain relative ground, whereas LLM based SAR is weaker in absolute terms yet less sensitive to the split axis. Paired bootstrap analyses support family level trends more strongly than individual model rankings. SAR knowledge derived from training folds improves many GPT5.5-SAR and Opus4.7-SAR metrics but does not make rule based reasoning a universal substitute for supervised predictors. Compact specialized models remain highly effective for molecular property and activity prediction. Larger models add value for SAR interpretation and reasoning in low data settings, but predictive performance depends on the fit among model, task and validation scenario, not on scale alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript benchmarks classical ML models (e.g., RF(ECFP4), ExtraTrees(RDKit)), GNNs (e.g., GIN, Ligandformer), pretrained sequence models (e.g., MoLFormer, ChemBERTa2), and LLM-based SAR baselines across 26 endpoints in ADME, toxicity, and bioactivity categories. Using 78 endpoint/split entries with random, Murcko scaffold, and structure-separated 5-fold CV, it reports classical ML winning 116 of 156 fold-mean comparisons, GNNs winning 25, pretrained models 12, and LLMs 3. The central claim is that compact specialized models remain highly effective for predictive performance, while larger models add value mainly for SAR interpretation in low-data settings, with performance depending on model-task-validation fit rather than scale alone.

Significance. If the empirical results hold under scrutiny, the work is significant for providing a large-scale, multi-family comparison that challenges scale-centric assumptions in AI for drug discovery. It supplies concrete win-rate data and notes relative robustness of LLM-SAR to split difficulty, which could inform model selection. The use of paired bootstrap analyses for family trends and the ordering of splits by difficulty are positive features that increase the benchmark's reference value.
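
For readers unfamiliar with the technique, a paired bootstrap comparison of two models on the same test items might look like the sketch below. The paper's exact procedure is not described on this page, so the function name `paired_bootstrap_diff`, the per-item error inputs, and the resample count are assumptions made for illustration.

```python
import random

def paired_bootstrap_diff(err_a, err_b, n_boot=2000, seed=0):
    """Resample paired per-item errors with replacement.

    Returns the observed mean difference (A minus B) and the fraction of
    resamples in which model A has the lower total error.
    """
    assert len(err_a) == len(err_b)
    rng = random.Random(seed)
    n = len(err_a)
    # Pairing: each difference comes from the same test item for both models.
    diffs = [a - b for a, b in zip(err_a, err_b)]
    a_better = 0
    for _ in range(n_boot):
        resampled = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(resampled) < 0:
            a_better += 1
    return sum(diffs) / n, a_better / n_boot

# Hypothetical per-item absolute errors for two models on six test items.
err_a = [0.10, 0.20, 0.15, 0.30, 0.25, 0.10]
err_b = [0.20, 0.25, 0.20, 0.35, 0.20, 0.30]
mean_diff, frac_a_better = paired_bootstrap_diff(err_a, err_b)
```

Resampling the differences rather than each model's errors separately keeps the comparison tied to shared test items, which is what makes the bootstrap paired.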

major comments (2)
  1. [Results] Results section (win-count reporting): The abstract and results state classical ML wins 116 of 156 comparisons, yet no per-family variance, confidence intervals, or explicit tie-handling rules are supplied alongside the bootstrap analyses; without these, the strength of the family-level dominance claim cannot be fully evaluated from the numbers alone.
  2. [Methods] Methods (validation splits): The structure-separated 5-fold CV is used to approximate novel-chemotype library expansion, but the manuscript provides no quantitative checks for residual chemical similarity or analog leakage between folds, nor any comparison against strict temporal splits by assay or patent date; this leaves open whether the observed classical-ML advantage would persist under the distribution shifts typical of prospective drug-discovery validation.
minor comments (2)
  1. [Abstract] Abstract: The terms 'GPT5.5-SAR' and 'Opus4.7-SAR' appear without prior definition or reference to the underlying LLM versions and prompting strategy.
  2. [Tables] Tables: Ensure every results table lists the exact number of comparisons contributing to each win count so readers can verify the 156 total and the per-class breakdowns.
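
The headline totals can be checked directly from the four reported win counts. The sketch below only restates numbers given in the abstract; the dictionary keys are labels chosen here for readability.

```python
# Win counts as reported in the abstract.
wins = {
    "classical ML": 116,
    "GNN": 25,
    "pretrained sequence": 12,
    "LLM-based SAR": 3,
}

total = sum(wins.values())
shares = {family: count / total for family, count in wins.items()}

for family, share in shares.items():
    print(f"{family}: {wins[family]}/{total} = {share:.1%}")
```

The four counts sum to 156, matching the stated number of comparisons, and classical ML accounts for roughly three quarters of them. The per-split and per-class breakdowns the referee asks for cannot be recovered from these aggregates alone.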

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments and for recognizing the benchmark's scope and reference value. We address each major comment below with clarifications and indicate where revisions will be made.

Point-by-point responses
  1. Referee: [Results] Results section (win-count reporting): The abstract and results state classical ML wins 116 of 156 comparisons, yet no per-family variance, confidence intervals, or explicit tie-handling rules are supplied alongside the bootstrap analyses; without these, the strength of the family-level dominance claim cannot be fully evaluated from the numbers alone.

    Authors: The win counts are presented as descriptive aggregates, while the paired bootstrap analyses were used to evaluate family-level trends. We agree that explicit reporting of per-family variance, confidence intervals, and tie-handling rules would strengthen interpretability. In the revised manuscript we will add bootstrap-derived confidence intervals for each family's win rate and specify the tie-handling procedure (ties assigned proportionally to the tied families). revision: yes
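
    One way the promised intervals could be computed is a percentile bootstrap over comparisons, with a k-way tie contributing 1/k credit to each tied family. This is a sketch under assumptions: the function name `bootstrap_win_rate_ci` is hypothetical, the example input assumes no ties, and resampling the 156 comparisons as if they were independent ignores that they are nested within endpoints and splits.

```python
import random

def bootstrap_win_rate_ci(credits, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a family's mean win credit.

    credits: one value per comparison; 1.0 for a sole win, 0.0 for a loss,
    and 1/k when the family shares a k-way tie (proportional tie handling).
    """
    rng = random.Random(seed)
    n = len(credits)
    means = []
    for _ in range(n_boot):
        sample = [credits[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(credits) / n, lo, hi

# Illustrative input only: 116 sole wins out of 156, assuming no ties.
credits = [1.0] * 116 + [0.0] * 40
point, lo, hi = bootstrap_win_rate_ci(credits)
```

    A cluster bootstrap that resamples whole endpoints would be a more conservative choice if comparisons within an endpoint are correlated.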

  2. Referee: [Methods] Methods (validation splits): The structure-separated 5-fold CV is used to approximate novel-chemotype library expansion, but the manuscript provides no quantitative checks for residual chemical similarity or analog leakage between folds, nor any comparison against strict temporal splits by assay or patent date; this leaves open whether the observed classical-ML advantage would persist under the distribution shifts typical of prospective drug-discovery validation.

    Authors: We will incorporate quantitative checks for residual similarity, including mean and distribution of ECFP4 Tanimoto similarities between training and test folds, to document the degree of analog leakage. However, the source datasets do not contain consistent assay or patent dates for all 26 endpoints, precluding a uniform temporal-split comparison. We will note this data limitation explicitly and discuss the structure-separated split as a practical proxy for prospective validation. revision: partial
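
    The residual-similarity check could be sketched as below, using sets of on-bit indices as stand-ins for ECFP4 fingerprints. The nearest-training-neighbour summary, the function names, the toy fingerprints, and the 0.7 cutoff are assumptions for illustration; the rebuttal only commits to reporting the mean and distribution of train-test Tanimoto similarities.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

def nearest_train_similarity(test_fps, train_fps):
    """For each test fingerprint, the highest similarity to any training fingerprint."""
    return [max(tanimoto(t, tr) for tr in train_fps) for t in test_fps]

# Hypothetical on-bit sets standing in for real ECFP4 fingerprints.
train = [{1, 2, 3, 4}, {10, 11, 12}, {20, 21}]
test = [{1, 2, 3, 5}, {30, 31}]

nn_sims = nearest_train_similarity(test, train)
mean_nn = sum(nn_sims) / len(nn_sims)
# 0.7 is an arbitrary illustrative cutoff for flagging close analogs.
leaky = [s for s in nn_sims if s >= 0.7]
```

    Comparing the distribution of `nn_sims` across the random, scaffold, and structure-separated splits would show whether the harder splits actually push test compounds further from the training set.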

standing simulated objections not resolved
  • Direct comparison to strict temporal splits by assay or patent date, because consistent temporal metadata is unavailable across the full set of public datasets used.

Circularity Check

0 steps flagged

No circularity: empirical benchmark rests on direct held-out comparisons

full rationale

The paper reports model performance via explicit 5-fold CV on 78 endpoint/split combinations, counting wins across 156 comparisons and supporting trends with paired bootstrap. No equations, fitted parameters, or derivations are present; results are computed directly from held-out test folds rather than being redefined or predicted from the training statistics themselves. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim therefore remains an independent empirical observation rather than a self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper is an empirical benchmark study. It introduces no new mathematical derivations, fitted constants, or postulated physical entities. The main untested premise is that the chosen endpoints and splits stand in for real drug-discovery scenarios.

axioms (1)
  • domain assumption The random, Murcko scaffold, and structure-separated 5-fold CV splits approximate retrospective closed-library evaluation, scaffold expansion in hit-to-lead, and library expansion on novel chemotypes.
    Explicitly stated in the abstract as the ordering from easiest to hardest splits.
