pith. machine review for the scientific record.

arxiv: 2604.26498 · v1 · submitted 2026-04-29 · 💻 cs.LG · q-bio.QM


Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction


Pith reviewed 2026-05-07 11:31 UTC · model grok-4.3

classification 💻 cs.LG q-bio.QM
keywords molecular property prediction · drug discovery · model scaling · benchmark · graph neural networks · pretrained models · ADMET prediction · structure-activity relationship

The pith

Compact classical and graph models outperform larger pretrained ones on most molecular prediction tasks in a large benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether bigger pretrained models automatically beat smaller, specialized ones for predicting molecular properties and activities in drug discovery. It runs a large-scale comparison across 22 endpoints using structure-similarity-separated cross-validation that covers public ADMET and Tox21 data plus internal anti-infective sets. Classical machine-learning models win the most tasks, followed by graph neural networks, while pretrained sequence models win fewer; rule-based large-language-model reasoning baselines do not win under the main metrics. The results show that predictive gains are modest and depend on the specific endpoint, data regime, and validation setup rather than raw model size. A sympathetic reader sees this as evidence that scaling alone does not guarantee better molecular predictions and that alignment between representation and task matters more.

Core claim

Across 167,056 held-out evaluations on 22 molecular property and activity endpoints, classical ML models such as random forests on ECFP4 fingerprints and ExtraTrees on RDKit descriptors win ten primary-metric tasks, GNNs such as GIN and Ligandformer win nine, and pretrained molecular sequence models such as MoLFormer and ChemBERTa2 win three. Rule-based SAR reasoning with large language models does not win under the prespecified primary metrics, although it shows some gains when supplied with training-fold knowledge. These outcomes indicate that compact, specialized models remain highly effective, that performance differences among model classes are often modest and endpoint-dependent, and that larger or more general models do not provide a universal predictive advantage.
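The headline comparison is just a tally of per-endpoint winners under each task's primary metric. A minimal sketch of that bookkeeping — the endpoint names and winner assignments below are hypothetical stand-ins, not the paper's actual results table:

```python
from collections import Counter

# Hypothetical per-endpoint winners under each task's primary metric.
# In the paper this table would have 22 rows; these five are illustrative.
primary_metric_winner = {
    "CYP3A4 inhibition": "classical ML",
    "hERG block": "classical ML",
    "Tox21 NR-AR": "GNN",
    "anti-TB activity": "GNN",
    "antimalarial activity": "pretrained sequence",
}

def tally_wins(winner_by_endpoint):
    """Count primary-metric wins per model family across endpoints."""
    return Counter(winner_by_endpoint.values())
```

With the real table, `tally_wins` would return the paper's 10 / 9 / 3 split across classical ML, GNN, and pretrained sequence families.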

What carries the argument

Structure-similarity-separated five-fold cross-validation benchmark on 22 endpoints comparing classical ML, GNNs, pretrained sequence models, and rule-based LLM baselines.
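The load-bearing protocol is cluster-then-assign: group molecules so structural near-neighbors share a cluster, then hand whole clusters to folds. A minimal pure-Python sketch, assuming a precomputed pairwise similarity function; the paper's exact clustering algorithm is not specified in this summary, so greedy leader clustering stands in here:

```python
def leader_cluster(items, sim, threshold):
    """Greedy leader clustering: an item joins the first cluster whose
    leader is at least `threshold` similar; otherwise it starts a new
    cluster. A simple stand-in for the structure-similarity grouping."""
    leaders, clusters = [], []
    for x in items:
        for lead, members in zip(leaders, clusters):
            if sim(x, lead) >= threshold:
                members.append(x)
                break
        else:  # no sufficiently similar leader found
            leaders.append(x)
            clusters.append([x])
    return clusters

def clusters_to_folds(clusters, k=5):
    """Assign whole clusters to folds, largest first, each going to the
    currently smallest fold, so similar molecules never straddle folds."""
    folds = [[] for _ in range(k)]
    for cluster in sorted(clusters, key=len, reverse=True):
        min(folds, key=len).extend(cluster)
    return folds
```

Held-out evaluation then trains on k−1 folds and tests on the remaining one, k times over; because clusters stay intact, test molecules are structurally separated from training molecules, up to the approximation the leader-based grouping introduces.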

If this is right

  • Performance gains from model class are modest and vary strongly with the biological endpoint.
  • Compact specialized models can match or exceed larger general models for standard predictive tasks.
  • Pretrained molecular models do not deliver a universal advantage in supervised property prediction.
  • Large models may still be useful for zero-shot reasoning and SAR interpretation rather than direct prediction.
  • Model selection in drug discovery should prioritize alignment of representation, inductive bias, and data regime over scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Drug-discovery teams could reduce compute costs by preferring compact models for routine property prediction while reserving larger models for hypothesis generation.
  • Future benchmarks should include more internal pharmaceutical data and real-world deployment metrics to test whether the observed pattern holds outside public datasets.
  • The modest performance gaps suggest that hybrid approaches combining classical fingerprints with lightweight graph layers may be sufficient for many endpoints.

Load-bearing premise

The chosen endpoints, similarity-separated validation protocol, and selected model implementations fairly represent typical molecular prediction problems without systematically favoring any model class.

What would settle it

A follow-up study on the same or expanded endpoints that uses a different validation split or includes more diverse external test sets and finds larger pretrained models winning a clear majority of tasks under the same primary metrics.

Figures

Figures reproduced from arXiv: 2604.26498 by Jinjiang Guo.

Figure 1. Model taxonomy for the benchmark: small ML, GNN, pretrained …
Figure 2. Molecular representation pathways compared in the benchmark.
Figure 3. Structure-similarity-separated five-fold cross-validation workflow.
Figure 4. Proportional summary of model-family wins across ADMET, Tox21 and anti-infective …
Figure 5. Effect of train-fold-derived SAR knowledge on LLM-SAR performance across task groups.
read the original abstract

The rapid growth of molecular foundation models and general-purpose large language models has encouraged a scale-centric view of artificial intelligence in drug discovery, in which larger pretrained models are expected to supersede compact cheminformatics models and task-specific graph neural networks (GNNs). We test this assumption on 22 molecular property and activity endpoints, including public ADMET and Tox21 benchmarks and two internal anti-infective activity datasets. Across 167,056 held-out task--molecule evaluations under structure-similarity-separated five-fold cross-validation (37,756 ADMET, 77,946 Tox21, 49,266 anti-TB and 2,088 antimalaria), classical machine-learning (ML) models such as RF(ECFP4) and ExtraTrees(RDKit descriptors) win ten primary-metric tasks, GNNs such as GIN and Ligandformer win nine, and pretrained molecular sequence models such as MoLFormer and ChemBERTa2 win three. Rule-based SAR reasoning baselines, represented by GPT5.5-SAR and Opus4.7-SAR, do not win under the prespecified primary metrics, although train-fold-derived SAR knowledge provides measurable but uneven gains for SAR reasoning and interpretation. These results indicate that compact, specialized models remain highly effective for molecular property and activity prediction. The performance differences among classical ML, GNN and pretrained sequence models are often modest and endpoint-dependent, whereas larger or more general models do not provide a universal predictive advantage. Large models may still add value for zero-shot reasoning, SAR interpretation and hypothesis generation, but the results suggest that predictive performance depends on the alignment among molecular representation, inductive bias, data regime, endpoint biology and validation protocol.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript benchmarks classical ML (RF(ECFP4), ExtraTrees(RDKit)), GNNs (GIN, Ligandformer), and pretrained sequence models (MoLFormer, ChemBERTa2) plus rule-based SAR baselines on 22 molecular property/activity endpoints (public ADMET, Tox21, and two internal anti-infective sets). Using structure-similarity-separated 5-fold CV across 167,056 held-out evaluations, it reports classical ML winning 10 primary-metric tasks, GNNs winning 9, and sequence models winning 3, concluding that larger or more general models do not confer a universal predictive advantage and that compact specialized models remain highly effective, with differences being modest and endpoint-dependent.

Significance. If the results are robust, the work supplies a large-scale empirical counterpoint to scale-centric assumptions in AI-driven drug discovery. The volume of evaluations (37k ADMET + 78k Tox21 + 51k internal) and inclusion of both public and proprietary endpoints provide concrete data on when representation-inductive-bias alignment matters more than model size. The explicit separation of predictive performance from zero-shot/SAR-interpretation uses of large models is a useful distinction for practitioners.

major comments (2)
  1. [Abstract] Abstract and validation protocol: the central claim that classical ML wins 10 tasks versus 3 for pretrained sequence models rests on the structure-similarity-separated 5-fold CV. Because the split metric is likely computed from fingerprints or descriptors that directly overlap with the input representations of RF(ECFP4) and ExtraTrees(RDKit), the held-out folds may be systematically easier for classical models than for SMILES-sequence models whose pretraining does not optimize for the same local substructure similarity. This coupling risks confounding the reported win distribution with validation design rather than intrinsic scaling behavior.
  2. [Methods] Methods (CV and model details): the manuscript should specify the exact similarity metric and threshold used for fold separation, report per-endpoint performance tables with confidence intervals or standard deviations across folds, and include at least one control experiment (e.g., random splits or similarity-agnostic splits) to test whether the observed advantage for fingerprint-based models persists.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'train-fold-derived SAR knowledge provides measurable but uneven gains' would benefit from a quantitative summary or supplementary table showing the magnitude of those gains across endpoints.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our benchmark study. We address the concerns about the validation protocol and methods below, with revisions to improve transparency and reporting.

read point-by-point responses
  1. Referee: [Abstract] Abstract and validation protocol: the central claim that classical ML wins 10 tasks versus 3 for pretrained sequence models rests on the structure-similarity-separated 5-fold CV. Because the split metric is likely computed from fingerprints or descriptors that directly overlap with the input representations of RF(ECFP4) and ExtraTrees(RDKit), the held-out folds may be systematically easier for classical models than for SMILES-sequence models whose pretraining does not optimize for the same local substructure similarity. This coupling risks confounding the reported win distribution with validation design rather than intrinsic scaling behavior.

    Authors: The structure-similarity split follows standard practice in molecular machine learning to simulate realistic generalization to novel chemical matter and avoid leakage from near-duplicates. We have revised the Methods to explicitly state that fold separation uses Tanimoto similarity on ECFP4 fingerprints (threshold now specified). While this representation overlaps with one classical baseline, the same partitions are used for all models, and pretrained sequence models are expected to capture substructure patterns from their large-scale pretraining. The endpoint-dependent results and modest effect sizes indicate that the win distribution reflects inductive bias alignment rather than an artifact of the split. We added a clarifying sentence to the abstract and a short discussion paragraph on validation design. revision: partial

  2. Referee: [Methods] Methods (CV and model details): the manuscript should specify the exact similarity metric and threshold used for fold separation, report per-endpoint performance tables with confidence intervals or standard deviations across folds, and include at least one control experiment (e.g., random splits or similarity-agnostic splits) to test whether the observed advantage for fingerprint-based models persists.

    Authors: We agree on the need for greater detail. The revised Methods now specifies the exact metric (Tanimoto similarity on ECFP4 fingerprints) and the threshold used for fold separation. We have added supplementary tables reporting mean performance and standard deviation across the five folds for every endpoint, model, and primary/secondary metric. For the control, we performed an additional analysis using random splits; under this protocol all models improve but the relative ordering (classical ML and GNNs competitive with or ahead of sequence models on most tasks) is preserved. These results are reported in a new subsection and support that the main findings are not driven solely by the similarity-based split. revision: yes
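The split criterion the authors now specify, Tanimoto similarity on ECFP4 fingerprints, reduces to a Jaccard ratio over fingerprint on-bits. A minimal sketch, with plain Python sets standing in for RDKit ECFP4 bit vectors:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets:
    |A ∩ B| / |A ∪ B|. In practice the sets would hold the on-bit
    indices of ECFP4 fingerprints; any hashable indices work here."""
    if not fp_a and not fp_b:
        return 1.0  # convention: two empty fingerprints count as identical
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)
```

For example, `tanimoto({1, 2, 3}, {2, 3, 4})` evaluates to 0.5; fold separation then forbids train/test pairs whose similarity exceeds the chosen threshold.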

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with direct data-driven comparisons

full rationale

The paper reports model performance rankings across 22 endpoints under a fixed structure-similarity-separated 5-fold CV protocol. No derivations, equations, or fitted parameters are present that could reduce to self-definition or prediction-by-construction. Central claims rest on tabulated win counts (RF/ECFP4 wins 10, GNNs win 9, sequence models win 3) obtained from explicit held-out evaluations, not from any ansatz, uniqueness theorem, or self-citation chain. The validation design is stated explicitly and applied uniformly; any potential representation-split alignment is an external methodological concern, not a reduction of the reported results to their own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical axioms, free parameters, or invented entities; the work is an empirical benchmark study.

pith-pipeline@v0.9.0 · 5619 in / 967 out tokens · 37268 ms · 2026-05-07T11:31:54.467959+00:00 · methodology

