pith:3JNQYK2S
Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction
Classical ML models outperform larger pretrained and LLM approaches in most molecular prediction tasks for drug discovery
arxiv:2604.26498 v2 · 2026-04-29 · cs.LG · q-bio.QM
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{3JNQYK2S4I7WI4PFLAQSS34ZUR}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
Across 156 fold-mean comparisons, classical ML models such as RF(ECFP4) and ExtraTrees(RDKit) win 116, GNNs such as GIN and Ligandformer win 25, pretrained sequence models such as MoLFormer and ChemBERTa2 win 12, and LLM-based SAR baselines win 3. Compact specialized models remain highly effective for molecular property and activity prediction.
The 78 endpoint and split entries, grouped into ADME, toxicity and bioactivity classes and using random, Murcko scaffold, and structure-separated 5-fold CV, adequately represent the spectrum of real-world drug discovery challenges from closed-library retrospective evaluation to novel chemotype library expansion.
A benchmark across 156 comparisons finds that classical ML models win 116 times, while pretrained sequence models and LLM-based baselines win only 12 and 3, showing that predictive performance depends on model-task fit rather than scale.
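The win counts quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the paper's analysis code: the dictionary keys are labels chosen here for illustration, and only the four counts come from the claims.

```python
# Win counts per model family, copied from the claims above.
# The family labels are illustrative names, not identifiers from the paper.
wins = {
    "classical ML": 116,
    "GNN": 25,
    "pretrained sequence": 12,
    "LLM-based SAR": 3,
}

# The four counts should account for every fold-mean comparison.
total = sum(wins.values())

# Share of comparisons won by each family, in percent, to one decimal place.
shares = {family: round(100 * n / total, 1) for family, n in wins.items()}

for family, pct in shares.items():
    print(f"{family}: {wins[family]}/{total} ({pct}%)")
```

The counts sum to 156, so each comparison has exactly one winning family, and classical ML accounts for roughly three quarters of them.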
Receipt and verification
| First computed | 2026-05-20T00:00:39.586358Z |
|---|---|
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
da5b0c2b52e23f6471e55821296f99a46da4ca73fb416bcd32bdc54cec0ed4c3
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/3JNQYK2S4I7WI4PFLAQSS34ZUR \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: da5b0c2b52e23f6471e55821296f99a46da4ca73fb416bcd32bdc54cec0ed4c3
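The canonicalization step in the one-liner above can also be written as a standalone script. This is a minimal sketch that mirrors the same `json.dumps` arguments (sorted keys, compact separators, no ASCII escaping) followed by SHA-256 over the UTF-8 bytes. The function names `canonical_sha256` and `matches` are hypothetical helpers introduced here, not part of any Pith tooling, and the toy records are placeholders rather than real Pith data.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record using the same serialization as the shell one-liner."""
    canonical = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters unescaped
    ).encode()                   # str.encode() defaults to UTF-8
    return hashlib.sha256(canonical).hexdigest()


def matches(record: dict, expected_hex: str) -> bool:
    """Compare a record's canonical hash with a published hex digest."""
    return canonical_sha256(record) == expected_hex.strip().lower()


# Toy records (placeholder values) that differ only in key order.
a = {"source": {"kind": "arxiv", "id": "0000.00000"}, "schema_version": "1.0"}
b = {"schema_version": "1.0", "source": {"id": "0000.00000", "kind": "arxiv"}}

# Both serialize to the same canonical bytes, so the digests agree.
print(canonical_sha256(a) == canonical_sha256(b))
```

Passing the `canonical_record` object returned by the API to `matches`, together with the canonical hash shown on this page, should reproduce the check performed by the shell command, assuming the API response is the record that was hashed.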
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "68f33d041722283f2502d820cb08e271088a761ed6bb332dd7e8585e5f5012a8",
    "cross_cats_sorted": [
      "q-bio.QM"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-04-29T10:01:16Z",
    "title_canon_sha256": "e9b2b6c7870d2c326f35efcd8a570b734b63e70cbab790c4ffccb8165d2683fb"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2604.26498",
    "kind": "arxiv",
    "version": 2
  }
}