Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development

Cao Xiao; Connor W. Coley; Jimeng Sun; Jure Leskovec; Kexin Huang; Marinka Zitnik; Tianfan Fu; Wenhao Gao; Yue Zhao; Yusuf Roohani

arxiv: 2102.09548 · v2 · pith:444GTIQDnew · submitted 2021-02-18 · cs.LG · cs.CY · q-bio.BM · q-bio.QM

keywords: data, learning, therapeutics, datasets, development, machine, tasks, across

Therapeutics machine learning is an emerging field with incredible opportunities for innovation and impact. However, advancement in this field requires formulation of meaningful learning tasks and careful curation of datasets. Here, we introduce Therapeutics Data Commons (TDC), the first unifying platform to systematically access and evaluate machine learning across the entire range of therapeutics. To date, TDC includes 66 AI-ready datasets spread across 22 learning tasks and spanning the discovery and development of safe and effective medicines. TDC also provides an ecosystem of tools and community resources, including 33 data functions and types of meaningful data splits, 23 strategies for systematic model evaluation, 17 molecule generation oracles, and 29 public leaderboards. All resources are integrated and accessible via an open Python library. We carry out extensive experiments on selected datasets, demonstrating that even the strongest algorithms fall short of solving key therapeutics challenges, including real dataset distributional shifts, multi-scale modeling of heterogeneous data, and robust generalization to novel data points. We envision that TDC can facilitate algorithmic and scientific advances and considerably accelerate machine-learning model development, validation and transition into biomedical and clinical implementation. TDC is an open-science initiative available at https://tdcommons.ai.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SupraBench: A Benchmark for Supramolecular Chemistry
cs.LG 2026-06 unverdicted novelty 7.0

SupraBench introduces four core tasks and a curated corpus to benchmark LLMs on host-guest chemistry reasoning, showing substantial remaining headroom and task-specific failure modes.
Benchmarking AI Agents for Addressing Scientific Challenges Across Scales
cs.AI 2026-06 unverdicted novelty 7.0

SciAgentArena is a new interactive benchmark for AI agents on scientific tasks that finds agents handle clear data-analysis workflows but struggle with novel insights, self-directed exploration, and open-ended questions.
A Systematic Evaluation of Molecular Mixture Behavior Prediction
cs.LG 2026-05 unverdicted novelty 7.0

Strong absolute accuracy on mixture properties often masks poor recovery of non-ideal behavior, with large drops under strict molecule splits, making transfer to unseen molecules the central challenge.
MolLingo: Molecule-Native Representations for LLM-Powered Scientific Agents
cs.AI 2026-05 unverdicted novelty 7.0

MolLingo introduces a multi-agent framework with BFE molecular representation and docking-grounded reasoning to outperform frontier LLMs on molecular design benchmarks including fourfold docking score gains.
BioXArena: Benchmarking LLM Agents on Multi-Modal Biomedical Machine Learning Tasks
cs.CE 2026-05 unverdicted novelty 7.0

BioXArena benchmarks LLM agents on generating end-to-end ML pipelines for 76 multi-modal biomedical tasks, with MLEvolve plus Gemini-3.1-Pro scoring highest at 0.666.
Chem-GMNet: A Sphere-Native Geometric Transformer for Molecular Property Prediction
cs.LG 2026-05 unverdicted novelty 7.0

Chem-GMNet uses sphere-native embeddings, DualSKA attention, and SH-FFN layers to match or beat ChemBERTa-2 on MoleculeNet tasks with fewer parameters and sometimes no pretraining.
How Creative Are Large Language Models in Generating Molecules?
cs.CL 2026-04 unverdicted novelty 7.0

Large language models exhibit distinct creative patterns in molecule generation, including higher constraint satisfaction when more constraints are added, and this is the first work to reframe molecule generation abil...
Active-GRPO: Adaptive Imitation and Self-Improving Reasoning for Molecular Optimization
cs.LG 2026-07 unverdicted novelty 6.0

Active-GRPO reaches 0.1773 average SRxSim on TOMG-Bench MOLOPT by adaptively switching between imitation and self-reinforcement while upgrading references, outperforming GRPO and RePO.
What Does a Chemical Language Model Know About Molecules?
cs.LG 2026-06 unverdicted novelty 6.0

Sparse autoencoders on MolFormer reveal position-tracking latents in early layers and atom-in-substructure plus pharmacologically relevant features in later layers, with non-canonical SMILES causing greater representa...
Towards Diverse Scientific Hypothesis Search with Large Language Models
cs.LG 2026-06 unverdicted novelty 6.0

A parallel-tempering evolutionary framework for LLM hypothesis search improves both quality and diversity of candidates in molecular, equation, and algorithm discovery under fixed validation budgets.
AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation
cs.AI 2026-05 unverdicted novelty 6.0

Decentralized AI agent teams self-organize around hypotheses, critique proposals, and share knowledge to outperform single-agent baselines on biomedical ML, language-model optimization, and protein fitness tasks.
Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction
cs.LG 2026-04 unverdicted novelty 6.0

Benchmark across 78 endpoint-split entries finds classical ML winning 47.4% of best performances over pretrained models, GNNs, and LLMs, with performance depending on model-task-split fit rather than scale.
Tabular foundation models for in-context prediction of molecular properties
cs.LG 2026-04 unverdicted novelty 6.0

Tabular foundation models achieve high accuracy in molecular property prediction through in-context learning, with up to 100% win rates on MoleculeACE tasks when paired with CheMeleon embeddings.
Suiren-1.0 Technical Report: A Family of Molecular Foundation Models
physics.chem-ph 2026-03 unverdicted novelty 6.0

Suiren-1.0 is a family of three molecular foundation models (Base, Dimer, ConfAvg) pre-trained on 70M+ DFT samples and distilled to achieve claimed state-of-the-art performance on quantum property prediction tasks fro...
GraphPINE: Graph Importance Propagation for Interpretable Drug Response Prediction
cs.LG 2025-04 unverdicted novelty 6.0

GraphPINE is a GNN architecture that initializes node importance from prior knowledge graphs and propagates updates via an importance propagation layer for interpretable drug response prediction on over 5,000 genes an...
Data Collaboration Analysis with Orthonormal Basis Selection and Alignment
cs.LG 2024-03 unverdicted novelty 6.0

Orthonormal Data Collaboration (ODC) enforces orthonormal secret and target bases so that alignment reduces to the Orthogonal Procrustes problem, yielding O(acl^2) complexity, orthogonal concordance, and downstream pe...
Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction
cs.LG 2026-04 unverdicted novelty 5.0

A benchmark across 156 comparisons finds classical ML models win 116 times while larger pretrained and LLM models win far fewer, showing predictive performance depends on model-task fit rather than scale.
Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction
cs.LG 2026-04 unverdicted novelty 4.0

Large benchmark shows classical ML and GNNs outperform pretrained large models on most of 22 drug-discovery endpoints under strict cross-validation.
Quantum-inspired Reinforcement Learning for Synthesizable Drug Design
cs.LG 2024-09 unverdicted novelty 4.0

Reinforcement learning with a quantum-inspired simulated annealing policy neural network is applied to synthesizable molecular optimization and reports competitive results against genetic algorithm baselines on the PM...