GemNet-OC: Developing Graph Neural Networks for Large and Diverse Molecular Simulation Datasets

Abhishek Das; Anuroop Sriram; C. Lawrence Zitnick; Johannes Gasteiger; Muhammed Shuaibi; Stephan Günnemann; Zachary Ulissi

arxiv: 2204.02782 · v3 · pith:DX2TFTK2new · submitted 2022-04-06 · 💻 cs.LG · cond-mat.mtrl-sci · physics.chem-ph · physics.comp-ph

Johannes Gasteiger, Muhammed Shuaibi, Anuroop Sriram, Stephan Günnemann, Zachary Ulissi, C. Lawrence Zitnick, Abhishek Das

classification 💻 cs.LG · cond-mat.mtrl-sci · physics.chem-ph · physics.comp-ph

keywords datasets · dataset · model · gemnet-oc · oc20 · developing · large · molecular

Recent years have seen the advent of molecular simulation datasets that are orders of magnitude larger and more diverse. These new datasets differ substantially in four aspects of complexity: (1) chemical diversity (number of different elements), (2) system size (number of atoms per sample), (3) dataset size (number of data samples), and (4) domain shift (similarity of the training and test set). Despite these large differences, benchmarks on small and narrow datasets remain the predominant method of demonstrating progress in graph neural networks (GNNs) for molecular simulation, likely due to cheaper training compute requirements. This raises the question: does GNN progress on small and narrow datasets translate to these more complex datasets? This work investigates this question by first developing the GemNet-OC model based on the large Open Catalyst 2020 (OC20) dataset. GemNet-OC outperforms the previous state-of-the-art on OC20 by 16% while reducing training time by a factor of 10. We then compare the impact of 18 model components and hyperparameter choices on performance in multiple datasets. We find that the resulting model would be drastically different depending on the dataset used for making model choices. To isolate the source of this discrepancy, we study six subsets of the OC20 dataset that individually test each of the above-mentioned four dataset aspects. We find that results on the OC-2M subset correlate well with the full OC20 dataset while being substantially cheaper to train on. Our findings challenge the common practice of developing GNNs solely on small datasets, but highlight ways of achieving fast development cycles and generalizable results via moderately sized, representative datasets such as OC-2M and efficient models such as GemNet-OC. Our code and pretrained model weights are open-sourced.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ConSolv: Solvent-Conditional Machine Learning Implicit Solvent Potential
physics.chem-ph 2026-06 unverdicted novelty 7.0

ConSolv is a solvent-conditional attention-based MLP trained on experimental and ab initio solvation free energies that generalizes across 66 organic solvents and matches some experimental NMR data.

DPA4: Pushing the Accuracy-Cost Frontier of Interatomic Potentials with EMFA SO(2) Convolution
physics.chem-ph 2026-06 unverdicted novelty 7.0

DPA4 is a new SE(3)-equivariant interatomic potential with EMFA SO(2) convolution that sets new accuracy-cost records on Matbench Discovery and SPICE benchmarks using fewer parameters than prior models.

TSAgent: An Agentic Workflow for Autonomous Transition State Search
physics.chem-ph 2026-05 unverdicted novelty 6.0

TSAgent automates transition state searches at DFT accuracy via an agentic loop, reaching 83% success on 100 OC20NEB examples and 70% on 10 held-out cases versus 73% for human experts.

Benchmarking Compositional Generalisation for Machine Learning Interatomic Potentials
cs.LG 2026-05 unverdicted novelty 6.0

A new benchmark finds that state-of-the-art ML interatomic potentials struggle with compositional generalization, producing errors an order of magnitude higher on unseen molecular combinations than on training-like cases.

Selectivity- and Activity-Aware Catalyst Descriptors for CO$_2$ Hydrogenation on Alloy Nanocatalysts using Machine-Learned Force Fields
cond-mat.mtrl-sci 2026-05 unverdicted novelty 6.0

A facet-resolved adsorption energy distribution method with ML force fields identifies active and methanol-selective alloy nanocatalyst surfaces for CO2 hydrogenation.