pith. machine review for the scientific record.

arxiv: 2601.16331 · v2 · submitted 2026-01-22 · ⚛️ physics.chem-ph

Recognition: no theorem link

Accuracy and Efficiency Benchmarks of Pretrained Machine Learning Potentials for Molecular Simulations

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 11:25 UTC · model grok-4.3

classification ⚛️ physics.chem-ph
keywords machine learning interatomic potentials · MLIP benchmarks · molecular simulations · model accuracy · computational efficiency · training set size · Coulomb interactions · pretrained models

The pith

Benchmarks of 15 pretrained MLIPs show accuracy rises with parameter count and training set size, with no gain from explicit Coulomb terms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper evaluates fifteen pretrained machine learning interatomic potentials on molecular systems for accuracy against reference calculations, computational speed, memory consumption, and the ability to maintain stable simulations over time. The evaluation reveals that both higher parameter counts and larger training datasets correlate strongly with better accuracy across the tested models. Adding explicit Coulomb energy terms to the models produced no measurable improvement in accuracy. Speed and memory use turned out to depend as much on the specific model architecture as on the total number of parameters. These findings supply concrete guidance for selecting among existing pretrained models and for prioritizing design choices in future development.

Core claim

Fifteen pretrained MLIPs were benchmarked on accuracy for molecular systems, speed, memory requirements, and long-term simulation stability. Accuracy showed strong positive correlation with both the number of model parameters and the size of the training set. Models that included explicit Coulomb electrostatic energy terms did not outperform those that learned electrostatics implicitly. Speed and memory consumption were shaped comparably by architecture details and by overall model scale.
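As a hedged sketch of the kind of correlation analysis this claim implies, the snippet below computes a Pearson coefficient between log-scaled parameter count and model error. All numbers here are illustrative placeholders, not the paper's data; since error falls as parameters grow, the expected coefficient is strongly negative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical benchmark rows: (parameter count, force error) -- made-up values.
params = [0.5e6, 2e6, 9e6, 30e6, 150e6]
mae = [12.0, 9.5, 6.1, 4.8, 3.2]

# Correlate error against log10(parameters), since model sizes span orders of magnitude.
r = pearson_r([math.log10(p) for p in params], mae)
print(f"r = {r:.3f}")  # strongly negative: more parameters, lower error
```

Correlating against the logarithm of the parameter count is the usual choice when the quantity ranges over several orders of magnitude; a raw-scale Pearson coefficient would be dominated by the largest models.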

What carries the argument

Comparative benchmarking of fifteen pretrained machine learning interatomic potentials on standardized accuracy, efficiency, memory, and stability metrics for molecular simulations.

If this is right

  • Increasing the number of parameters and the size of the training set reliably improves accuracy.
  • Explicit Coulomb energy terms confer no accuracy advantage in the tested molecular systems.
  • Architecture choices influence speed and memory use at least as strongly as raw model size.
  • The observed correlations supply a practical basis for choosing among available pretrained MLIPs.
  • Stable molecular dynamics simulations remain feasible with the evaluated models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Future MLIP development can focus resources on scaling training data and parameter counts rather than on hand-crafted electrostatic terms.
  • The same scaling trends may appear in other interatomic potential families or when applied to more complex reactive chemistry.
  • Targeted architecture improvements could support larger models without proportional increases in memory or slowdowns.
  • Repeating the benchmark on condensed-phase or reactive systems would test how far the size and data correlations extend.

Load-bearing premise

The fifteen selected pretrained MLIPs together with the chosen molecular test systems and performance metrics are representative enough to reveal general trends about what controls accuracy and efficiency.

What would settle it

A pretrained MLIP with relatively few parameters that nevertheless matches or exceeds the accuracy of the largest models on the same test systems, or a model with explicit Coulomb terms that clearly outperforms otherwise similar models without them, would undercut the reported correlations; a single counterexample would weaken them rather than strictly falsify them.

Figures

Figures reproduced from arXiv: 2601.16331 by Evan Pretti, Peter Eastman, Thomas E. Markland.

Figure 1. Model error versus (a) number of parameters and (b) number of training samples. [PITH_FULL_IMAGE:figures/full_fig_p008_1.png]
Figure 2. Speed versus number of atoms. [PITH_FULL_IMAGE:figures/full_fig_p009_2.png]
Figure 3. Memory required versus number of atoms (per-model table for systems of 50 to 21,384 atoms, covering AIMNet2, AceFF-1.1, AceFF-2.0, Egret-1, FeNNix-Bio1(M), FeNNix-Bio1(S), MACE-M…; data truncated in extraction).
Figure 4. Model error on the test set versus speed on (a) a 50-atom molecule and (b) a 2661-atom molecule. [PITH_FULL_IMAGE:figures/full_fig_p011_4.png]
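The speed-versus-system-size panels invite a simple scaling check. Below is a minimal, hypothetical timing harness for a naive O(N²) pairwise energy loop; it is a stand-in for a force/energy call, not any of the benchmarked MLIPs or the paper's harness.

```python
import time
import random

def pairwise_energy(coords, cutoff=0.5):
    """Naive O(N^2) pair sum over points within a cutoff -- a toy energy call."""
    e = 0.0
    n = len(coords)
    for i in range(n):
        xi, yi, zi = coords[i]
        for j in range(i + 1, n):
            xj, yj, zj = coords[j]
            r2 = (xi - xj) ** 2 + (yi - yj) ** 2 + (zi - zj) ** 2
            if r2 < cutoff * cutoff:
                e += 1.0 / (r2 + 1e-12)  # soft 1/r^2 repulsion, arbitrary form
    return e

# Time the call at a few system sizes to see the scaling trend.
for n in [50, 100, 200]:
    coords = [(random.random(), random.random(), random.random()) for _ in range(n)]
    t0 = time.perf_counter()
    pairwise_energy(coords)
    dt = time.perf_counter() - t0
    print(f"{n:4d} atoms: {dt * 1e3:.2f} ms")
```

Real MLIPs avoid the quadratic loop with neighbor lists and batched tensor operations, which is one reason the paper can find that architecture matters as much as raw size for speed and memory.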
Original abstract

The rapid development of pretrained Machine Learning Interatomic Potentials (MLIPs) that cover a wide range of molecular species has made it challenging to select the best model for a given application. We benchmark 15 pretrained MLIPs, evaluating each one on accuracy, speed, memory use, and ability to produce stable simulations. This provides an objective basis for practitioners to select the most appropriate MLIP for their own simulations, and offers insight into which factors most strongly influence model accuracy. We find that the number of model parameters and the size of the training set are both strongly correlated with accuracy, but observe no benefit from including explicit Coulomb energy terms. Speed and memory use are determined as much by the model architecture as by the size of the model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript benchmarks 15 pretrained MLIPs on accuracy, speed, memory consumption, and long-term simulation stability across molecular systems. It reports strong positive correlations between accuracy and both model parameter count and training-set size, no accuracy gain from explicit Coulomb terms, and that runtime/memory depend on architecture as well as size.

Significance. If the model selection and test-suite choices prove representative, the work supplies immediately usable selection criteria for practitioners and identifies two controllable design variables (parameter count, data volume) that future MLIP developers can target.

major comments (2)
  1. [Abstract and §2] Abstract and §2 (Methods): the central correlations are stated without reported error bars, p-values, or explicit exclusion criteria for outliers or failed simulations. The strength of the claimed “strong correlation” cannot be assessed from the provided text.
  2. [§3 and Table 2] §3 (Results) and Table 2: the 15 models appear clustered in a few architectural families (mostly equivariant GNNs trained on similar organic datasets). No quantitative diversity metric (e.g., pairwise architectural distance or elemental coverage) is supplied, so the claim that parameter count and training-set size are general drivers rather than selection artifacts remains untested.
minor comments (2)
  1. [Figure 3] Figure 3: axis labels and units for memory usage are missing; add them for reproducibility.
  2. [§4] §4 (Discussion): the statement that “no benefit from explicit Coulomb terms” should be qualified by the specific charge models and cutoff radii used in the compared potentials.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our benchmark study of pretrained MLIPs. We have revised the manuscript to address the major comments by adding statistical rigor to the correlation analysis and a quantitative assessment of model diversity. These changes strengthen the presentation without altering the core findings.

Point-by-point responses
  1. Referee: [Abstract and §2] Abstract and §2 (Methods): the central correlations are stated without reported error bars, p-values, or explicit exclusion criteria for outliers or failed simulations. The strength of the claimed “strong correlation” cannot be assessed from the provided text.

    Authors: We agree that statistical measures are needed to substantiate the correlations. In the revised manuscript we have added error bars to all relevant figures, reported Pearson correlation coefficients together with p-values for the relationships between accuracy and both parameter count and training-set size, and inserted explicit text in §2 stating that all completed simulation runs were retained with no post-hoc exclusion of outliers or failures. revision: yes

  2. Referee: [§3 and Table 2] §3 (Results) and Table 2: the 15 models appear clustered in a few architectural families (mostly equivariant GNNs trained on similar organic datasets). No quantitative diversity metric (e.g., pairwise architectural distance or elemental coverage) is supplied, so the claim that parameter count and training-set size are general drivers rather than selection artifacts remains untested.

    Authors: The 15 models comprise essentially all publicly available pretrained MLIPs at the time of submission, so the observed clustering reflects the current state of the field rather than arbitrary selection. To address the concern directly we have added a supplementary table that quantifies architectural diversity (message-passing type, number of layers, invariance/equivariance properties) and elemental coverage across the models. We have also revised the language in §3 to frame the correlations as trends within the representative set of available pretrained models. revision: yes
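The diversity assessment described in this response could be sketched, for instance, as pairwise Jaccard overlap of elemental coverage. The element sets below are hypothetical placeholders, not the paper's supplementary table.

```python
from itertools import combinations

# Hypothetical elemental coverage for three models -- illustrative only.
coverage = {
    "model_A": {"H", "C", "N", "O", "F"},
    "model_B": {"H", "C", "N", "O", "S", "Cl"},
    "model_C": {"H", "C", "N", "O", "F", "S", "Cl", "P", "Br"},
}

def jaccard(a, b):
    """Jaccard similarity of two element sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

# Pairwise overlaps: values near 1.0 indicate the models cover similar chemistry.
for (na, ea), (nb, eb) in combinations(coverage.items(), 2):
    print(f"{na} vs {nb}: {jaccard(ea, eb):.2f}")
```

A cluster of pairwise values near 1.0 would support the referee's concern that the benchmarked models probe similar chemistry, while low values would indicate genuine coverage diversity.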

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with observed correlations

Full rationale

The paper is an empirical benchmark of 15 pretrained MLIPs on accuracy, speed, memory, and stability across molecular systems. Claims that parameter count and training-set size correlate with accuracy, and that explicit Coulomb terms show no benefit, are direct observations from the evaluation results rather than any derivation, fitted prediction, or self-referential construction. No equations, ansatzes, uniqueness theorems, or load-bearing self-citations are present. The study reports measured outcomes on chosen models and systems; representativeness is a scope limitation but does not create circularity in the reported findings.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmarking study with no free parameters, axioms, or invented entities required for the central claims.

pith-pipeline@v0.9.0 · 5423 in / 994 out tokens · 30458 ms · 2026-05-16T11:25:25.921514+00:00 · methodology

