pith. machine review for the scientific record.

arxiv: 2601.16331 · v2 · submitted 2026-01-22 · ⚛️ physics.chem-ph

Recognition: no theorem link

Accuracy and Efficiency Benchmarks of Pretrained Machine Learning Potentials for Molecular Simulations

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 11:25 UTC · model grok-4.3

classification ⚛️ physics.chem-ph
keywords machine learning interatomic potentials · MLIP benchmarks · molecular simulations · model accuracy · computational efficiency · training set size · Coulomb interactions · pretrained models

The pith

Benchmarks of 15 pretrained MLIPs show accuracy rises with parameter count and training set size, with no gain from explicit Coulomb terms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper evaluates fifteen pretrained machine learning interatomic potentials on molecular systems for accuracy against reference calculations, computational speed, memory consumption, and the ability to maintain stable simulations over time. The evaluation reveals that both higher parameter counts and larger training datasets correlate strongly with better accuracy across the tested models. Adding explicit Coulomb energy terms to the models produced no measurable improvement in accuracy. Speed and memory use turned out to depend as much on the specific model architecture as on the total number of parameters. These findings supply concrete guidance for selecting among existing pretrained models and for prioritizing design choices in future development.

Core claim

Fifteen pretrained MLIPs were benchmarked on accuracy for molecular systems, speed, memory requirements, and long-term simulation stability. Accuracy showed strong positive correlation with both the number of model parameters and the size of the training set. Models that included explicit Coulomb electrostatic energy terms did not outperform those that learned electrostatics implicitly. Speed and memory consumption were shaped comparably by architecture details and by overall model scale.
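As a hedged sketch of the kind of correlation analysis this claim implies, the snippet below computes a Pearson coefficient between log-scaled parameter count and model error. All numbers here are illustrative placeholders, not the paper's data; since error falls as parameters grow, the expected coefficient is strongly negative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical benchmark rows: (parameter count, force error) -- made-up values.
params = [0.5e6, 2e6, 9e6, 30e6, 150e6]
mae = [12.0, 9.5, 6.1, 4.8, 3.2]

# Correlate error against log10(parameters), since model sizes span orders of magnitude.
r = pearson_r([math.log10(p) for p in params], mae)
print(f"r = {r:.3f}")  # strongly negative: more parameters, lower error
```

Correlating against the logarithm of the parameter count is the usual choice when the quantity ranges over several orders of magnitude; a raw-scale Pearson coefficient would be dominated by the largest models.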

What carries the argument

Comparative benchmarking of fifteen pretrained machine learning interatomic potentials on standardized accuracy, efficiency, memory, and stability metrics for molecular simulations.

If this is right

  • Increasing the number of parameters and the size of the training set reliably improves accuracy.
  • Explicit Coulomb energy terms confer no accuracy advantage in the tested molecular systems.
  • Architecture choices influence speed and memory use at least as strongly as raw model size.
  • The observed correlations supply a practical basis for choosing among available pretrained MLIPs.
  • Stable molecular dynamics simulations remain feasible with the evaluated models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Future MLIP development can focus resources on scaling training data and parameter counts rather than on hand-crafted electrostatic terms.
  • The same scaling trends may appear in other interatomic potential families or when applied to more complex reactive chemistry.
  • Targeted architecture improvements could support larger models without proportional increases in memory or slowdowns.
  • Repeating the benchmark on condensed-phase or reactive systems would test how far the size and data correlations extend.

Load-bearing premise

The fifteen selected pretrained MLIPs together with the chosen molecular test systems and performance metrics are representative enough to reveal general trends about what controls accuracy and efficiency.

What would settle it

A pretrained MLIP with relatively few parameters that nevertheless matches or exceeds the accuracy of the largest models on the same test systems, or a model with explicit Coulomb terms that clearly outperforms otherwise similar models without them, would undercut the reported correlations; a single counterexample would weaken them rather than strictly falsify them.

Figures

Figures reproduced from arXiv: 2601.16331 by Evan Pretti, Peter Eastman, Thomas E. Markland.

Figure 1. Model error versus (a) number of parameters and (b) number of training samples. [PITH_FULL_IMAGE:figures/full_fig_p008_1.png]
Figure 2. Speed versus number of atoms. [PITH_FULL_IMAGE:figures/full_fig_p009_2.png]
Figure 3. Memory required versus number of atoms (per-model table for systems of 50 to 21,384 atoms, covering AIMNet2, AceFF-1.1, AceFF-2.0, Egret-1, FeNNix-Bio1(M), FeNNix-Bio1(S), MACE-M…; data truncated in extraction).
Figure 4. Model error on the test set versus speed on (a) a 50-atom molecule and (b) a 2661-atom molecule. [PITH_FULL_IMAGE:figures/full_fig_p011_4.png]
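The speed-versus-system-size panels invite a simple scaling check. Below is a minimal, hypothetical timing harness for a naive O(N²) pairwise energy loop; it is a stand-in for a force/energy call, not any of the benchmarked MLIPs or the paper's harness.

```python
import time
import random

def pairwise_energy(coords, cutoff=0.5):
    """Naive O(N^2) pair sum over points within a cutoff -- a toy energy call."""
    e = 0.0
    n = len(coords)
    for i in range(n):
        xi, yi, zi = coords[i]
        for j in range(i + 1, n):
            xj, yj, zj = coords[j]
            r2 = (xi - xj) ** 2 + (yi - yj) ** 2 + (zi - zj) ** 2
            if r2 < cutoff * cutoff:
                e += 1.0 / (r2 + 1e-12)  # soft 1/r^2 repulsion, arbitrary form
    return e

# Time the call at a few system sizes to see the scaling trend.
for n in [50, 100, 200]:
    coords = [(random.random(), random.random(), random.random()) for _ in range(n)]
    t0 = time.perf_counter()
    pairwise_energy(coords)
    dt = time.perf_counter() - t0
    print(f"{n:4d} atoms: {dt * 1e3:.2f} ms")
```

Real MLIPs avoid the quadratic loop with neighbor lists and batched tensor operations, which is one reason the paper can find that architecture matters as much as raw size for speed and memory.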
Original abstract

The rapid development of pretrained Machine Learning Interatomic Potentials (MLIPs) that cover a wide range of molecular species has made it challenging to select the best model for a given application. We benchmark 15 pretrained MLIPs, evaluating each one on accuracy, speed, memory use, and ability to produce stable simulations. This provides an objective basis for practitioners to select the most appropriate MLIP for their own simulations, and offers insight into which factors most strongly influence model accuracy. We find that the number of model parameters and the size of the training set are both strongly correlated with accuracy, but observe no benefit from including explicit Coulomb energy terms. Speed and memory use are determined as much by the model architecture as by the size of the model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript benchmarks 15 pretrained MLIPs on accuracy, speed, memory consumption, and long-term simulation stability across molecular systems. It reports strong positive correlations between accuracy and both model parameter count and training-set size, no accuracy gain from explicit Coulomb terms, and that runtime/memory depend on architecture as well as size.

Significance. If the model selection and test-suite choices prove representative, the work supplies immediately usable selection criteria for practitioners and identifies two controllable design variables (parameter count, data volume) that future MLIP developers can target.

major comments (2)
  1. [Abstract and §2] Abstract and §2 (Methods): the central correlations are stated without reported error bars, p-values, or explicit exclusion criteria for outliers or failed simulations. The strength of the claimed “strong correlation” cannot be assessed from the provided text.
  2. [§3 and Table 2] §3 (Results) and Table 2: the 15 models appear clustered in a few architectural families (mostly equivariant GNNs trained on similar organic datasets). No quantitative diversity metric (e.g., pairwise architectural distance or elemental coverage) is supplied, so the claim that parameter count and training-set size are general drivers rather than selection artifacts remains untested.
minor comments (2)
  1. [Figure 3] Figure 3: axis labels and units for memory usage are missing; add them for reproducibility.
  2. [§4] §4 (Discussion): the statement that “no benefit from explicit Coulomb terms” should be qualified by the specific charge models and cutoff radii used in the compared potentials.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our benchmark study of pretrained MLIPs. We have revised the manuscript to address the major comments by adding statistical rigor to the correlation analysis and a quantitative assessment of model diversity. These changes strengthen the presentation without altering the core findings.

Point-by-point responses
  1. Referee: [Abstract and §2] Abstract and §2 (Methods): the central correlations are stated without reported error bars, p-values, or explicit exclusion criteria for outliers or failed simulations. The strength of the claimed “strong correlation” cannot be assessed from the provided text.

    Authors: We agree that statistical measures are needed to substantiate the correlations. In the revised manuscript we have added error bars to all relevant figures, reported Pearson correlation coefficients together with p-values for the relationships between accuracy and both parameter count and training-set size, and inserted explicit text in §2 stating that all completed simulation runs were retained with no post-hoc exclusion of outliers or failures. revision: yes

  2. Referee: [§3 and Table 2] §3 (Results) and Table 2: the 15 models appear clustered in a few architectural families (mostly equivariant GNNs trained on similar organic datasets). No quantitative diversity metric (e.g., pairwise architectural distance or elemental coverage) is supplied, so the claim that parameter count and training-set size are general drivers rather than selection artifacts remains untested.

    Authors: The 15 models comprise essentially all publicly available pretrained MLIPs at the time of submission, so the observed clustering reflects the current state of the field rather than arbitrary selection. To address the concern directly we have added a supplementary table that quantifies architectural diversity (message-passing type, number of layers, invariance/equivariance properties) and elemental coverage across the models. We have also revised the language in §3 to frame the correlations as trends within the representative set of available pretrained models. revision: yes
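The diversity assessment described in this response could be sketched, for instance, as pairwise Jaccard overlap of elemental coverage. The element sets below are hypothetical placeholders, not the paper's supplementary table.

```python
from itertools import combinations

# Hypothetical elemental coverage for three models -- illustrative only.
coverage = {
    "model_A": {"H", "C", "N", "O", "F"},
    "model_B": {"H", "C", "N", "O", "S", "Cl"},
    "model_C": {"H", "C", "N", "O", "F", "S", "Cl", "P", "Br"},
}

def jaccard(a, b):
    """Jaccard similarity of two element sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

# Pairwise overlaps: values near 1.0 indicate the models cover similar chemistry.
for (na, ea), (nb, eb) in combinations(coverage.items(), 2):
    print(f"{na} vs {nb}: {jaccard(ea, eb):.2f}")
```

A cluster of pairwise values near 1.0 would support the referee's concern that the benchmarked models probe similar chemistry, while low values would indicate genuine coverage diversity.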

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with observed correlations

Full rationale

The paper is an empirical benchmark of 15 pretrained MLIPs on accuracy, speed, memory, and stability across molecular systems. Claims that parameter count and training-set size correlate with accuracy, and that explicit Coulomb terms show no benefit, are direct observations from the evaluation results rather than any derivation, fitted prediction, or self-referential construction. No equations, ansatzes, uniqueness theorems, or load-bearing self-citations are present. The study reports measured outcomes on chosen models and systems; representativeness is a scope limitation but does not create circularity in the reported findings.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmarking study with no free parameters, axioms, or invented entities required for the central claims.

pith-pipeline@v0.9.0 · 5423 in / 994 out tokens · 30458 ms · 2026-05-16T11:25:25.921514+00:00 · methodology

