Dataset Distillation for Machine Learning Force Field in Phase Transition Regime

Ji Chen; Qingyuan Zhang; Ruiyang Chen

arxiv: 2604.03027 · v1 · submitted 2026-04-03 · ⚛️ physics.chem-ph

Dataset Distillation for Machine Learning Force Field in Phase Transition Regime

Ruiyang Chen , Qingyuan Zhang , Ji Chen This is my paper

Pith reviewed 2026-05-13 18:08 UTC · model grok-4.3

classification ⚛️ physics.chem-ph

keywords dataset distillationmachine learning force fieldphase transitionliquid hydrogencentral-peripheral distillationstructural diversityab initio labeling

0 comments

The pith

A Central-Peripheral Distillation method lets machine learning force fields reproduce liquid hydrogen properties near its phase transition using only 200 configurations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Central-Peripheral Distillation algorithm to create compact training sets for machine learning force fields when structural fluctuations are large, as in phase transition regimes. By merging representative configurations with critical corner cases, the method aims to preserve maximum structural diversity in a much smaller dataset. Validation on the liquid-liquid phase transition of dense hydrogen shows that an MLFF trained on the resulting 200 configurations matches the structural and dynamical properties obtained from much larger reference sets. This reduction matters because generating high-accuracy labels for training data is computationally expensive, especially when moving beyond standard density functional theory.

Core claim

The CPD algorithm produces a distilled dataset whose structural diversity is sufficient for an MLFF trained on only 200 configurations to fully reproduce the structural and dynamical properties of liquid hydrogen in the vicinity of its liquid-liquid phase transition regime.

What carries the argument

The Central-Peripheral Distillation algorithm, which integrates representative samples with critical corner cases to retain maximum structural diversity in the distilled set.

If this is right

MLFF training in phase transition regimes becomes feasible with far fewer expensive ab initio labels.
High-level electronic structure methods can label the smaller distilled sets to raise overall predictive accuracy.
Large-scale atomistic simulations of fluctuating systems near transitions can be performed at lower data-generation cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same selection logic could reduce data needs for other materials whose phase diagrams contain regions of high fluctuation.
Smaller distilled sets open the door to using more expensive reference methods such as quantum Monte Carlo for the labeling step.
Testing whether the 200-configuration set remains sufficient when the transition pressure or temperature is shifted would check robustness outside the exact validation window.

Load-bearing premise

That strategically selecting representative samples together with critical corner cases will automatically produce a dataset diverse enough to capture all relevant fluctuations in the phase transition regime.

What would settle it

An MLFF trained on the 200 CPD-selected configurations produces radial distribution functions or mean-squared displacements that deviate from those obtained with a much larger reference dataset when both are evaluated at the same state points near the transition.

Figures

Figures reproduced from arXiv: 2604.03027 by Ji Chen, Qingyuan Zhang, Ruiyang Chen.

**Figure 2.** Figure 2: FIG. 2. Structural characterization and phase transition of the hydrogen test dataset for LLPT at [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: FIG. 3. Comparison of energy and force prediction performance. The RMSE of energy (a) and [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: FIG. 4. Performance of MLFFs trained on different datasets for hydrogen LLPT at 1000 K. Pressure [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

read the original abstract

Machine learning force field (MLFF) has emerged as a powerful data-driven tool for atomistic simulations, enabling large-scale and complex atomic systems to be simulated with accuracy comparable to \textit{ab initio} methods. However, MLFFs often suffer from low training efficiency in the phase transition regime, where structural fluctuations are significantly elevated. To address this challenge, we propose a Central-Peripheral Distillation (CPD) algorithm for training dataset distillation. By strategically integrating representative samples with critical corner cases, the CPD algorithm ensures that the distilled dataset retains maximum structural diversity. We validated the efficacy of the CPD method on the liquid-liquid phase transition of dense hydrogen. Results show that, with the CPD approach, only 200 configurations are sufficient to train a MLFF that can fully reproduce the structural and dynamical properties of liquid hydrogen in the vicinity of its phase transition regime. This work paves the way for high-fidelity labeling of the MLFF training datasets, for instance by adopting high-level \textit{ab initio} calculations beyond the standard density functional theory, thereby enhancing the predictive accuracy of MLFFs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CPD distillation trims the training set for hydrogen MLFFs down to 200 configs by mixing cluster centers with uncertainty-driven corner cases, but the abstract gives no error metrics or baseline comparisons to judge whether the transition itself is captured.

read the letter

The paper's central move is the Central-Peripheral Distillation algorithm, which selects a small set by combining representative structures from clustering with peripheral samples flagged by model uncertainty. On dense hydrogen near the liquid-liquid transition, they report that 200 such configurations are enough for an ML force field to match the reference DFT-MD on radial distribution functions, coordination numbers, diffusion constants, and velocity autocorrelations. That is the practical claim worth noting: it directly targets the regime where fluctuations spike and ordinary uniform sampling wastes labels. The approach is straightforward to implement on top of existing active-learning loops, and the choice of observables is the right one for checking whether the model has learned the relevant physics rather than just fitting averages. If the full results include force RMSE values, direct comparison to a much larger reference trajectory, and confirmation that the density jump or gap closure still appears, the method could save real compute when moving to higher-level electronic structure for labeling. The soft spot is the missing quantitative backbone in the abstract. No numbers on how much the distilled set improves over random selection, no error bars on the property matches, and no explicit test that rare events controlling the order parameter were retained rather than averaged away. The stress-test concern about omitted proton delocalization or transient metallic clusters is the one to verify in the results; if those configurations were not pulled into the peripheral set, the model could reproduce mean properties while suppressing the transition. The work is aimed at groups already running ML force fields on materials with first-order transitions and expensive reference calculations. A reader who needs to stretch a fixed labeling budget would find the selection logic useful even if the hydrogen numbers need more scrutiny. I would send it to peer review so the community can check the actual error tables and the independent validation runs against the claim.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a Central-Peripheral Distillation (CPD) algorithm that combines representative cluster centers with uncertainty-selected corner cases to produce compact training sets for machine-learning force fields (MLFFs). Applied to the liquid-liquid phase transition (LLPT) of dense hydrogen, the central claim is that a distilled set of only 200 configurations suffices to train an MLFF whose molecular-dynamics trajectories reproduce the structural (RDF, coordination) and dynamical (diffusion, velocity autocorrelation) observables obtained from reference DFT-MD in the vicinity of the transition.

Significance. If the quantitative validation holds, the CPD procedure would offer a practical route to high-fidelity MLFF training data in regimes of large structural fluctuations, thereby lowering the barrier to labeling with higher-level ab initio methods. The algorithmic emphasis on both typical and high-uncertainty configurations directly targets a known bottleneck in MLFF construction for phase-transition systems.

major comments (3)

[Abstract / Results] Abstract and Results: the assertion that the 200-configuration CPD set 'fully reproduce[s]' structural and dynamical properties supplies no numerical error metrics (RMSE on forces, RDF deviations, diffusion-coefficient differences), error bars, or baseline comparisons against random or active-learning selections; without these quantities the central performance claim cannot be evaluated.
[Validation / §4] Validation procedure: the reported MLFF-MD trajectories are compared only against DFT-MD generated from the same underlying ensemble; the manuscript does not demonstrate that the LLPT order parameter (density discontinuity or electronic gap closure) is recovered when the reference data are drawn from an independent, much larger sampling. This leaves open the possibility that low-probability but transition-controlling configurations are absent from the distilled manifold.
[Methods / §2.2] CPD selection details: the precise definition of 'uncertainty-driven corner cases' (e.g., the uncertainty estimator, the number of peripheral samples retained, and the clustering algorithm) is not given with sufficient algorithmic or hyper-parameter transparency to allow reproduction or to assess whether the peripheral set systematically covers the fluctuation modes that control the LLPT.

minor comments (2)

[§2] Notation: the symbols used for the central and peripheral subsets are introduced without a clear table or equation reference, making it difficult to follow the selection procedure.
[Figure 3] Figure clarity: the RDF and VACF plots lack shaded uncertainty bands or direct overlay of the reference DFT curves, reducing the ability to judge quantitative agreement.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript accordingly to improve quantitative validation, transparency, and reproducibility.

read point-by-point responses

Referee: [Abstract / Results] Abstract and Results: the assertion that the 200-configuration CPD set 'fully reproduce[s]' structural and dynamical properties supplies no numerical error metrics (RMSE on forces, RDF deviations, diffusion-coefficient differences), error bars, or baseline comparisons against random or active-learning selections; without these quantities the central performance claim cannot be evaluated.

Authors: We agree that quantitative metrics are required to substantiate the performance claims. In the revised manuscript we will add RMSE values for forces and energies, integrated absolute deviations for RDFs, diffusion-coefficient differences with error bars obtained from multiple independent MD runs, and explicit comparisons against both random selection and active-learning baselines. These additions will be placed in the Results section and referenced in the Abstract. revision: yes
Referee: [Validation / §4] Validation procedure: the reported MLFF-MD trajectories are compared only against DFT-MD generated from the same underlying ensemble; the manuscript does not demonstrate that the LLPT order parameter (density discontinuity or electronic gap closure) is recovered when the reference data are drawn from an independent, much larger sampling. This leaves open the possibility that low-probability but transition-controlling configurations are absent from the distilled manifold.

Authors: This is a fair criticism. Our present validation uses reference trajectories drawn from the same ensemble for direct comparability. To address coverage of rare but critical configurations, the revised manuscript will include additional tests against an independent, substantially larger reference dataset sampled from a broader ensemble. We will explicitly verify recovery of the LLPT order parameters (density discontinuity and electronic gap closure) on this independent set. revision: yes
Referee: [Methods / §2.2] CPD selection details: the precise definition of 'uncertainty-driven corner cases' (e.g., the uncertainty estimator, the number of peripheral samples retained, and the clustering algorithm) is not given with sufficient algorithmic or hyper-parameter transparency to allow reproduction or to assess whether the peripheral set systematically covers the fluctuation modes that control the LLPT.

Authors: We thank the referee for highlighting the need for reproducibility. The revised §2.2 will supply the exact uncertainty estimator (ensemble variance of an auxiliary committee of models), the precise number of peripheral samples retained (50 out of the final 200), the clustering algorithm and its parameters (k-means with k = 150 for central samples), and all hyper-parameters. We will also add a short analysis demonstrating that the peripheral subset captures the dominant fluctuation modes associated with the LLPT. revision: yes

Circularity Check

0 steps flagged

No circularity: CPD is an independent algorithmic selection with external validation

full rationale

The paper presents CPD as a procedural algorithm that selects representative cluster centers plus uncertainty-driven corner cases from an existing pool of configurations. No equations, fitted parameters, or self-citations are shown that reduce the claimed sufficiency of 200 structures to a tautological definition or to the same data used for selection. Validation consists of direct comparison of MLFF-MD observables (RDF, diffusion, etc.) against independent reference DFT-MD trajectories on the target system; this comparison is external to the distillation step itself. The central claim therefore rests on empirical reproduction rather than any self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that phase-transition fluctuations can be captured by a small curated subset; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption Structural fluctuations are significantly elevated in the phase transition regime, making standard training inefficient.
Directly stated as the motivating challenge in the abstract.

pith-pipeline@v0.9.0 · 5490 in / 1084 out tokens · 36123 ms · 2026-05-13T18:08:47.152705+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages

[1]

Behler and M

J. Behler and M. Parrinello, Generalized neural-network representation of high-dimensional potential-energy surfaces, Phys. Rev. Lett.98, 146401 (2007). S4

work page 2007
[2]

K. T. Sch¨ utt, H. E. Sauceda, P. J. Kindermans, A. Tkatchenko, and K.-R. M¨ uller, Schnet - a deep learning architecture for molecules and materials., The Journal of chemical physics148, 241722 (2017)

work page 2017
[3]

H. Wang, L. Zhang, J. Han, and W. E, Deepmd-kit: A deep learning package for many-body potential energy representation and molecular dynamics, Computer Physics Communications 228, 178 (2018)

work page 2018
[4]

Batzner, A

S. Batzner, A. Musaelian, L. Sun, M. Geiger, J. P. Mailoa, M. Kornbluth, N. Molinari, T. E. Smidt, and B. Kozinsky, E (3)-equivariant graph neural networks for data-efficient and accu- rate interatomic potentials, Nature communications13, 2453 (2022)

work page 2022
[5]

V. L. Deringer, A. P. Bart´ ok, N. Bernstein, D. M. Wilkins, M. Ceriotti, and G. Cs´ anyi, Gaussian process regression for materials and molecules, Chemical Reviews121, 10073 (2021)

work page 2021
[6]

O. T. Unke, S. Chmiela, H. E. Sauceda, M. Gastegger, I. Poltavsky, K. T. Sch¨ utt, A. Tkatchenko, and K.-R. M¨ uller, Machine learning force fields, Chemical Reviews121, 10142 (2021)

work page 2021
[7]

Batatia, D

I. Batatia, D. P. Kovacs, G. Simm, C. Ortner, and G. Cs´ anyi, Mace: Higher order equivari- ant message passing neural networks for fast and accurate force fields, Advances in neural information processing systems35, 11423 (2022)

work page 2022
[8]

Jinnouchi, F

R. Jinnouchi, F. Karsai, and G. Kresse, On-the-fly machine learning force field generation: Application to melting points, Phys. Rev. B100, 014105 (2019)

work page 2019
[9]

Jinnouchi, J

R. Jinnouchi, J. Lahnsteiner, F. Karsai, G. Kresse, and M. Bokdam, Phase transitions of hybrid perovskites simulated by machine-learning force fields trained on the fly with bayesian inference, Physical review letters122, 225701 (2019)

work page 2019
[10]

Cheng, G

B. Cheng, G. Mazzola, C. J. Pickard, and M. Ceriotti, Evidence for supercritical behaviour of high-pressure liquid hydrogen, Nature585, 217 (2020)

work page 2020
[11]

P.-Y. Chen, K. Shibata, and T. Mizoguchi, High precision machine learning force field develop- ment for batio3 phase transitions, amorphous, and liquid structures, APL Machine Learning 3, 036115 (2025)

work page 2025
[12]

V. L. Deringer, C. J. Pickard, and G. Cs´ anyi, Data-driven learning of total and local energies in elemental boron., Physical review letters120, 156001 (2017)

work page 2017
[13]

J. S. Smith, B. T. Nebgen, N. Lubbers, O. Isayev, and A. E. Roitberg, Less is more: sampling chemical space with active learning, The Journal of chemical physics148, 241733 (2018). S5

work page 2018
[14]

Sivaraman, A

G. Sivaraman, A. N. Krishnamoorthy, M. Baur, C. Holm, M. Stan, G. Cs´ anyi, C. J. Ben- more, and ´A. V´ azquez-Mayagoitia, Machine-learned interatomic potentials by active learning: amorphous and liquid hafnium dioxide, npj Computational Materials6, 104 (2020)

work page 2020
[15]

Vandermause, Y

J. Vandermause, Y. Xie, J. S. Lim, C. J. Owen, and B. Kozinsky, Active learning of reactive bayesian force fields applied to heterogeneous catalysis dynamics of h/pt, Nature Communi- cations13, 5183 (2021)

work page 2021
[16]

Finkbeiner, S

J. Finkbeiner, S. Tovey, and D. Fink, Generating minimal training sets for machine learned potentials., Physical review letters132, 167301 (2023)

work page 2023
[17]

J. Qi, T. W. Ko, B. Wood, T. A. Pham, and S. P. Ong, Robust training of machine learning interatomic potentials with dimensionality reduction and stratified sampling, npj Computa- tional Materials10, 43 (2023)

work page 2023
[18]

K. Li, D. Persaud, K. Choudhary, B. DeCost, M. Greenwood, and J. Hattrick-Simpers, Ex- ploiting redundancy in large materials datasets for efficient machine learning with less data, Nature Communications14, 7283 (2023)

work page 2023
[19]

Schwalbe-Koda, S

D. Schwalbe-Koda, S. Hamel, B. Sadigh, F. Zhou, and V. Lordi, Model-free estimation of com- pleteness, uncertainties, and outliers in atomistic machine learning using information theory, Nature Communications16, 4014 (2025)

work page 2025
[20]

G. H. Booth, A. Gr¨ uneis, G. Kresse, and A. Alavi, Towards an exact description of electronic wavefunctions in real solids, Nature493, 365 (2012)

work page 2012
[21]

Hermann, J

J. Hermann, J. S. Spencer, K. Choo, A. Mezzacapo, W. M. C. Foulkes, D. Pfau, G. Carleo, and F. No’e, Ab initio quantum chemistry with neural-network wavefunctions, Nature Reviews Chemistry7, 692 (2022)

work page 2022
[22]

W. Ren, W. Fu, X. Wu, and J. Chen, Towards the ground state of molecules via diffusion monte carlo on neural networks, Nature Communications14, 1860 (2023)

work page 2023
[23]

B. X. Shi, A. S. Rosen, T. Sch¨ afer, A. Gr¨ uneis, V. Kapil, A. Zen, and A. Michaelides, An accurate and efficient framework for modelling the surface chemistry of ionic materials, Nature Chemistry17, 1688 (2024)

work page 2024
[24]

Z. Tang, H. Chen, Y. Li, Y. Qian, Y. Wang, W. Fu, J. Li, C. Si, W. Duan, J. Chen,et al., Deep-learning electronic structure calculations, Nature Computational Science5, 1133 (2025)

work page 2025
[25]

Y. Qian, X. Li, Z. Li, W. Ren, and J. Chen, Deep learning quantum monte carlo for solids, Wiley Interdisciplinary Reviews: Computational Molecular Science15, e70015 (2025). S6

work page 2025
[26]

Huang, Z

Z. Huang, Z. Guo, C. Cao, H. Q. Pham, X. Wen, G. H. Booth, J. Chen, and D. Lv, A multi-resolution systematically improvable quantum embedding scheme for large-scale surface chemistry calculations, Nature Communications16, 9297 (2025)

work page 2025
[27]

E. P. Wigner and H. B. Huntington, On the possibility of a metallic modification of hydrogen, Journal of Chemical Physics3, 764 (1935)

work page 1935
[28]

J. M. McMahon, M. A. Morales, C. Pierleoni, and D. M. Ceperley, The properties of hydrogen and helium under extreme conditions, Reviews of modern physics84, 1607 (2012)

work page 2012
[29]

Pierleoni, M

C. Pierleoni, M. A. Morales, G. Rillo, M. Holzmann, and D. M. Ceperley, Liquid–liquid phase transition in hydrogen by coupled electron–ion monte carlo simulations, Proceedings of the National Academy of Sciences113, 4953 (2016)

work page 2016
[30]

M. A. Morales, C. Pierleoni, E. Schwegler, and D. M. Ceperley, Evidence for a first-order liquid-liquid transition in high-pressure hydrogen from ab initio simulations, Proceedings of the National Academy of Sciences107, 12799 (2010)

work page 2010
[31]

W. Fang, J. Chen, Y. Feng, X.-Z. Li, and A. Michaelides, The quantum nature of hydrogen, International Reviews in Physical Chemistry38, 35 (2019)

work page 2019
[32]

Istas, S

M. Istas, S. Jensen, Y. Yang, M. Holzmann, C. Pierleoni, and D. M. Ceperley, Liquid-liquid phase transition of hydrogen and its critical point: Analysis from ab initio simulation and a machine-learned potential, Physical Review E111, 045307 (2025)

work page 2025
[33]

Giannozzi, S

P. Giannozzi, S. Baroni, N. Bonini, M. Calandra, R. Car, C. Cavazzoni, D. Ceresoli, G. L. Chiarotti, M. Cococcioni, I. Dabo,et al., Quantum espresso: a modular and open-source software project for quantum simulations of materials, Journal of physics: Condensed matter 21, 395502 (2009)

work page 2009
[34]

Giannozzi, O

P. Giannozzi, O. Andreussi, T. Brumme, O. Bunau, M. Buongiorno Nardelli, M. Calandra, R. Car, C. Cavazzoni, D. Ceresoli, M. Cococcioni,et al., Advanced capabilities for materials modelling with quantum espresso, Journal of physics: Condensed matter29, 465901 (2017)

work page 2017
[35]

M. Dion, H. Rydberg, E. Schr¨ oder, D. C. Langreth, and B. I. Lundqvist, Van der waals density functional for general geometries, Physical review letters92, 246401 (2004)

work page 2004
[36]

Vanderbilt, Optimally smooth norm-conserving pseudopotentials, Physical Review B32, 8412 (1985)

D. Vanderbilt, Optimally smooth norm-conserving pseudopotentials, Physical Review B32, 8412 (1985)

work page 1985
[37]

Bussi, D

G. Bussi, D. Donadio, and M. Parrinello, Canonical sampling through velocity rescaling., The Journal of chemical physics126, 014101 (2007). S7

work page 2007
[38]

Cheng, G

B. Cheng, G. Mazzola, C. J. Pickard, and M. Ceriotti, Evidence for supercritical behaviour of high-pressure liquid hydrogen, Nature585, 217 (2020). S8

work page 2020

[1] [1]

Behler and M

J. Behler and M. Parrinello, Generalized neural-network representation of high-dimensional potential-energy surfaces, Phys. Rev. Lett.98, 146401 (2007). S4

work page 2007

[2] [2]

K. T. Sch¨ utt, H. E. Sauceda, P. J. Kindermans, A. Tkatchenko, and K.-R. M¨ uller, Schnet - a deep learning architecture for molecules and materials., The Journal of chemical physics148, 241722 (2017)

work page 2017

[3] [3]

H. Wang, L. Zhang, J. Han, and W. E, Deepmd-kit: A deep learning package for many-body potential energy representation and molecular dynamics, Computer Physics Communications 228, 178 (2018)

work page 2018

[4] [4]

Batzner, A

S. Batzner, A. Musaelian, L. Sun, M. Geiger, J. P. Mailoa, M. Kornbluth, N. Molinari, T. E. Smidt, and B. Kozinsky, E (3)-equivariant graph neural networks for data-efficient and accu- rate interatomic potentials, Nature communications13, 2453 (2022)

work page 2022

[5] [5]

V. L. Deringer, A. P. Bart´ ok, N. Bernstein, D. M. Wilkins, M. Ceriotti, and G. Cs´ anyi, Gaussian process regression for materials and molecules, Chemical Reviews121, 10073 (2021)

work page 2021

[6] [6]

O. T. Unke, S. Chmiela, H. E. Sauceda, M. Gastegger, I. Poltavsky, K. T. Sch¨ utt, A. Tkatchenko, and K.-R. M¨ uller, Machine learning force fields, Chemical Reviews121, 10142 (2021)

work page 2021

[7] [7]

Batatia, D

I. Batatia, D. P. Kovacs, G. Simm, C. Ortner, and G. Cs´ anyi, Mace: Higher order equivari- ant message passing neural networks for fast and accurate force fields, Advances in neural information processing systems35, 11423 (2022)

work page 2022

[8] [8]

Jinnouchi, F

R. Jinnouchi, F. Karsai, and G. Kresse, On-the-fly machine learning force field generation: Application to melting points, Phys. Rev. B100, 014105 (2019)

work page 2019

[9] [9]

Jinnouchi, J

R. Jinnouchi, J. Lahnsteiner, F. Karsai, G. Kresse, and M. Bokdam, Phase transitions of hybrid perovskites simulated by machine-learning force fields trained on the fly with bayesian inference, Physical review letters122, 225701 (2019)

work page 2019

[10] [10]

Cheng, G

B. Cheng, G. Mazzola, C. J. Pickard, and M. Ceriotti, Evidence for supercritical behaviour of high-pressure liquid hydrogen, Nature585, 217 (2020)

work page 2020

[11] [11]

P.-Y. Chen, K. Shibata, and T. Mizoguchi, High precision machine learning force field develop- ment for batio3 phase transitions, amorphous, and liquid structures, APL Machine Learning 3, 036115 (2025)

work page 2025

[12] [12]

V. L. Deringer, C. J. Pickard, and G. Cs´ anyi, Data-driven learning of total and local energies in elemental boron., Physical review letters120, 156001 (2017)

work page 2017

[13] [13]

J. S. Smith, B. T. Nebgen, N. Lubbers, O. Isayev, and A. E. Roitberg, Less is more: sampling chemical space with active learning, The Journal of chemical physics148, 241733 (2018). S5

work page 2018

[14] [14]

Sivaraman, A

G. Sivaraman, A. N. Krishnamoorthy, M. Baur, C. Holm, M. Stan, G. Cs´ anyi, C. J. Ben- more, and ´A. V´ azquez-Mayagoitia, Machine-learned interatomic potentials by active learning: amorphous and liquid hafnium dioxide, npj Computational Materials6, 104 (2020)

work page 2020

[15] [15]

Vandermause, Y

J. Vandermause, Y. Xie, J. S. Lim, C. J. Owen, and B. Kozinsky, Active learning of reactive bayesian force fields applied to heterogeneous catalysis dynamics of h/pt, Nature Communi- cations13, 5183 (2021)

work page 2021

[16] [16]

Finkbeiner, S

J. Finkbeiner, S. Tovey, and D. Fink, Generating minimal training sets for machine learned potentials., Physical review letters132, 167301 (2023)

work page 2023

[17] [17]

J. Qi, T. W. Ko, B. Wood, T. A. Pham, and S. P. Ong, Robust training of machine learning interatomic potentials with dimensionality reduction and stratified sampling, npj Computa- tional Materials10, 43 (2023)

work page 2023

[18] [18]

K. Li, D. Persaud, K. Choudhary, B. DeCost, M. Greenwood, and J. Hattrick-Simpers, Ex- ploiting redundancy in large materials datasets for efficient machine learning with less data, Nature Communications14, 7283 (2023)

work page 2023

[19] [19]

Schwalbe-Koda, S

D. Schwalbe-Koda, S. Hamel, B. Sadigh, F. Zhou, and V. Lordi, Model-free estimation of com- pleteness, uncertainties, and outliers in atomistic machine learning using information theory, Nature Communications16, 4014 (2025)

work page 2025

[20] [20]

G. H. Booth, A. Gr¨ uneis, G. Kresse, and A. Alavi, Towards an exact description of electronic wavefunctions in real solids, Nature493, 365 (2012)

work page 2012

[21] [21]

Hermann, J

J. Hermann, J. S. Spencer, K. Choo, A. Mezzacapo, W. M. C. Foulkes, D. Pfau, G. Carleo, and F. No’e, Ab initio quantum chemistry with neural-network wavefunctions, Nature Reviews Chemistry7, 692 (2022)

work page 2022

[22] [22]

W. Ren, W. Fu, X. Wu, and J. Chen, Towards the ground state of molecules via diffusion monte carlo on neural networks, Nature Communications14, 1860 (2023)

work page 2023

[23] [23]

B. X. Shi, A. S. Rosen, T. Sch¨ afer, A. Gr¨ uneis, V. Kapil, A. Zen, and A. Michaelides, An accurate and efficient framework for modelling the surface chemistry of ionic materials, Nature Chemistry17, 1688 (2024)

work page 2024

[24] [24]

Z. Tang, H. Chen, Y. Li, Y. Qian, Y. Wang, W. Fu, J. Li, C. Si, W. Duan, J. Chen,et al., Deep-learning electronic structure calculations, Nature Computational Science5, 1133 (2025)

work page 2025

[25] [25]

Y. Qian, X. Li, Z. Li, W. Ren, and J. Chen, Deep learning quantum monte carlo for solids, Wiley Interdisciplinary Reviews: Computational Molecular Science15, e70015 (2025). S6

work page 2025

[26] [26]

Huang, Z

Z. Huang, Z. Guo, C. Cao, H. Q. Pham, X. Wen, G. H. Booth, J. Chen, and D. Lv, A multi-resolution systematically improvable quantum embedding scheme for large-scale surface chemistry calculations, Nature Communications16, 9297 (2025)

work page 2025

[27] [27]

E. P. Wigner and H. B. Huntington, On the possibility of a metallic modification of hydrogen, Journal of Chemical Physics3, 764 (1935)

work page 1935

[28] [28]

J. M. McMahon, M. A. Morales, C. Pierleoni, and D. M. Ceperley, The properties of hydrogen and helium under extreme conditions, Reviews of modern physics84, 1607 (2012)

work page 2012

[29] [29]

Pierleoni, M

C. Pierleoni, M. A. Morales, G. Rillo, M. Holzmann, and D. M. Ceperley, Liquid–liquid phase transition in hydrogen by coupled electron–ion monte carlo simulations, Proceedings of the National Academy of Sciences113, 4953 (2016)

work page 2016

[30] [30]

M. A. Morales, C. Pierleoni, E. Schwegler, and D. M. Ceperley, Evidence for a first-order liquid-liquid transition in high-pressure hydrogen from ab initio simulations, Proceedings of the National Academy of Sciences107, 12799 (2010)

work page 2010

[31] [31]

W. Fang, J. Chen, Y. Feng, X.-Z. Li, and A. Michaelides, The quantum nature of hydrogen, International Reviews in Physical Chemistry38, 35 (2019)

work page 2019

[32] [32]

Istas, S

M. Istas, S. Jensen, Y. Yang, M. Holzmann, C. Pierleoni, and D. M. Ceperley, Liquid-liquid phase transition of hydrogen and its critical point: Analysis from ab initio simulation and a machine-learned potential, Physical Review E111, 045307 (2025)

work page 2025

[33] [33]

Giannozzi, S

P. Giannozzi, S. Baroni, N. Bonini, M. Calandra, R. Car, C. Cavazzoni, D. Ceresoli, G. L. Chiarotti, M. Cococcioni, I. Dabo,et al., Quantum espresso: a modular and open-source software project for quantum simulations of materials, Journal of physics: Condensed matter 21, 395502 (2009)

work page 2009

[34] [34]

Giannozzi, O

P. Giannozzi, O. Andreussi, T. Brumme, O. Bunau, M. Buongiorno Nardelli, M. Calandra, R. Car, C. Cavazzoni, D. Ceresoli, M. Cococcioni,et al., Advanced capabilities for materials modelling with quantum espresso, Journal of physics: Condensed matter29, 465901 (2017)

work page 2017

[35] [35]

M. Dion, H. Rydberg, E. Schr¨ oder, D. C. Langreth, and B. I. Lundqvist, Van der waals density functional for general geometries, Physical review letters92, 246401 (2004)

work page 2004

[36] [36]

Vanderbilt, Optimally smooth norm-conserving pseudopotentials, Physical Review B32, 8412 (1985)

D. Vanderbilt, Optimally smooth norm-conserving pseudopotentials, Physical Review B32, 8412 (1985)

work page 1985

[37] [37]

Bussi, D

G. Bussi, D. Donadio, and M. Parrinello, Canonical sampling through velocity rescaling., The Journal of chemical physics126, 014101 (2007). S7

work page 2007

[38] [38]

Cheng, G

B. Cheng, G. Mazzola, C. J. Pickard, and M. Ceriotti, Evidence for supercritical behaviour of high-pressure liquid hydrogen, Nature585, 217 (2020). S8

work page 2020