Dataset Distillation for Machine Learning Force Field in Phase Transition Regime
Pith reviewed 2026-05-13 18:08 UTC · model grok-4.3
The pith
A Central-Peripheral Distillation method lets machine learning force fields reproduce liquid hydrogen properties near its phase transition using only 200 configurations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The CPD algorithm produces a distilled dataset whose structural diversity is sufficient for an MLFF trained on only 200 configurations to fully reproduce the structural and dynamical properties of liquid hydrogen in the vicinity of its liquid-liquid phase transition regime.
What carries the argument
The Central-Peripheral Distillation algorithm, which integrates representative samples with critical corner cases to retain maximum structural diversity in the distilled set.
If this is right
- MLFF training in phase transition regimes becomes feasible with far fewer expensive ab initio labels.
- High-level electronic structure methods can label the smaller distilled sets to raise overall predictive accuracy.
- Large-scale atomistic simulations of fluctuating systems near transitions can be performed at lower data-generation cost.
Where Pith is reading between the lines
- The same selection logic could reduce data needs for other materials whose phase diagrams contain regions of high fluctuation.
- Smaller distilled sets open the door to using more expensive reference methods such as quantum Monte Carlo for the labeling step.
- Testing whether the 200-configuration set remains sufficient when the transition pressure or temperature is shifted would check robustness outside the exact validation window.
Load-bearing premise
That strategically selecting representative samples together with critical corner cases will automatically produce a dataset diverse enough to capture all relevant fluctuations in the phase transition regime.
What would settle it
An MLFF trained on the 200 CPD-selected configurations produces radial distribution functions or mean-squared displacements that deviate from those obtained with a much larger reference dataset when both are evaluated at the same state points near the transition.
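The RDF half of this test can be sketched in a few lines. This is a minimal illustration, not the paper's analysis code: it assumes a single atomic species in a cubic periodic box, a cutoff `r_max` no larger than half the box length, and the function names `radial_distribution` and `integrated_abs_deviation` are hypothetical.

```python
import numpy as np

def radial_distribution(positions, box_length, r_max, n_bins):
    """g(r) for one configuration in a cubic periodic box (minimum-image convention)."""
    positions = np.asarray(positions, dtype=float)
    n = len(positions)
    diff = positions[:, None, :] - positions[None, :, :]
    diff -= box_length * np.round(diff / box_length)      # wrap to nearest periodic image
    dist = np.linalg.norm(diff, axis=-1)[np.triu_indices(n, k=1)]
    counts, edges = np.histogram(dist, bins=n_bins, range=(0.0, r_max))
    r = 0.5 * (edges[1:] + edges[:-1])
    shell_volume = 4.0 / 3.0 * np.pi * (edges[1:] ** 3 - edges[:-1] ** 3)
    density = n / box_length ** 3
    # Each unordered pair is counted once, so the ideal-gas count per shell
    # is approximately (n / 2) * density * shell_volume.
    g = counts / (0.5 * n * density * shell_volume)
    return r, g

def integrated_abs_deviation(r, g_test, g_ref):
    """Trapezoidal integral of |g_test - g_ref| over r."""
    y = np.abs(np.asarray(g_test) - np.asarray(g_ref))
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(r)))
```

In practice g(r) would be averaged over many frames of each trajectory before the deviation is taken, and the same comparison would be repeated at every state point near the transition.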
Original abstract
Machine learning force field (MLFF) has emerged as a powerful data-driven tool for atomistic simulations, enabling large-scale and complex atomic systems to be simulated with accuracy comparable to ab initio methods. However, MLFFs often suffer from low training efficiency in the phase transition regime, where structural fluctuations are significantly elevated. To address this challenge, we propose a Central-Peripheral Distillation (CPD) algorithm for training dataset distillation. By strategically integrating representative samples with critical corner cases, the CPD algorithm ensures that the distilled dataset retains maximum structural diversity. We validated the efficacy of the CPD method on the liquid-liquid phase transition of dense hydrogen. Results show that, with the CPD approach, only 200 configurations are sufficient to train a MLFF that can fully reproduce the structural and dynamical properties of liquid hydrogen in the vicinity of its phase transition regime. This work paves the way for high-fidelity labeling of the MLFF training datasets, for instance by adopting high-level ab initio calculations beyond the standard density functional theory, thereby enhancing the predictive accuracy of MLFFs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a Central-Peripheral Distillation (CPD) algorithm that combines representative cluster centers with uncertainty-selected corner cases to produce compact training sets for machine-learning force fields (MLFFs). Applied to the liquid-liquid phase transition (LLPT) of dense hydrogen, the central claim is that a distilled set of only 200 configurations suffices to train an MLFF whose molecular-dynamics trajectories reproduce the structural (RDF, coordination) and dynamical (diffusion, velocity autocorrelation) observables obtained from reference DFT-MD in the vicinity of the transition.
Significance. If the quantitative validation holds, the CPD procedure would offer a practical route to high-fidelity MLFF training data in regimes of large structural fluctuations, thereby lowering the barrier to labeling with higher-level ab initio methods. The algorithmic emphasis on both typical and high-uncertainty configurations directly targets a known bottleneck in MLFF construction for phase-transition systems.
major comments (3)
- [Abstract / Results] Abstract and Results: the assertion that the 200-configuration CPD set 'fully reproduce[s]' structural and dynamical properties supplies no numerical error metrics (RMSE on forces, RDF deviations, diffusion-coefficient differences), error bars, or baseline comparisons against random or active-learning selections; without these quantities the central performance claim cannot be evaluated.
- [Validation / §4] Validation procedure: the reported MLFF-MD trajectories are compared only against DFT-MD generated from the same underlying ensemble; the manuscript does not demonstrate that the LLPT order parameter (density discontinuity or electronic gap closure) is recovered when the reference data are drawn from an independent, much larger sampling. This leaves open the possibility that low-probability but transition-controlling configurations are absent from the distilled manifold.
- [Methods / §2.2] CPD selection details: the precise definition of 'uncertainty-driven corner cases' (e.g., the uncertainty estimator, the number of peripheral samples retained, and the clustering algorithm) is not given with sufficient algorithmic or hyper-parameter transparency to allow reproduction or to assess whether the peripheral set systematically covers the fluctuation modes that control the LLPT.
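The error metrics named in the first major comment are standard and cheap to compute. The sketch below is illustrative only and is not taken from the manuscript: it assumes force arrays of shape (frames, atoms, 3), one mean-squared-displacement curve per independent MD run, and the three-dimensional Einstein relation D = slope / 6; the function names and the `fit_start` parameter are hypothetical.

```python
import numpy as np

def force_rmse(f_pred, f_ref):
    """Root-mean-square error over all force components."""
    delta = np.asarray(f_pred, dtype=float) - np.asarray(f_ref, dtype=float)
    return float(np.sqrt(np.mean(delta ** 2)))

def diffusion_coefficient(time, msd, fit_start=0.5):
    """Einstein relation in 3D: D = slope / 6, with the slope fitted to the
    late-time part of MSD(t), starting at the fraction `fit_start` of the curve."""
    time = np.asarray(time, dtype=float)
    msd = np.asarray(msd, dtype=float)
    i0 = int(fit_start * len(time))
    slope, _intercept = np.polyfit(time[i0:], msd[i0:], 1)
    return float(slope / 6.0)

def mean_and_stderr(values):
    """Mean and standard error of the mean over independent runs."""
    values = np.asarray(values, dtype=float)
    return float(values.mean()), float(values.std(ddof=1) / np.sqrt(len(values)))
```

Applying `diffusion_coefficient` to each independent run and passing the results to `mean_and_stderr` gives the kind of error bar the comment asks for; the same two numbers can be reported for the random-selection and active-learning baselines.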
minor comments (2)
- [§2] Notation: the symbols used for the central and peripheral subsets are introduced without a clear table or equation reference, making it difficult to follow the selection procedure.
- [Figure 3] Figure clarity: the RDF and VACF plots lack shaded uncertainty bands or direct overlay of the reference DFT curves, reducing the ability to judge quantitative agreement.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript accordingly to improve quantitative validation, transparency, and reproducibility.
Point-by-point responses
- Referee: [Abstract / Results] Abstract and Results: the assertion that the 200-configuration CPD set 'fully reproduce[s]' structural and dynamical properties supplies no numerical error metrics (RMSE on forces, RDF deviations, diffusion-coefficient differences), error bars, or baseline comparisons against random or active-learning selections; without these quantities the central performance claim cannot be evaluated.
  Authors: We agree that quantitative metrics are required to substantiate the performance claims. In the revised manuscript we will add RMSE values for forces and energies, integrated absolute deviations for RDFs, diffusion-coefficient differences with error bars obtained from multiple independent MD runs, and explicit comparisons against both random selection and active-learning baselines. These additions will be placed in the Results section and referenced in the Abstract.
  Revision: yes
- Referee: [Validation / §4] Validation procedure: the reported MLFF-MD trajectories are compared only against DFT-MD generated from the same underlying ensemble; the manuscript does not demonstrate that the LLPT order parameter (density discontinuity or electronic gap closure) is recovered when the reference data are drawn from an independent, much larger sampling. This leaves open the possibility that low-probability but transition-controlling configurations are absent from the distilled manifold.
  Authors: This is a fair criticism. Our present validation uses reference trajectories drawn from the same ensemble for direct comparability. To address coverage of rare but critical configurations, the revised manuscript will include additional tests against an independent, substantially larger reference dataset sampled from a broader ensemble. We will explicitly verify recovery of the LLPT order parameters (density discontinuity and electronic gap closure) on this independent set.
  Revision: yes
- Referee: [Methods / §2.2] CPD selection details: the precise definition of 'uncertainty-driven corner cases' (e.g., the uncertainty estimator, the number of peripheral samples retained, and the clustering algorithm) is not given with sufficient algorithmic or hyper-parameter transparency to allow reproduction or to assess whether the peripheral set systematically covers the fluctuation modes that control the LLPT.
  Authors: We thank the referee for highlighting the need for reproducibility. The revised §2.2 will supply the exact uncertainty estimator (ensemble variance of an auxiliary committee of models), the precise number of peripheral samples retained (50 out of the final 200), the clustering algorithm and its parameters (k-means with k = 150 for central samples), and all hyper-parameters. We will also add a short analysis demonstrating that the peripheral subset captures the dominant fluctuation modes associated with the LLPT.
  Revision: yes
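The selection described in the last response (150 k-means central samples plus 50 peripheral samples ranked by committee variance) can be sketched as follows. This is a hedged illustration built only from the parameters stated in the rebuttal, not the authors' implementation: the per-configuration descriptor vectors, the shape of the committee predictions, the rule of taking the member nearest each centroid, and all function names are assumptions, and a plain NumPy Lloyd iteration stands in for whatever k-means code the authors use.

```python
import numpy as np

def sq_dists(x, centers):
    """Squared Euclidean distances, shape (n_configs, n_centers)."""
    return ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)

def kmeans(x, k, n_iter=50, seed=0):
    """Plain Lloyd iteration; returns final labels and squared distances."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = sq_dists(x, centers).argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):              # an empty cluster keeps its old center
                centers[j] = members.mean(axis=0)
    d2 = sq_dists(x, centers)
    return d2.argmin(axis=1), d2

def committee_uncertainty(committee_preds):
    """Variance across committee members (axis 0), averaged per configuration.
    Assumed input shape: (n_models, n_configs, ...)."""
    var = np.asarray(committee_preds, dtype=float).var(axis=0)
    return var.reshape(var.shape[0], -1).mean(axis=1)

def cpd_select(descriptors, uncertainty, n_central=150, n_peripheral=50, seed=0):
    """Central: the configuration nearest each k-means centroid.
    Peripheral: highest-uncertainty configurations not already chosen."""
    x = np.asarray(descriptors, dtype=float)
    uncertainty = np.asarray(uncertainty, dtype=float)
    labels, d2 = kmeans(x, n_central, seed=seed)
    central = []
    for j in range(n_central):
        members = np.flatnonzero(labels == j)
        if members.size:                  # empty clusters contribute nothing
            central.append(members[d2[members, j].argmin()])
    central = np.array(central, dtype=int)
    candidates = np.setdiff1d(np.arange(len(x)), central)
    ranked = candidates[np.argsort(uncertainty[candidates])[::-1]]
    return central, ranked[:n_peripheral]
```

With the default arguments this returns at most 150 central and exactly 50 peripheral indices into the candidate pool (provided the pool is large enough), matching the 200-configuration budget quoted above. How the descriptors are built and which quantity the committee predicts are left open here because the source does not state them.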
Circularity Check
No circularity: CPD is an independent algorithmic selection with external validation
The paper presents CPD as a procedural algorithm that selects representative cluster centers plus uncertainty-driven corner cases from an existing pool of configurations. No equations, fitted parameters, or self-citations are shown that reduce the claimed sufficiency of 200 structures to a tautological definition or to the same data used for selection. Validation consists of direct comparison of MLFF-MD observables (RDF, diffusion, etc.) against reference DFT-MD trajectories on the target system; although the referee report notes that these trajectories come from the same underlying ensemble, the comparison is external to the distillation step itself. The central claim therefore rests on empirical reproduction rather than any self-referential construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: structural fluctuations are significantly elevated in the phase transition regime, making standard training inefficient.