pith. sign in

arxiv: 2604.03027 · v1 · submitted 2026-04-03 · ⚛️ physics.chem-ph

Dataset Distillation for Machine Learning Force Field in Phase Transition Regime

Pith reviewed 2026-05-13 18:08 UTC · model grok-4.3

classification ⚛️ physics.chem-ph
keywords dataset distillationmachine learning force fieldphase transitionliquid hydrogencentral-peripheral distillationstructural diversityab initio labeling
0
0 comments X

The pith

A Central-Peripheral Distillation method lets machine learning force fields reproduce liquid hydrogen properties near its phase transition using only 200 configurations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Central-Peripheral Distillation algorithm to create compact training sets for machine learning force fields when structural fluctuations are large, as in phase transition regimes. By merging representative configurations with critical corner cases, the method aims to preserve maximum structural diversity in a much smaller dataset. Validation on the liquid-liquid phase transition of dense hydrogen shows that an MLFF trained on the resulting 200 configurations matches the structural and dynamical properties obtained from much larger reference sets. This reduction matters because generating high-accuracy labels for training data is computationally expensive, especially when moving beyond standard density functional theory.

Core claim

The CPD algorithm produces a distilled dataset whose structural diversity is sufficient for an MLFF trained on only 200 configurations to fully reproduce the structural and dynamical properties of liquid hydrogen in the vicinity of its liquid-liquid phase transition regime.

What carries the argument

The Central-Peripheral Distillation algorithm, which integrates representative samples with critical corner cases to retain maximum structural diversity in the distilled set.

If this is right

  • MLFF training in phase transition regimes becomes feasible with far fewer expensive ab initio labels.
  • High-level electronic structure methods can label the smaller distilled sets to raise overall predictive accuracy.
  • Large-scale atomistic simulations of fluctuating systems near transitions can be performed at lower data-generation cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same selection logic could reduce data needs for other materials whose phase diagrams contain regions of high fluctuation.
  • Smaller distilled sets open the door to using more expensive reference methods such as quantum Monte Carlo for the labeling step.
  • Testing whether the 200-configuration set remains sufficient when the transition pressure or temperature is shifted would check robustness outside the exact validation window.

Load-bearing premise

That strategically selecting representative samples together with critical corner cases will automatically produce a dataset diverse enough to capture all relevant fluctuations in the phase transition regime.

What would settle it

An MLFF trained on the 200 CPD-selected configurations produces radial distribution functions or mean-squared displacements that deviate from those obtained with a much larger reference dataset when both are evaluated at the same state points near the transition.

Figures

Figures reproduced from arXiv: 2604.03027 by Ji Chen, Qingyuan Zhang, Ruiyang Chen.

Figure 1
Figure 1. Figure 1: FIG. 1. Schematic illustration of the CPD sampling workflow. The CPD algorithm extracts [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: FIG. 2. Structural characterization and phase transition of the hydrogen test dataset for LLPT at [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: FIG. 3. Comparison of energy and force prediction performance. The RMSE of energy (a) and [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: FIG. 4. Performance of MLFFs trained on different datasets for hydrogen LLPT at 1000 K. Pressure [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
read the original abstract

Machine learning force field (MLFF) has emerged as a powerful data-driven tool for atomistic simulations, enabling large-scale and complex atomic systems to be simulated with accuracy comparable to \textit{ab initio} methods. However, MLFFs often suffer from low training efficiency in the phase transition regime, where structural fluctuations are significantly elevated. To address this challenge, we propose a Central-Peripheral Distillation (CPD) algorithm for training dataset distillation. By strategically integrating representative samples with critical corner cases, the CPD algorithm ensures that the distilled dataset retains maximum structural diversity. We validated the efficacy of the CPD method on the liquid-liquid phase transition of dense hydrogen. Results show that, with the CPD approach, only 200 configurations are sufficient to train a MLFF that can fully reproduce the structural and dynamical properties of liquid hydrogen in the vicinity of its phase transition regime. This work paves the way for high-fidelity labeling of the MLFF training datasets, for instance by adopting high-level \textit{ab initio} calculations beyond the standard density functional theory, thereby enhancing the predictive accuracy of MLFFs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a Central-Peripheral Distillation (CPD) algorithm that combines representative cluster centers with uncertainty-selected corner cases to produce compact training sets for machine-learning force fields (MLFFs). Applied to the liquid-liquid phase transition (LLPT) of dense hydrogen, the central claim is that a distilled set of only 200 configurations suffices to train an MLFF whose molecular-dynamics trajectories reproduce the structural (RDF, coordination) and dynamical (diffusion, velocity autocorrelation) observables obtained from reference DFT-MD in the vicinity of the transition.

Significance. If the quantitative validation holds, the CPD procedure would offer a practical route to high-fidelity MLFF training data in regimes of large structural fluctuations, thereby lowering the barrier to labeling with higher-level ab initio methods. The algorithmic emphasis on both typical and high-uncertainty configurations directly targets a known bottleneck in MLFF construction for phase-transition systems.

major comments (3)
  1. [Abstract / Results] Abstract and Results: the assertion that the 200-configuration CPD set 'fully reproduce[s]' structural and dynamical properties supplies no numerical error metrics (RMSE on forces, RDF deviations, diffusion-coefficient differences), error bars, or baseline comparisons against random or active-learning selections; without these quantities the central performance claim cannot be evaluated.
  2. [Validation / §4] Validation procedure: the reported MLFF-MD trajectories are compared only against DFT-MD generated from the same underlying ensemble; the manuscript does not demonstrate that the LLPT order parameter (density discontinuity or electronic gap closure) is recovered when the reference data are drawn from an independent, much larger sampling. This leaves open the possibility that low-probability but transition-controlling configurations are absent from the distilled manifold.
  3. [Methods / §2.2] CPD selection details: the precise definition of 'uncertainty-driven corner cases' (e.g., the uncertainty estimator, the number of peripheral samples retained, and the clustering algorithm) is not given with sufficient algorithmic or hyper-parameter transparency to allow reproduction or to assess whether the peripheral set systematically covers the fluctuation modes that control the LLPT.
minor comments (2)
  1. [§2] Notation: the symbols used for the central and peripheral subsets are introduced without a clear table or equation reference, making it difficult to follow the selection procedure.
  2. [Figure 3] Figure clarity: the RDF and VACF plots lack shaded uncertainty bands or direct overlay of the reference DFT curves, reducing the ability to judge quantitative agreement.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript accordingly to improve quantitative validation, transparency, and reproducibility.

read point-by-point responses
  1. Referee: [Abstract / Results] Abstract and Results: the assertion that the 200-configuration CPD set 'fully reproduce[s]' structural and dynamical properties supplies no numerical error metrics (RMSE on forces, RDF deviations, diffusion-coefficient differences), error bars, or baseline comparisons against random or active-learning selections; without these quantities the central performance claim cannot be evaluated.

    Authors: We agree that quantitative metrics are required to substantiate the performance claims. In the revised manuscript we will add RMSE values for forces and energies, integrated absolute deviations for RDFs, diffusion-coefficient differences with error bars obtained from multiple independent MD runs, and explicit comparisons against both random selection and active-learning baselines. These additions will be placed in the Results section and referenced in the Abstract. revision: yes

  2. Referee: [Validation / §4] Validation procedure: the reported MLFF-MD trajectories are compared only against DFT-MD generated from the same underlying ensemble; the manuscript does not demonstrate that the LLPT order parameter (density discontinuity or electronic gap closure) is recovered when the reference data are drawn from an independent, much larger sampling. This leaves open the possibility that low-probability but transition-controlling configurations are absent from the distilled manifold.

    Authors: This is a fair criticism. Our present validation uses reference trajectories drawn from the same ensemble for direct comparability. To address coverage of rare but critical configurations, the revised manuscript will include additional tests against an independent, substantially larger reference dataset sampled from a broader ensemble. We will explicitly verify recovery of the LLPT order parameters (density discontinuity and electronic gap closure) on this independent set. revision: yes

  3. Referee: [Methods / §2.2] CPD selection details: the precise definition of 'uncertainty-driven corner cases' (e.g., the uncertainty estimator, the number of peripheral samples retained, and the clustering algorithm) is not given with sufficient algorithmic or hyper-parameter transparency to allow reproduction or to assess whether the peripheral set systematically covers the fluctuation modes that control the LLPT.

    Authors: We thank the referee for highlighting the need for reproducibility. The revised §2.2 will supply the exact uncertainty estimator (ensemble variance of an auxiliary committee of models), the precise number of peripheral samples retained (50 out of the final 200), the clustering algorithm and its parameters (k-means with k = 150 for central samples), and all hyper-parameters. We will also add a short analysis demonstrating that the peripheral subset captures the dominant fluctuation modes associated with the LLPT. revision: yes

Circularity Check

0 steps flagged

No circularity: CPD is an independent algorithmic selection with external validation

full rationale

The paper presents CPD as a procedural algorithm that selects representative cluster centers plus uncertainty-driven corner cases from an existing pool of configurations. No equations, fitted parameters, or self-citations are shown that reduce the claimed sufficiency of 200 structures to a tautological definition or to the same data used for selection. Validation consists of direct comparison of MLFF-MD observables (RDF, diffusion, etc.) against independent reference DFT-MD trajectories on the target system; this comparison is external to the distillation step itself. The central claim therefore rests on empirical reproduction rather than any self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that phase-transition fluctuations can be captured by a small curated subset; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption Structural fluctuations are significantly elevated in the phase transition regime, making standard training inefficient.
    Directly stated as the motivating challenge in the abstract.

pith-pipeline@v0.9.0 · 5490 in / 1084 out tokens · 36123 ms · 2026-05-13T18:08:47.152705+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages

  1. [1]

    Behler and M

    J. Behler and M. Parrinello, Generalized neural-network representation of high-dimensional potential-energy surfaces, Phys. Rev. Lett.98, 146401 (2007). S4

  2. [2]

    K. T. Sch¨ utt, H. E. Sauceda, P. J. Kindermans, A. Tkatchenko, and K.-R. M¨ uller, Schnet - a deep learning architecture for molecules and materials., The Journal of chemical physics148, 241722 (2017)

  3. [3]

    H. Wang, L. Zhang, J. Han, and W. E, Deepmd-kit: A deep learning package for many-body potential energy representation and molecular dynamics, Computer Physics Communications 228, 178 (2018)

  4. [4]

    Batzner, A

    S. Batzner, A. Musaelian, L. Sun, M. Geiger, J. P. Mailoa, M. Kornbluth, N. Molinari, T. E. Smidt, and B. Kozinsky, E (3)-equivariant graph neural networks for data-efficient and accu- rate interatomic potentials, Nature communications13, 2453 (2022)

  5. [5]

    V. L. Deringer, A. P. Bart´ ok, N. Bernstein, D. M. Wilkins, M. Ceriotti, and G. Cs´ anyi, Gaussian process regression for materials and molecules, Chemical Reviews121, 10073 (2021)

  6. [6]

    O. T. Unke, S. Chmiela, H. E. Sauceda, M. Gastegger, I. Poltavsky, K. T. Sch¨ utt, A. Tkatchenko, and K.-R. M¨ uller, Machine learning force fields, Chemical Reviews121, 10142 (2021)

  7. [7]

    Batatia, D

    I. Batatia, D. P. Kovacs, G. Simm, C. Ortner, and G. Cs´ anyi, Mace: Higher order equivari- ant message passing neural networks for fast and accurate force fields, Advances in neural information processing systems35, 11423 (2022)

  8. [8]

    Jinnouchi, F

    R. Jinnouchi, F. Karsai, and G. Kresse, On-the-fly machine learning force field generation: Application to melting points, Phys. Rev. B100, 014105 (2019)

  9. [9]

    Jinnouchi, J

    R. Jinnouchi, J. Lahnsteiner, F. Karsai, G. Kresse, and M. Bokdam, Phase transitions of hybrid perovskites simulated by machine-learning force fields trained on the fly with bayesian inference, Physical review letters122, 225701 (2019)

  10. [10]

    Cheng, G

    B. Cheng, G. Mazzola, C. J. Pickard, and M. Ceriotti, Evidence for supercritical behaviour of high-pressure liquid hydrogen, Nature585, 217 (2020)

  11. [11]

    P.-Y. Chen, K. Shibata, and T. Mizoguchi, High precision machine learning force field develop- ment for batio3 phase transitions, amorphous, and liquid structures, APL Machine Learning 3, 036115 (2025)

  12. [12]

    V. L. Deringer, C. J. Pickard, and G. Cs´ anyi, Data-driven learning of total and local energies in elemental boron., Physical review letters120, 156001 (2017)

  13. [13]

    J. S. Smith, B. T. Nebgen, N. Lubbers, O. Isayev, and A. E. Roitberg, Less is more: sampling chemical space with active learning, The Journal of chemical physics148, 241733 (2018). S5

  14. [14]

    Sivaraman, A

    G. Sivaraman, A. N. Krishnamoorthy, M. Baur, C. Holm, M. Stan, G. Cs´ anyi, C. J. Ben- more, and ´A. V´ azquez-Mayagoitia, Machine-learned interatomic potentials by active learning: amorphous and liquid hafnium dioxide, npj Computational Materials6, 104 (2020)

  15. [15]

    Vandermause, Y

    J. Vandermause, Y. Xie, J. S. Lim, C. J. Owen, and B. Kozinsky, Active learning of reactive bayesian force fields applied to heterogeneous catalysis dynamics of h/pt, Nature Communi- cations13, 5183 (2021)

  16. [16]

    Finkbeiner, S

    J. Finkbeiner, S. Tovey, and D. Fink, Generating minimal training sets for machine learned potentials., Physical review letters132, 167301 (2023)

  17. [17]

    J. Qi, T. W. Ko, B. Wood, T. A. Pham, and S. P. Ong, Robust training of machine learning interatomic potentials with dimensionality reduction and stratified sampling, npj Computa- tional Materials10, 43 (2023)

  18. [18]

    K. Li, D. Persaud, K. Choudhary, B. DeCost, M. Greenwood, and J. Hattrick-Simpers, Ex- ploiting redundancy in large materials datasets for efficient machine learning with less data, Nature Communications14, 7283 (2023)

  19. [19]

    Schwalbe-Koda, S

    D. Schwalbe-Koda, S. Hamel, B. Sadigh, F. Zhou, and V. Lordi, Model-free estimation of com- pleteness, uncertainties, and outliers in atomistic machine learning using information theory, Nature Communications16, 4014 (2025)

  20. [20]

    G. H. Booth, A. Gr¨ uneis, G. Kresse, and A. Alavi, Towards an exact description of electronic wavefunctions in real solids, Nature493, 365 (2012)

  21. [21]

    Hermann, J

    J. Hermann, J. S. Spencer, K. Choo, A. Mezzacapo, W. M. C. Foulkes, D. Pfau, G. Carleo, and F. No’e, Ab initio quantum chemistry with neural-network wavefunctions, Nature Reviews Chemistry7, 692 (2022)

  22. [22]

    W. Ren, W. Fu, X. Wu, and J. Chen, Towards the ground state of molecules via diffusion monte carlo on neural networks, Nature Communications14, 1860 (2023)

  23. [23]

    B. X. Shi, A. S. Rosen, T. Sch¨ afer, A. Gr¨ uneis, V. Kapil, A. Zen, and A. Michaelides, An accurate and efficient framework for modelling the surface chemistry of ionic materials, Nature Chemistry17, 1688 (2024)

  24. [24]

    Z. Tang, H. Chen, Y. Li, Y. Qian, Y. Wang, W. Fu, J. Li, C. Si, W. Duan, J. Chen,et al., Deep-learning electronic structure calculations, Nature Computational Science5, 1133 (2025)

  25. [25]

    Y. Qian, X. Li, Z. Li, W. Ren, and J. Chen, Deep learning quantum monte carlo for solids, Wiley Interdisciplinary Reviews: Computational Molecular Science15, e70015 (2025). S6

  26. [26]

    Huang, Z

    Z. Huang, Z. Guo, C. Cao, H. Q. Pham, X. Wen, G. H. Booth, J. Chen, and D. Lv, A multi-resolution systematically improvable quantum embedding scheme for large-scale surface chemistry calculations, Nature Communications16, 9297 (2025)

  27. [27]

    E. P. Wigner and H. B. Huntington, On the possibility of a metallic modification of hydrogen, Journal of Chemical Physics3, 764 (1935)

  28. [28]

    J. M. McMahon, M. A. Morales, C. Pierleoni, and D. M. Ceperley, The properties of hydrogen and helium under extreme conditions, Reviews of modern physics84, 1607 (2012)

  29. [29]

    Pierleoni, M

    C. Pierleoni, M. A. Morales, G. Rillo, M. Holzmann, and D. M. Ceperley, Liquid–liquid phase transition in hydrogen by coupled electron–ion monte carlo simulations, Proceedings of the National Academy of Sciences113, 4953 (2016)

  30. [30]

    M. A. Morales, C. Pierleoni, E. Schwegler, and D. M. Ceperley, Evidence for a first-order liquid-liquid transition in high-pressure hydrogen from ab initio simulations, Proceedings of the National Academy of Sciences107, 12799 (2010)

  31. [31]

    W. Fang, J. Chen, Y. Feng, X.-Z. Li, and A. Michaelides, The quantum nature of hydrogen, International Reviews in Physical Chemistry38, 35 (2019)

  32. [32]

    Istas, S

    M. Istas, S. Jensen, Y. Yang, M. Holzmann, C. Pierleoni, and D. M. Ceperley, Liquid-liquid phase transition of hydrogen and its critical point: Analysis from ab initio simulation and a machine-learned potential, Physical Review E111, 045307 (2025)

  33. [33]

    Giannozzi, S

    P. Giannozzi, S. Baroni, N. Bonini, M. Calandra, R. Car, C. Cavazzoni, D. Ceresoli, G. L. Chiarotti, M. Cococcioni, I. Dabo,et al., Quantum espresso: a modular and open-source software project for quantum simulations of materials, Journal of physics: Condensed matter 21, 395502 (2009)

  34. [34]

    Giannozzi, O

    P. Giannozzi, O. Andreussi, T. Brumme, O. Bunau, M. Buongiorno Nardelli, M. Calandra, R. Car, C. Cavazzoni, D. Ceresoli, M. Cococcioni,et al., Advanced capabilities for materials modelling with quantum espresso, Journal of physics: Condensed matter29, 465901 (2017)

  35. [35]

    M. Dion, H. Rydberg, E. Schr¨ oder, D. C. Langreth, and B. I. Lundqvist, Van der waals density functional for general geometries, Physical review letters92, 246401 (2004)

  36. [36]

    Vanderbilt, Optimally smooth norm-conserving pseudopotentials, Physical Review B32, 8412 (1985)

    D. Vanderbilt, Optimally smooth norm-conserving pseudopotentials, Physical Review B32, 8412 (1985)

  37. [37]

    Bussi, D

    G. Bussi, D. Donadio, and M. Parrinello, Canonical sampling through velocity rescaling., The Journal of chemical physics126, 014101 (2007). S7

  38. [38]

    Cheng, G

    B. Cheng, G. Mazzola, C. J. Pickard, and M. Ceriotti, Evidence for supercritical behaviour of high-pressure liquid hydrogen, Nature585, 217 (2020). S8