k-Means Clustering in Fingerprint-Based Configuration Selection for Fitting Interatomic Potentials

Jan Drahokoupil; Ludvík Löbel; Miroslav Lebeda; Petr Vlčák

arXiv: 2606.09575 · v1 · pith:T7G2CTZYnew · submitted 2026-06-08 · cond-mat.mtrl-sci


Pith reviewed 2026-06-27 15:26 UTC · model grok-4.3

classification: cond-mat.mtrl-sci

keywords: k-means clustering, configuration selection, interatomic potentials, EAM, fingerprint, CrystalNN, RDF, titanium


The pith

k-Means clustering on CrystalNN and RDF fingerprints selects smaller, more effective configuration sets for fitting EAM potentials than random selection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper demonstrates a method to choose representative atomistic configurations from a large DFT dataset by clustering their fingerprints. The approach uses k-means on features derived from CrystalNN neighbor models and radial distribution functions. When fitting an embedded-atom method potential for titanium, the clustered selections produce more accurate energy and force predictions with less variation than random picks, even when using far fewer points. Only around 30 configurations prove sufficient to model the behavior of the entire 1800-configuration set. Visualization with t-SNE further shows that clustering identifies redundant environments involving vacancies that random sampling overlooks.

Core claim

The paper claims that k-means clustering applied to atomistic configuration fingerprints based on the CrystalNN model and radial distribution function improves the accuracy of fitting classical molecular dynamics interatomic potentials to density functional theory data for both energies and forces while requiring fewer configurations than random selection. For an EAM potential of titanium, this method achieves better precision and lower standard deviations, with only about 30 configurations sufficient to describe the full set of 1800 configurations well. The t-SNE reduction reveals an overlap between the subsets with and without a Ti vacancy, a similarity that k-means captures but random selection does not. Excluding the overlapping vacancy configurations from the k-means selection and using them only as a test set leaves their energy and force predictions at similar precision.

What carries the argument

k-means clustering on configuration fingerprints constructed from CrystalNN and RDF representations, which groups similar atomic environments to select diverse yet representative training configurations for potential fitting.
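The selection step can be sketched roughly as follows. This is a minimal pure-Python illustration and not the authors' implementation: the deterministic farthest-point seeding, the rule of keeping the configuration nearest each centroid, and the names `farthest_point_init` and `kmeans_select` are all assumptions made for the example. In practice a library k-means would likely be used on the PCA-reduced fingerprints.

```python
import math


def farthest_point_init(points, k):
    """Deterministic seeding (an assumption): start from point 0, then keep
    adding the point that lies farthest from every centroid chosen so far."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(far))
    return centroids


def kmeans_select(fingerprints, k, n_iter=100):
    """Run Lloyd's k-means and return, for each non-empty cluster, the index
    of the configuration closest to the cluster centroid."""
    centroids = farthest_point_init(fingerprints, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each configuration joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(fp, centroids[c]))
            for fp in fingerprints
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [fp for fp, lab in zip(fingerprints, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    selected = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            selected.append(
                min(members, key=lambda i: math.dist(fingerprints[i], centroids[c]))
            )
    return sorted(selected)


# Toy 2-D "fingerprints": three tight groups of three configurations each.
toy_fingerprints = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (10.0, 0.0), (10.1, 0.0), (10.0, 0.1),
]
selected = kmeans_select(toy_fingerprints, k=3)
print(selected)  # one representative index per group
```

Choosing one representative per cluster means the number of clusters directly sets the size of the training set, which matches the paper's framing of selecting "an arbitrary number" of configurations.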

If this is right

Only about 30 configurations suffice to obtain an EAM model that describes the full set of 1800 configurations well in terms of energies and forces.
k-means clustering consistently achieves better precision and lower standard deviations for a smaller number of configurations than random selection.
Overlapping configurations in t-SNE space indicate potential information redundancy that clustering can handle without loss of fit quality.
When overlapping configurations with vacancies are excluded from the k-means selection and used only as a test set, their energy and force predictions show similar precision.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar fingerprint clustering could reduce the computational expense of generating large training sets for other interatomic potential forms or materials systems.
The method highlights how dimensionality reduction techniques like t-SNE can help validate the completeness of selected configuration spaces.
Applying this to larger or more complex datasets might reveal systematic redundancies in high-throughput simulation workflows.

Load-bearing premise

The fingerprints based on CrystalNN and RDF are assumed to capture the atomic-environment similarities that matter for energy and force accuracy in the subsequent EAM fit.

What would settle it

If a comparison on the same titanium DFT dataset shows that random selection of configurations achieves equal or better energy and force accuracy in EAM fits than the k-means selected sets of the same size, the advantage would be falsified.

Figures

Figures reproduced from arXiv: 2606.09575 by Jan Drahokoupil, Ludvík Löbel, Miroslav Lebeda, Petr Vlčák.

**Figure 1.** Illustration of three generated configurations from the initial Ti supercell with 36 atoms, showing maximum displacements of 0.1 Å, 0.2 Å, and 0.5 Å, respectively.

2.3 DFT and MEAMfit2 Settings. For each configuration, we performed single-point DFT calculations to obtain the total energy and the forces acting on the atoms. These calculations were carried out using the Vienna Ab initio Simulation Package (VA…

**Figure 2.** Principal Component Analysis (PCA) of 352-element fingerprints for a dataset of 1800 titanium atomistic configurations. The cumulative explained variance reached 98 % with the first 85 components, as indicated by the green lines. These 85 components were then used as a new, reduced-dimensional fingerprint for each configuration. The significance of each of the 352 initial features of the fingerprints is de…

**Figure 3.** Contributions of each of the 352 fingerprint features to the 98 % of cumulative explained variance (corresponding to the first 85 PCA components) within the 1800 titanium set of configurations. The contributions are calculated by summing the squared loadings of each feature across all 85 PCA components. The green part corresponding to the atomic mass difference is zero as there is only one element in the d…
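The PCA reduction and the per-feature contributions described in the captions of Figures 2 and 3 can be sketched as below. This is an illustration under stated assumptions, not the paper's code: it uses NumPy's SVD, centres the features without rescaling them (the source does not say whether features were standardized), and treats "loadings" as the unit-length principal axes. The function name `pca_reduce` and the toy data are hypothetical.

```python
import numpy as np


def pca_reduce(X, variance_target=0.98):
    """Return (n_components, reduced data, per-feature contributions)."""
    Xc = X - X.mean(axis=0)                      # centre each feature
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)              # explained-variance ratio
    cumulative = np.cumsum(explained)
    # Smallest number of components whose cumulative ratio reaches the target.
    n_comp = min(int(np.searchsorted(cumulative, variance_target)) + 1, len(s))
    reduced = Xc @ Vt[:n_comp].T                 # scores on retained components
    # Contribution of each original feature: squared loadings summed over
    # the retained components (rows of Vt are unit-length principal axes).
    contributions = np.sum(Vt[:n_comp] ** 2, axis=0)
    return n_comp, reduced, contributions


# Toy data: 200 "configurations", 6 features with very different spreads.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1, 1.0])
X[:, 5] = 0.0   # a constant feature, like the mass difference in a one-element set

n_comp, reduced, contributions = pca_reduce(X)
print(n_comp, reduced.shape, np.round(contributions, 3))
```

A feature that never varies contributes nothing to any retained component, which is consistent with the caption's remark that the atomic mass difference part is zero for a single-element dataset.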

Original abstract

In this study, we present a method for selecting an arbitrary number of distinct configurations from a larger data set by applying k-means clustering to atomistic configuration fingerprints based on the CrystalNN model and radial distribution function (RDF). This approach improves the accuracy of fitting classical molecular dynamics interatomic potentials to density functional theory (DFT) data for both energies and forces while requiring fewer configurations than random selection. We demonstrate this improvement by fitting an embedded-atom method (EAM) potential for titanium, using various configurational sizes from an initial set of 1800 configurations. The k-means clustering consistently achieves better precision and lower standard deviations for a smaller number of configurations than random selection. The results also suggest that only about 30 configurations are sufficient to obtain an EAM model that describes well the full set of 1800 configurations in terms of energies and forces. Additionally, the t-distributed stochastic neighbor embedding (t-SNE) method was used to reduce the configuration fingerprints into 2D space, and it revealed an overlap between two configuration subsets with and without a Ti vacancy, indicating similar atomic environments. This similarity is captured by k-means clustering but not by random selection. Furthermore, when the overlapping configurations with vacancies were excluded from the k-means algorithm and used only as a test set, their energy and force predictions showed similar precision to those when they were included. This indicates that the overlapping configurations in the 2D t-SNE space indeed imply potential information redundancy among the atomistic configurations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

k-means on CrystalNN+RDF fingerprints beats random selection for EAM config choice on Ti, with a redundancy check that partially addresses the correlation concern.


The main takeaway is that k-means clustering on CrystalNN and RDF fingerprints picks training configurations for EAM potential fitting more effectively than random sampling. On the titanium dataset it reaches usable accuracy on energies and forces with roughly 30 configurations out of 1800.

The paper applies an existing clustering technique to this particular fingerprint pair for the concrete task of cutting DFT data volume in potential development. The results show lower standard deviations at small training sizes, and the t-SNE plot plus the vacancy hold-out test give some evidence that the clusters capture real redundancy rather than noise.

The hold-out test is a useful addition. It shows that configurations the method groups together still produce similar fit quality when left out, which supports the claim that the fingerprints are doing something relevant.

The soft spot is that the abstract supplies no concrete error values, fitting details, or statistical tests, so the size of the improvement stays unclear. The assumption that fingerprint distances track EAM energy and force differences is tested only indirectly through the final accuracy numbers. A direct correlation check between the two is missing, which leaves the method's reliability on other materials or potentials open.
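A direct correlation check of the kind described above might look like the following sketch. It computes a Spearman rank correlation between pairwise fingerprint distances and pairwise absolute energy differences. The helper names, the use of Euclidean distance, and the toy numbers are assumptions for illustration. Nothing here comes from the paper's data.

```python
import math
from itertools import combinations


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            r[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return r


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_distance_vs_energy(fingerprints, energies):
    """Spearman rho between pairwise fingerprint distances and |E_i - E_j|."""
    pairs = list(combinations(range(len(energies)), 2))
    d_fp = [math.dist(fingerprints[i], fingerprints[j]) for i, j in pairs]
    d_e = [abs(energies[i] - energies[j]) for i, j in pairs]
    return pearson(ranks(d_fp), ranks(d_e))


# Toy case where energy is exactly proportional to a 1-D fingerprint,
# so distances and energy differences are perfectly rank-correlated.
toy_fingerprints = [(0.0,), (1.0,), (3.0,), (7.0,)]
toy_energies = [0.0, 2.0, 6.0, 14.0]
rho = spearman_distance_vs_energy(toy_fingerprints, toy_energies)
print(round(rho, 6))
```

A rho near 1 on held-out configurations would support the assumption that fingerprint distance tracks energy differences. A value near 0 would suggest the clustering gains come from something else.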

This is for people who fit classical interatomic potentials and want a data-efficient selection step. A reader already working in that workflow could implement and test the pipeline on their own data.

It deserves peer review. The empirical demonstration and the redundancy test make the contribution concrete enough to evaluate.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes selecting training configurations for EAM potential fitting to Ti DFT data via k-means clustering on CrystalNN+RDF fingerprints. It claims this yields EAM models with better energy/force accuracy and lower variance than random selection, with ~30 configurations sufficing for the full 1800-configuration set; t-SNE visualization and a vacancy-exclusion test are used to argue that clustering captures redundancy.

Significance. If the fingerprint-to-property correlation holds, the method could reduce the DFT data volume required for interatomic potential development. The empirical comparison to random selection and the redundancy test via vacancy hold-out provide concrete support on this Ti dataset, though broader utility depends on the representation's fidelity.

major comments (3)

[Results (comparison of k-means vs. random selection)] The central claim that k-means on CrystalNN+RDF fingerprints outperforms random selection for EAM accuracy rests on the untested assumption that Euclidean (or other) fingerprint distance tracks differences in per-atom energies and forces. No correlation analysis (e.g., rank correlation between fingerprint distances and |E_i - E_j| or force residuals on the held-out set) is reported; without it the reported improvement remains dataset-specific and does not explain why clustering should systematically beat random sampling.
[Results (EAM fit accuracy vs. number of configurations)] The statement that 'only about 30 configurations are sufficient' and that k-means yields 'lower standard deviations' requires error-vs-N curves with error bars from repeated k-means initializations and random draws, plus a statistical test (e.g., paired t-test or Wilcoxon) on the RMSE distributions. The abstract supplies no numerical RMSE values or p-values, so the precision and variance claims cannot be evaluated for robustness.
[t-SNE analysis and vacancy test] t-SNE is used to demonstrate overlap between vacancy and non-vacancy configurations, yet t-SNE is a nonlinear embedding that does not preserve original distances; therefore overlap in the 2-D projection does not confirm that the high-dimensional fingerprints used by k-means actually group configurations with similar EAM energies/forces. A direct check in the original fingerprint space (e.g., intra- vs. inter-cluster distance statistics) is needed.
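The check in the original fingerprint space requested in the last comment could be sketched as a comparison of mean intra-group and inter-group distances. This is a hedged illustration: the function name `intra_inter_distances`, the Euclidean metric, and the toy points are assumptions, and the labels could equally be k-means cluster indices or subset tags such as 'Standard 0.5' and 'Vaca 0.5'.

```python
import math
from itertools import combinations
from statistics import mean


def intra_inter_distances(fingerprints, labels):
    """Mean pairwise distance within groups and between groups, computed in
    the full fingerprint space rather than in a 2-D embedding."""
    intra, inter = [], []
    for i, j in combinations(range(len(labels)), 2):
        d = math.dist(fingerprints[i], fingerprints[j])
        (intra if labels[i] == labels[j] else inter).append(d)
    return mean(intra), mean(inter)


# Toy example: two well-separated groups of two points each.
toy_fingerprints = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
toy_labels = [0, 0, 1, 1]
intra, inter = intra_inter_distances(toy_fingerprints, toy_labels)
print(intra, round(inter, 4))
```

If two subsets genuinely overlap, their inter-group mean distance should be comparable to the intra-group means. A large gap would indicate that the overlap seen in t-SNE is an artifact of the projection.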

minor comments (2)

[Methods] The EAM fitting protocol (loss-function weights between energy and force terms, optimizer, convergence criteria, and how the 'full set of 1800 configurations' is used for validation) should be stated explicitly so that the reported precision can be reproduced.
[Figures] Figure captions and axis labels for the error-vs-N plots should include the exact definition of the error metric (RMSE per atom? total energy?) and the number of independent trials used for the standard-deviation shading.
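To make the ambiguity about the error metric concrete, the sketch below spells out two common definitions: energy RMSE normalised per atom, and force RMSE over every Cartesian component. Which definition the paper uses is not stated in the material available here, so both functions and all numbers are illustrative assumptions.

```python
import math


def energy_rmse_per_atom(e_ref, e_pred, n_atoms):
    """RMSE of (E_pred - E_ref) / N over configurations (energy per atom)."""
    sq = [((p - r) / n) ** 2 for r, p, n in zip(e_ref, e_pred, n_atoms)]
    return math.sqrt(sum(sq) / len(sq))


def force_rmse(f_ref, f_pred):
    """RMSE over every Cartesian force component of every atom in every
    configuration. Forces are nested as [configuration][atom][x, y, z]."""
    sq = [
        (p - r) ** 2
        for conf_r, conf_p in zip(f_ref, f_pred)
        for atom_r, atom_p in zip(conf_r, conf_p)
        for r, p in zip(atom_r, atom_p)
    ]
    return math.sqrt(sum(sq) / len(sq))


# Made-up numbers: a 36-atom supercell and a 35-atom supercell with a vacancy.
e_ref = [-280.0, -272.0]
e_pred = [-279.64, -272.35]
n_atoms = [36, 35]
e_err = energy_rmse_per_atom(e_ref, e_pred, n_atoms)

f_ref = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
f_pred = [[(0.3, 0.0, 0.0), (1.0, 0.4, 0.0)]]
f_err = force_rmse(f_ref, f_pred)
print(round(e_err, 6), round(f_err, 6))
```

Normalising per atom matters here because configurations with and without a vacancy contain different numbers of atoms, so total-energy errors are not directly comparable between them.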

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and indicate the revisions we will make to strengthen the manuscript.


Referee: [Results (comparison of k-means vs. random selection)] The central claim that k-means on CrystalNN+RDF fingerprints outperforms random selection for EAM accuracy rests on the untested assumption that Euclidean (or other) fingerprint distance tracks differences in per-atom energies and forces. No correlation analysis (e.g., rank correlation between fingerprint distances and |E_i - E_j| or force residuals on the held-out set) is reported; without it the reported improvement remains dataset-specific and does not explain why clustering should systematically beat random sampling.

Authors: We agree that an explicit correlation analysis would provide stronger justification for why the clustering approach outperforms random selection. Our current work presents an empirical demonstration on the Ti dataset, where the method yields measurable improvements in EAM accuracy. To address the concern, we will add a rank-correlation analysis (e.g., Spearman) between fingerprint distances and energy/force differences on held-out configurations in the revised manuscript.
revision: yes

Referee: [Results (EAM fit accuracy vs. number of configurations)] The statement that 'only about 30 configurations are sufficient' and that k-means yields 'lower standard deviations' requires error-vs-N curves with error bars from repeated k-means initializations and random draws, plus a statistical test (e.g., paired t-test or Wilcoxon) on the RMSE distributions. The abstract supplies no numerical RMSE values or p-values, so the precision and variance claims cannot be evaluated for robustness.

Authors: We accept that the robustness of the claims would benefit from the requested statistical elements. In the revision we will include error-versus-N curves with error bars obtained from multiple k-means initializations and random draws, report explicit RMSE values, apply a statistical test such as the Wilcoxon signed-rank test to the RMSE distributions, and update the abstract with numerical results and significance values.
revision: yes

Referee: [t-SNE analysis and vacancy test] t-SNE is used to demonstrate overlap between vacancy and non-vacancy configurations, yet t-SNE is a nonlinear embedding that does not preserve original distances; therefore overlap in the 2-D projection does not confirm that the high-dimensional fingerprints used by k-means actually group configurations with similar EAM energies/forces. A direct check in the original fingerprint space (e.g., intra- vs. inter-cluster distance statistics) is needed.

Authors: We recognize that t-SNE is a visualization tool that does not preserve distances and therefore cannot alone confirm clustering behavior in the original space. We will add a direct analysis in the high-dimensional fingerprint space, including intra- versus inter-cluster distance statistics, to corroborate that the k-means groupings correspond to similar atomic environments relevant to the EAM fit.
revision: yes
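The Wilcoxon signed-rank test mentioned in the second response can be sketched in a few lines for small paired samples. This is an exact test by enumeration, written for illustration only: the pairing of one random-selection RMSE with one k-means RMSE per trial is an assumption about how the comparison would be set up, and the RMSE values are invented.

```python
from itertools import product


def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.
    Returns (W_plus, p_value). All 2**n sign patterns are enumerated,
    so this is only practical for small n."""
    d = [a - b for a, b in zip(x, y) if a != b]   # zero differences are dropped
    abs_d = [abs(v) for v in d]
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(len(d)), key=lambda i: abs_d[i])
    rank = [0.0] * len(d)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs_d[order[j + 1]] == abs_d[order[i]]:
            j += 1
        for t in range(i, j + 1):
            rank[order[t]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, v in zip(rank, d) if v > 0)
    centre = sum(rank) / 2                        # expected W_plus under the null
    observed = abs(w_plus - centre)
    # Under the null hypothesis every sign pattern is equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=len(d)):
        w = sum(r for r, s in zip(rank, signs) if s)
        if abs(w - centre) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** len(d)


# Invented RMSE values from eight paired trials at the same training-set size.
rmse_random = [5.1, 4.8, 5.6, 5.0, 5.3, 4.9, 5.4, 5.2]
rmse_kmeans = [4.2, 4.5, 4.4, 4.3, 4.6, 4.1, 4.4, 4.5]
w_plus, p_value = wilcoxon_signed_rank(rmse_random, rmse_kmeans)
print(w_plus, p_value)
```

Because every difference in this toy set has the same sign, only the two most extreme sign patterns are at least as extreme as the observation, giving p = 2/256.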

Circularity Check

0 steps flagged

No circularity: empirical comparison to random baseline is independent of clustering definition


The paper reports an empirical result: k-means on CrystalNN+RDF fingerprints produces configuration subsets that yield lower EAM energy/force errors (and lower variance) than random subsets of the same size when fitted to the 1800-configuration Ti DFT set. This is measured directly on held-out or full-set residuals after fitting; the improvement is not defined into existence by the clustering algorithm or any fitted parameter. The t-SNE overlap test is a post-hoc visualization confirming redundancy capture, not a derivation step. No equations, self-citations, or ansatzes reduce the central claim to a tautology with the input fingerprints or selection procedure. The method is externally falsifiable against the random baseline on the same data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract; the method rests on the domain assumption that the chosen fingerprints encode the relevant similarity for potential fitting. No free parameters or invented entities are described.

axioms (1)

Domain assumption: k-means clustering applied to the chosen fingerprints will produce representative subsets for EAM fitting. This is central to the claim that clustering outperforms random selection.

pith-pipeline@v0.9.1-grok · 5819 in / 1246 out tokens · 21347 ms · 2026-06-27T15:26:34.840225+00:00 · methodology


Reference graph

Works this paper leans on

14 extracted references

[1]

This selection should aim to minimize the training dataset size while maximizing the representation of the underlying potential energy surface (PES)

Introduction In the development of classical or machine-learning interatomic potentials based on ab initio density functional theory (DFT) data, an important task is the selection of suitable atomistic configurations for fitting or learning 1–3. This selection should aim to minimize the training dataset size while maximizing the representation of the unde...
[2]

Methodology 2.1 Configuration Fingerprint and K-means Clustering The fingerprint characterizing each atomistic configuration was defined as consisting of three parts:
[3]

These descriptors quantify the likelihood of different coordination numbers and the resemblance to specific geometric configurations

CrystalNN (244 elements, for geometrical atomic environments) The CrystalNN method generates a 61-dimensional vector for each atom within a configuration, capturing various aspects of its local atomic coordination environment using coordination descriptors. These descriptors quantify the likelihood of different coordination numbers and the resemblance to ...
[4]

The values are averaged over each atomic site, resulting in four metrics: mean, standard deviation, minimum, and maximum value

Atomic mass difference (4 elements, for differences in local element composition) This part represents the aggregate differences in atomic masses between an atom and its neighbors (up to 10 Å) in the configuration. The values are averaged over each atomic site, resulting in four metrics: mean, standard deviation, minimum, and maximum value
[5]

Average bond distance (4 elements), radial distribution function (100 elements, for different distances between atoms / lattice parameters) The first 4 elements represent statistical measures (mean, standard deviation, minimum and maximum) of the average bond distances for each atomic site and its neighboring pairs (up to 10 Å) within a configuration. Th...
[6]

The first 85 components accounted for a cumulative explained variance of 98% (Fig

Results 3.1 Clustering of Ti Configurations The 352-element fingerprints of the generated Ti set, consisting of 1800 configurations, were analyzed using PCA. The first 85 components accounted for a cumulative explained variance of 98% (Fig. 2). This number of components was used for each configuration as its reduced fingerprint, and they were clustered us...
[7]

One group includes the supercells without vacancies and with a maximum atomic displacement of up to 0.1 Å (marked as 'Standard 0.1')
[8]

The second group consists of supercells with no vacancy and displacements of up to 0.2 Å ('Standard 0.2')
[9]

The third group features supercells with one Ti vacancy and a maximum displacement of 0.1 Å ('Vaca 0.1')
[10]

The fourth group contains supercells with one Ti vacancy and a maximum displacement of 0.2 Å ('Vaca 0.2')
[11]

The fifth group is formed by supercells without vacancies and with a maximum displacement of 0.5 Å ('Standard 0.5'), as well as supercells with one Ti vacancy and a maximum displacement of 0.5 Å ('Vaca 0.5'). The fifth group exhibits considerable overlap between non-vacancy and vacancy Ti supercells, indicating redundancy (i.e., similarity in atomic envi...
[12]

The advantages of this selection method were shown in the fitting of the EAM potential for Ti with an initial configuration size of 1800

Conclusions This study demonstrates the effectiveness of using k-means clustering based on atomistic configuration fingerprints combining CrystalNN and RDF to select configurations from a larger generated set for fitting MD interatomic potentials on DFT data. The advantages of this selection method were shown in the fitting of the EAM potential for Ti wi...
[13]

SGS24/121/OHK2/3T/12]

Acknowledgments This work was supported by the Grant Agency of the Czech Technical University in Prague [grant No. SGS24/121/OHK2/3T/12]. Computational resources were provided by the e-INFRA CZ project (ID:90254), supported by the Ministry of Education, Youth and Sports of the Czech Republic. I would also like to thank Pavel Baláž for his valuable advic...
[14]

Machine Learning Interatomic Potentials: Keys to First-Principles Multiscale Modeling

References (1) Mortazavi, B. Machine Learning Interatomic Potentials: Keys to First-Principles Multiscale Modeling. In Machine Learning in Modeling and Simulation: Methods and Applications; Springer, 2023; pp 427–451. (2) Eyert, V.; Wormald, J.; Curtin, W. A.; Wimmer, E. Machine-Learned Interatomic Potentials: Recent Developments and Prospective Applicat...

arXiv 2023
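The fingerprint layout described in entries [3] to [5] adds up to the 352 elements quoted elsewhere: 244 + 4 + 4 + 100. The sketch below shows one way the 244-element CrystalNN part could arise from 61-dimensional per-atom vectors, namely by taking four statistics of each descriptor over the atoms. That 61 × 4 reading is an inference from the numbers, not something the extracted text states, and the function name, the ordering of the statistics, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def aggregate_sites(per_site_vectors):
    """Collapse per-atom descriptor vectors into one configuration-level
    vector: mean, standard deviation, minimum, and maximum of each descriptor
    across all atomic sites (statistic-major ordering is an assumption)."""
    columns = list(zip(*per_site_vectors))
    out = []
    for stat in (mean, pstdev, min, max):
        out.extend(stat(col) for col in columns)
    return out


# Placeholder per-atom vectors: 36 atoms, 61 descriptors each (synthetic values).
per_site = [[((a + d) % 7) / 7 for d in range(61)] for a in range(36)]
crystalnn_part = aggregate_sites(per_site)

# Sizes of the remaining parts as listed in the extracted text.
MASS_DIFFERENCE = 4
BOND_DISTANCE = 4
RDF_BINS = 100
FINGERPRINT_LENGTH = len(crystalnn_part) + MASS_DIFFERENCE + BOND_DISTANCE + RDF_BINS
print(len(crystalnn_part), FINGERPRINT_LENGTH)
```

Aggregating over sites makes the fingerprint length independent of the number of atoms, so 36-atom supercells and 35-atom vacancy supercells can be clustered together.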
