arxiv: 2504.16418 · v2 · submitted 2025-04-23 · ⚛️ physics.comp-ph · math.OC

Scalable Data-Driven Basis Selection for Linear Machine Learning Interatomic Potentials

Pith reviewed 2026-05-22 19:13 UTC · model grok-4.3

classification ⚛️ physics.comp-ph math.OC
keywords machine learning interatomic potentials · atomic cluster expansion · feature selection · sparse models · active set algorithms · basis selection · linear potentials

The pith

Active set algorithms produce sparse ACE models that run faster and generalize better than dense models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies active set algorithms to automate data-driven selection of basis functions inside the Atomic Cluster Expansion framework for linear machine learning interatomic potentials. This selection produces sparse models that reduce computational cost, raise accuracy on unseen atomic configurations, and increase interpretability relative to models that retain every possible basis function. The same procedure also returns entire families of models that trade cost against accuracy without separate hyperparameter searches. A sympathetic reader would care because manual feature selection has been a persistent source of high complexity and poor transfer in practical atomistic modeling. If the gains hold, the method would let practitioners obtain usable potentials more quickly and with less trial-and-error.

Core claim

The authors show that active set algorithms for automated, data-driven feature selection within the Atomic Cluster Expansion yield sparse linear models. These sparse models deliver consistent gains in computational efficiency, generalization accuracy, and interpretability over dense ACE models on multiple benchmark datasets. The algorithms further generate full paths of models that span a range of cost-to-accuracy ratios.

What carries the argument

Active set algorithms that iteratively select or discard basis functions in the Atomic Cluster Expansion according to their contribution to the training loss, thereby constructing sparse linear models from data.
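The idea of growing an active set of basis functions can be sketched with a small greedy example. This is only an illustration under stated assumptions: it uses orthogonal matching pursuit (cf. ref. [38]) as a simple stand-in, not the paper's active-set solver, and the function name `greedy_basis_path` and the synthetic design matrix are hypothetical.

```python
import numpy as np

def greedy_basis_path(A, y, max_basis):
    """Orthogonal matching pursuit: add one column to the active set per step.

    Returns a list of (active_indices, coefficients, training_rmse), one per step.
    Illustrative stand-in only; not the solver used in the paper.
    """
    n_features = A.shape[1]
    active = []
    residual = y.copy()
    path = []
    for _ in range(max_basis):
        # Score each candidate column by its correlation with the residual.
        scores = np.abs(A.T @ residual)
        if active:
            scores[active] = -np.inf  # never re-select an active column
        active.append(int(np.argmax(scores)))
        # Refit least squares restricted to the active columns.
        coef_active, *_ = np.linalg.lstsq(A[:, active], y, rcond=None)
        residual = y - A[:, active] @ coef_active
        coef = np.zeros(n_features)
        coef[active] = coef_active
        rmse = float(np.sqrt(np.mean(residual ** 2)))
        path.append((list(active), coef, rmse))
    return path

# Synthetic problem (assumption): 50 candidate basis functions, 3 truly relevant.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 50))
A /= np.linalg.norm(A, axis=0)          # unit-norm columns
true_coef = np.zeros(50)
true_coef[[3, 17, 41]] = [2.0, -1.5, 1.0]
y = A @ true_coef + 0.01 * rng.standard_normal(200)

path = greedy_basis_path(A, y, max_basis=5)
for active, _, rmse in path:
    print(len(active), sorted(active), round(rmse, 4))
```

Each entry of `path` is a complete model of a different size, so a single run already yields several candidate models with different cost and training error.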

If this is right

  • Atomistic simulations become cheaper to run at fixed accuracy because fewer basis functions are evaluated per atom.
  • Generalization to new configurations improves, lowering the risk of unphysical behavior outside the training set.
  • Model interpretability rises because only the retained basis functions need to be examined.
  • Development time drops since entire cost-accuracy curves are obtained from one run instead of repeated hyperparameter searches.
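The last point, choosing an operating point from a cost-accuracy path, can be sketched as follows. The `PathPoint` type, the `pick_operating_point` helper, and the numbers are hypothetical and for illustration only; they are not results or interfaces from the paper.

```python
from typing import NamedTuple

class PathPoint(NamedTuple):
    n_basis: int      # number of retained basis functions (a proxy for cost)
    val_error: float  # validation error of the model with that basis

def pick_operating_point(path, tolerance=0.05):
    """Return the smallest model whose validation error is within a
    relative `tolerance` of the best error found along the path."""
    best = min(p.val_error for p in path)
    admissible = [p for p in path if p.val_error <= best * (1.0 + tolerance)]
    return min(admissible, key=lambda p: p.n_basis)

# Made-up path for illustration; not data from the paper.
path = [
    PathPoint(10, 0.080),
    PathPoint(50, 0.031),
    PathPoint(200, 0.021),
    PathPoint(800, 0.020),
    PathPoint(3000, 0.024),
]

choice = pick_operating_point(path, tolerance=0.10)
print(choice)
```

Selecting on validation error rather than training error matters here, because training error keeps decreasing as basis functions are added while held-out error need not.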

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same selection procedure could be applied to other linear regression models for interatomic potentials to obtain similar sparsity benefits.
  • Production codes could expose the cost-accuracy path as a user option so that end users pick the operating point best suited to their scale and accuracy needs.
  • Transfer tests on datasets drawn from different chemical elements or extreme conditions would directly check whether the observed gains survive outside the original benchmarks.

Load-bearing premise

The benchmark datasets used in the tests are representative of the atomic configurations that appear in actual production simulations.

What would settle it

A new simulation on atomic configurations absent from the original benchmarks in which the sparse model’s prediction error exceeds that of the corresponding dense model or in which the measured wall-clock speedup disappears.

Figures

Figures reproduced from arXiv: 2504.16418 by Christoph Ortner, Matthias Militzer, Michael P. Friedlander, Tina Torabi.

Figure 1: Energy MAE vs basis size for selected limited diversity datasets (cf. III-A), comparing three sparse least squares solvers
Figure 2: Visualization of the basis functions selected for two-body interactions in the Mo dataset (cf. III-A). The figure illustrates …
Figure 3: Visualization of the basis functions selected for three-body interactions in the Mo dataset (cf. III-A). The figure illustrates …
Figure 4: Energy MAE vs. basis size for the Silicon dataset [4], …
Figure 5: Percentage error relative to the computed values in Table III for various silicon properties using GAP, BLR, and ASP (cf. …
Figure 6: Energy MAE vs. basis size for the Water dataset [18], …
Figure 7: Illustration of ASP and other (non-sparse) ACE solvers’ gradual selection of three-body basis functions for the water …

Original abstract

Machine learning interatomic potentials (MLIPs) provide an effective approach for accurately and efficiently modeling atomic interactions, expanding the capabilities of atomistic simulations to complex systems. However, a priori feature selection leads to high complexity, which can be detrimental to both computational cost and generalization, resulting in a need for hyperparameter tuning. We demonstrate the benefits of active set algorithms for automated data-driven feature selection. The proposed methods are implemented within the Atomic Cluster Expansion (ACE) framework. Computational tests conducted on a variety of benchmark datasets indicate that sparse ACE models consistently enhance computational efficiency, generalization accuracy and interpretability over dense ACE models. An added benefit of the proposed algorithms is that they produce entire paths of models with varying cost/accuracy ratio.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes active-set algorithms for automated, data-driven feature selection within the Atomic Cluster Expansion (ACE) framework for linear machine learning interatomic potentials. It claims that the resulting sparse ACE models deliver consistent gains in computational efficiency, generalization accuracy, and interpretability relative to dense ACE models, as shown on a variety of benchmark datasets, while also generating entire paths of models with tunable cost/accuracy ratios.

Significance. If the empirical results are robust, the work offers a practical route to reducing manual hyperparameter tuning and a priori basis complexity in MLIPs. The production of model paths and the emphasis on interpretability are concrete strengths that could aid adoption in large-scale atomistic simulations.

major comments (1)
  1. [Computational tests / benchmark results] The central empirical claim of consistent generalization improvements rests on results from 'a variety of benchmark datasets' (abstract and computational tests section). The manuscript does not demonstrate that these datasets include sufficient structural, compositional, and thermodynamic diversity (e.g., defects, surfaces, or disordered phases) to ensure the selected sparse bases remain optimal outside the training distribution; this is load-bearing for the transferability of the reported accuracy and efficiency gains.
minor comments (2)
  1. [Methods] Clarify the precise active-set algorithm variant employed and any regularization choices in the methods section to aid reproducibility.
  2. Ensure all benchmark dataset references and preprocessing steps are explicitly cited or described.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback, which helps clarify the scope and limitations of our empirical claims. We address the major comment point by point below.

Point-by-point responses
  1. Referee: The central empirical claim of consistent generalization improvements rests on results from 'a variety of benchmark datasets' (abstract and computational tests section). The manuscript does not demonstrate that these datasets include sufficient structural, compositional, and thermodynamic diversity (e.g., defects, surfaces, or disordered phases) to ensure the selected sparse bases remain optimal outside the training distribution; this is load-bearing for the transferability of the reported accuracy and efficiency gains.

    Authors: We agree that explicit documentation of dataset diversity is necessary to support claims of transferability. The benchmarks employed are standard datasets from the ACE and MLIP literature (e.g., elemental and alloy systems with varying degrees of structural complexity). In the revised manuscript we will add a dedicated paragraph and accompanying table in the Computational Tests section that quantifies the structural (defects, surfaces, grain boundaries), compositional, and thermodynamic coverage of each dataset. We will also include a brief discussion of how the active-set selection procedure adapts to these variations. Because the current results already span multiple chemistries and phases, we view this as a clarification rather than a fundamental change to the conclusions.

    Revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

The paper proposes and empirically validates active-set algorithms for data-driven feature selection inside the existing ACE framework. All load-bearing claims (improved efficiency, generalization, and interpretability of sparse versus dense models) rest on direct numerical comparisons across benchmark datasets rather than any derivation, uniqueness theorem, or fitted parameter that is redefined as a prediction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the reported chain; the results are therefore independent of the inputs they are tested against.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are described in the abstract; the work relies on standard convex optimization and existing ACE basis functions.
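For orientation, sparse least-squares selection of this kind is commonly posed as an ℓ1-regularized problem, the LASSO or basis pursuit denoising form (cf. refs. [16], [44]). The block below is a sketch of that standard formulation, not a transcription from the paper; the symbols are assumptions introduced here: A is the design matrix of basis-function values, y the vector of training targets, c the coefficient vector, and λ a regularization weight.

```latex
\min_{c \in \mathbb{R}^{N}} \; \tfrac{1}{2}\,\lVert A c - y \rVert_2^2 \;+\; \lambda\,\lVert c \rVert_1 ,
\qquad \lambda \ge 0 .
```

For λ at or above ‖Aᵀy‖∞ the minimizer is c = 0, and lowering λ lets coefficients become nonzero, which is how a single sweep over λ can produce a family of models of increasing size.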

pith-pipeline@v0.9.0 · 5656 in / 1068 out tokens · 35476 ms · 2026-05-22T19:13:08.398105+00:00 · methodology

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages

  1. [1]

    Atomic cluster expansion: Completeness, efficiency and stability

    M. Bachmayr, G. Csányi, G. Dusson, R. Drautz, S. Etter, C. Oord, and C. Ortner. Atomic cluster expansion: Completeness, efficiency and stability. J. Comput. Phys., 454, 2022

  2. [2]

    Polynomial approximation of symmetric functions

    M. Bachmayr, G. Dusson, C. Ortner, and J. Thomas. Polynomial approximation of symmetric functions. Math. Comp., 93:811–839, 2024

  3. [3]

    A. P. Bartók, M. C. Payne, R. Kondor, and G. Csányi. Gaussian approximation potentials: The accuracy of quantum mechanics, without the electrons. Phys. Rev. Lett., 104:136403, Apr 2010

  4. [4]

    Machine learning a general-purpose interatomic potential for silicon

    Albert P. Bartók, James Kermode, Noam Bernstein, and Gábor Csányi. Machine learning a general-purpose interatomic potential for silicon. Phys. Rev. X, 8:041048, Dec 2018

  5. [5]

    On representing chemical environments

    Albert P. Bartók, Risi Kondor, and Gábor Csányi. On representing chemical environments. Physical Review B, 87(18), May 2013

  6. [6]

    Atom-centered symmetry functions for constructing high-dimensional neural network potentials

    Jörg Behler. Atom-centered symmetry functions for constructing high-dimensional neural network potentials. The Journal of Chemical Physics, 134(7):074106, 02 2011

  7. [7]

    De novo exploration and self-guided learning of potential-energy surfaces

    N. Bernstein, G. Csányi, and V. Deringer. De novo exploration and self-guided learning of potential-energy surfaces. npj Computational Materials, 5, 12 2019

  8. [8]

    Introduction to Linear Optimization

    Dimitris Bertsimas and John Tsitsiklis. Introduction to Linear Optimization. 01 1998

  9. [9]

    Permutationally invariant potential energy surfaces in high dimensionality

    Bastiaan J. Braams and Joel M. Bowman. Permutationally invariant potential energy surfaces in high dimensionality. International Reviews in Physical Chemistry, 28(4):577–606, 2009

  10. [10]

    K. Burke. Perspective on density functional theory. The Journal of Chemical Physics, 136:150901, 04 2012

  11. [11]

    An introduction to compressive sampling

    Emmanuel Candès and Michael Wakin. An introduction to compressive sampling. IEEE Signal Processing Magazine, 25(2):21–30, 04 2008

  12. [12]

    Stable signal recovery from incomplete and inaccurate measurements

    Emmanuel J. Candès, Justin K. Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2006

  13. [13]

    Tony F. Chan. Rank revealing QR factorizations. Linear Algebra and its Applications, 88-89:67–82, 1987

  14. [14]

    Graph networks as a universal machine learning framework for molecules and crystals

    Chi Chen, Weike Ye, Yunxing Zuo, Chen Zheng, and Shyue Ong. Graph networks as a universal machine learning framework for molecules and crystals. Chemistry of Materials, 31, 04 2019

  15. [15]

    Learning properties of ordered and disordered materials from multi-fidelity data

    Chi Chen, Yunxing Zuo, Weike Ye, Xiang-Guo Li, and Shyue Ong. Learning properties of ordered and disordered materials from multi-fidelity data. Nature Computational Science, 1:46–53, 01 2021

  16. [16]

    Atomic decomposition by basis pursuit

    Scott Shaobing Chen, David L. Donoho, and Michael A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998

  17. [17]

    Atomic decomposition by basis pursuit

    Scott Shaobing Chen, David L. Donoho, and Michael A. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 2001

  18. [18]

    Ab initio thermodynamics of liquid and solid water

    B. Cheng, E. A. Engel, J. Behler, C. Dellago, and M. Ceriotti. Ab initio thermodynamics of liquid and solid water. Proceedings of the National Academy of Sciences, 116(4):1110–1115, 2019

  19. [19]

    Cartesian atomic cluster expansion for machine learning interatomic potentials

    Bingqing Cheng. Cartesian atomic cluster expansion for machine learning interatomic potentials. npj Computational Materials, 10, 07 2024

  20. [20]

    First principles methods using CASTEP

    Stewart J. Clark, Matthew D. Segall, Chris J. Pickard, Phil J. Hasnip, Matt I. J. Probert, Keith Refson, and Mike C. Payne. First principles methods using CASTEP. Zeitschrift für Kristallographie - Crystalline Materials, 220(5-6):567–570, 2005

  21. [21]

    M. S. Daw, S. M. Foiles, and M. I. Baskes. The embedded-atom method: a review of theory and applications. Materials Science Reports, 9(7):251–310, 1993

  22. [22]

    Mathematics for Machine Learning

    Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. Mathematics for Machine Learning. Cambridge University Press, 2020

  23. [23]

    Realistic atomistic structure of amorphous silicon from machine-learning-driven molecular dynamics

    Volker L. Deringer, Noam Bernstein, Albert P. Bartók, Matthew J. Cliffe, Rachel N. Kerber, Lauren E. Marbella, Clare P. Grey, Stephen R. Elliott, and Gábor Csányi. Realistic atomistic structure of amorphous silicon from machine-learning-driven molecular dynamics. The Journal of Physical Chemistry Letters, 9(11):2879–2885, Jun 2018

  24. [24]

    Machine learning interatomic potentials as emerging tools for materials science

    Volker L. Deringer, Miguel A. Caro, and Gábor Csányi. Machine learning interatomic potentials as emerging tools for materials science. Advanced Materials, 31(46):1902765, 2019

  25. [25]

    Atomic cluster expansion for accurate and transferable interatomic potentials

    Ralf Drautz. Atomic cluster expansion for accurate and transferable interatomic potentials. Phys. Rev. B, 99:014104, Jan 2019

  26. [26]

    Robust solutions to least-squares problems with uncertain data

    Laurent El Ghaoui and Hervé Lebret. Robust solutions to least-squares problems with uncertain data. SIAM Journal on Matrix Analysis and Applications, 18(4):1035–1064, 1997

  27. [27]

    M. P. Friedlander and M. A. Saunders. A dual active-set quadratic programming method for finding sparse least-squares solutions. Technical report, Department of Computer Science, University of British Columbia, July 30 2012

  28. [28]

    QRupdate.jl - a Julia package for updating QR factorizations

    Michael P. Friedlander and contributors. QRupdate.jl - a Julia package for updating QR factorizations. https://github.com/mpf/QRupdate.jl, 2012. Accessed: 2024-10-29

  29. [29]

    Active-set pursuit: an active-set solver for basis pursuit and related sparse optimization problems

    Michael P. Friedlander and Michael A. Saunders. Active-set pursuit: an active-set solver for basis pursuit and related sparse optimization problems. https://github.com/MPF-Optimization-Laboratory/asp

  30. [30]

    Minima of functions of several variables with inequalities as side conditions

    William Karush. Minima of functions of several variables with inequalities as side conditions. Master’s thesis, Department of Mathematics, University of Chicago, Chicago, IL, USA, 1939

  31. [31]

    H. W. Kuhn and A. W. Tucker. Nonlinear Programming, pages 481–492. University of California Press, Berkeley, 1951

  32. [32]

    Distributed learning with regularized least squares

    Shao-Bo Lin, Xin Guo, and Ding-Xuan Zhou. Distributed learning with regularized least squares. Journal of Machine Learning Research, 18(92):1–31, 2017

  33. [33]

    David J. C. MacKay. Bayesian Non-Linear Modeling for the Prediction Competition, pages 221–234. Springer Netherlands, Dordrecht, 1996

  34. [34]

    Bayesian interpolation

    David John Cameron MacKay. Bayesian interpolation. Neural Computation, 4:415–447, 1992

  35. [35]

    A Wavelet Tour of Signal Processing, Third Edition: The Sparse Way

    Stéphane Mallat. A Wavelet Tour of Signal Processing, Third Edition: The Sparse Way. Academic Press, Inc., USA, 3rd edition, 2008

  36. [36]

    B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234, 1995

  37. [37]

    A new approach to variable selection in least squares problems

    M. R. Osborne, B. Presnell, and B. A. Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3):389–403, 07 2000

  38. [38]

    Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad. Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In Proceedings of 27th Asilomar Conference on Signals, Systems and Computers, pages 40–44 vol.1, 1993

  39. [39]

    Scikit-learn: Machine learning in Python

    F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011

  40. [40]

    Graph neural networks for materials science and chemistry

    Patrick Reiser, Marlen Neubert, André Eberhard, Luca Torresi, Chen Zhou, Chen Shao, Houssam Metni, Clint Hoesel, Henrik Schopmans, Timo Sommer, and Pascal Friederich. Graph neural networks for materials science and chemistry. Communications Materials, 3, 11 2022

  41. [41]

    Moment tensor potentials: A class of systematically improvable interatomic potentials

    Alexander Shapeev. Moment tensor potentials: A class of systematically improvable interatomic potentials. Multiscale Modeling & Simulation, 14, 12 2015

  42. [42]

    J. Tersoff. New empirical approach for the structure and energy of covalent systems. Phys. Rev. B, 37:6991–7000, Apr 1988

  43. [43]

    Spectral neighbor analysis method for automated generation of quantum-accurate interatomic potentials

    A. P. Thompson, L. P. Swiler, C. R. Trott, S. M. Foiles, and G. J. Tucker. Spectral neighbor analysis method for automated generation of quantum-accurate interatomic potentials. Journal of Computational Physics, 285:316–330, 2015

  44. [44]

    Regression shrinkage and selection via the lasso

    Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58:267–288, 1996

  45. [45]

    Regression shrinkage and selection via the lasso

    Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288, 1996

  46. [46]

    Water dynamics: Relation between hydrogen bond bifurcations, molecular jumps, local density & hydrophobicity

    John Tatini Titantah and Mikko Karttunen. Water dynamics: Relation between hydrogen bond bifurcations, molecular jumps, local density & hydrophobicity. Scientific Reports, 3(1):2991, Oct 2013

  47. [47]

    ActiveSetPursuit.jl: A Julia implementation of active set pursuit algorithms for sparse optimization

    Tina Torabi. ActiveSetPursuit.jl: A Julia implementation of active set pursuit algorithms for sparse optimization. https://github.com/MPF-Optimization-Laboratory/ActiveSetPursuit.jl, 2024. Accessed: 2024-11-05

  48. [48]

    On asymptotically optimal confidence regions and tests for high-dimensional models

    Sara van de Geer, Peter Bühlmann, Ya’acov Ritov, and Ruben Dezeure. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202, 2014

  49. [49]

    Probing the Pareto frontier for basis pursuit solutions

    Ewout van den Berg and Michael P. Friedlander. Probing the Pareto frontier for basis pursuit solutions. SIAM Journal on Scientific Computing, 31(2):890–912, 2009

  50. [50]

    Hyperactive learning for data-driven interatomic potentials

    Cas van der Oord, Matthias Sachs, Dávid Péter Kovács, Christoph Ortner, and Gábor Csányi. Hyperactive learning for data-driven interatomic potentials. npj Computational Materials, 9, 2022

  51. [51]

    Linear programming: Foundations and extensions

    Robert Vanderbei. Linear programming: Foundations and extensions. Journal of the Operational Research Society, 49, 03 2002

  52. [52]

    W. C. Witt, C. van der Oord, E. Gelžinytė, T. Järvinen, A. Ross, J. P. Darby, C. Hin Ho, W. J. Baldwin, M. Sachs, J. Kermode, N. Bernstein, G. Csányi, and C. Ortner. ACEpotentials.jl: A Julia implementation of the atomic cluster expansion. The Journal of Chemical Physics, 159(16):164101, 10 2023

  53. [53]

    Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties

    Tian Xie and Jeffrey C. Grossman. Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties. Phys. Rev. Lett., 120:145301, Apr 2018

  54. [54]

    Cun-Hui Zhang and Stephanie S. Zhang. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 76(1):217–242, 07 2013

  55. [55]

    Y. Zuo, C. Chen, X. Li, Z. Deng, Y. Chen, J. Behler, G. Csányi, A. Shapeev, A. Thompson, M. A. Wood, and Shyue P. Ong. Performance and cost assessment of machine learning interatomic potentials. The Journal of Physical Chemistry A, 124(4):731–745, Jan 2020