
arXiv: 2503.04492 · v3 · submitted 2025-03-06 · cond-mat.mtrl-sci · cs.LG

Accurate predictive model of band gap with selected important features based on explainable machine learning

Pith reviewed 2026-05-23 01:18 UTC · model grok-4.3

classification cond-mat.mtrl-sci · cs.LG
keywords band gap prediction · explainable machine learning · feature selection · support vector regression · materials informatics · SHAP · permutation importance · GW approximation

The pith

An explainable ML approach selects five key features that predict band gaps about as accurately as an eighteen-feature model while lowering errors on out-of-domain materials.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies permutation importance and SHAP explanations to a support vector regression model trained on eighteen features to predict GW-level band gaps. It shows that a reduced model using only the top five features performs nearly as well on data from the same distribution but yields lower errors on compounds outside the training domain. This simplification also requires first removing highly correlated features to avoid misleading importance rankings. The result is a more interpretable and computationally efficient predictor for materials properties.

Core claim

Guided by XML-derived feature importance from a pristine SVR model, a compact model with the top five features achieves comparable accuracy to the full model on in-domain datasets (MAE 0.254 eV vs 0.247 eV) and improved generalization on out-of-domain data (0.348 eV vs 0.460 eV).
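
The size of the two gaps is quick to check. A small sketch using only the four error values quoted above (the variable and function names are illustrative, not from the paper):

```python
# Error values (eV) quoted in the core claim; names are illustrative.
in_domain = {"compact": 0.254, "full": 0.247}
out_of_domain = {"compact": 0.348, "full": 0.460}

def change(errors):
    """Absolute (eV) and relative (%) change of the compact model versus the full model."""
    diff = errors["compact"] - errors["full"]
    return round(diff, 3), round(100 * diff / errors["full"], 1)

in_domain_change = change(in_domain)          # (0.007, 2.8)
out_of_domain_change = change(out_of_domain)  # (-0.112, -24.3)
```

In other words, the compact model gives up about 0.007 eV (roughly 3%) in-domain and gains about 0.112 eV (roughly 24%) out-of-domain.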

What carries the argument

Permutation feature importance and SHAP values applied to a support vector regression model after eliminating features with correlation above 0.8.
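
The correlation filter can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: the source gives only the 0.8 threshold, so the greedy rule used here (keep the first feature of a correlated pair, in column order) and the toy data are assumptions.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length lists (assumes neither is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def drop_correlated(columns, threshold=0.8):
    """Greedy filter: keep a feature only if |r| with every kept feature is <= threshold.

    columns: dict mapping feature name -> list of values.
    Assumption: ties are resolved by column order; the paper may use another rule.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Toy data for illustration only.
toy = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],  # nearly 2 * a, so strongly correlated with "a"
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept = drop_correlated(toy)  # "b" is dropped; "a" and "c" remain
```

Which member of a correlated pair is dropped changes the importance ranking that follows, so the rule should be stated explicitly in any real use.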

Load-bearing premise

The features identified as most important on the training distribution continue to be the most informative when applied to materials with different feature distributions.

What would settle it

Evaluating both models on a large independent dataset of out-of-domain compounds and checking whether the five-feature model's error remains lower than the full model's.

Figures

Figures reproduced from arXiv: 2503.04492 by Joohwi Lee, Kaito Miyamoto.

Figure 1: XML importance scores of SVR regression model for predicting E_g^GW using 11-feature set. For each feature, the paired bars represent the PFI (left bar) and SHAP importance (right bar). The PFI score is calculated as the increase in the RMSE for the analysis using the test dataset when the values of a specific feature are shuffled, where the predictive model is trained using the training dataset. The SHAP …

Figure 2: SVR regression models for E_g^GW prediction using various feature sets. (a) Dependence of the RMSE for the test in-domain dataset (cyan rectangles) and generalization gap (orange ×, right vertical axis) on the number of features selected based on XML importance scores. (b) Dependence of the RMSE for the OOD dataset (cyan rectangles) and the predicted value deviations (orange ×, right vertical axis) on the …

Figure 3: (a) Top five features with SHAP importance scores for SVR regression model for E_g^GW prediction using 18-feature set. The SHAP importance scores for all 18 features are provided in Supplementary Fig. S10. Detailed information relevant to most options, such as error bars and colors for the bar graph, is presented in …
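
The Figure 1 caption defines the PFI score as the increase in test-set RMSE when one feature's values are shuffled. A minimal pure-Python sketch of that definition follows; the function names, the number of repeats, and the toy model are assumptions made for illustration.

```python
import math
import random

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def permutation_importance_rmse(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in RMSE when one feature column is shuffled.

    predict: callable mapping one feature row (list) to a prediction.
    X: list of feature rows from the held-out test set; y: matching targets.
    """
    rng = random.Random(seed)
    baseline = rmse(y, [predict(row) for row in X])
    scores = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(rmse(y, [predict(r) for r in shuffled]) - baseline)
        scores.append(sum(increases) / n_repeats)
    return scores

# Hypothetical stand-in for a trained model: it uses feature 0 and ignores feature 1.
def toy_model(row):
    return 2.0 * row[0]

X_test = [[float(i), float((7 * i) % 5)] for i in range(20)]
y_test = [2.0 * row[0] for row in X_test]
pfi = permutation_importance_rmse(toy_model, X_test, y_test)
```

Here the ignored feature scores exactly zero and the used feature scores above zero, which is the behavior the caption's definition implies.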
Original abstract

In the rapidly advancing field of materials informatics, nonlinear machine learning models have demonstrated exceptional predictive capabilities for material properties. However, their black-box nature limits interpretability, and they may incorporate features that do not contribute to -- or even deteriorate -- model performance. This study employs explainable ML (XML) techniques, including permutation feature importance and the SHapley Additive exPlanation, applied to a pristine support vector regression model designed to predict band gaps at the GW level using 18 input features. Guided by XML-derived individual feature importance, a simple framework is proposed to construct reduced-feature predictive models. Model evaluations indicate that an XML-guided compact model, consisting of the top five features, achieves comparable accuracy to the pristine model on in-domain datasets (0.254 vs. 0.247 eV) while showing improved generalization with lower prediction errors on out-of-domain data (0.348 vs. 0.460 eV). Additionally, the study underscores the necessity for eliminating strongly correlated features (correlation coefficient greater than 0.8) to prevent misinterpretation and overestimation of feature importance before applying XML. This study highlights XML's effectiveness in developing simplified yet highly accurate machine learning models by clarifying feature roles, thereby reducing computational costs for feature acquisition and enhancing model trustworthiness for materials discovery.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript develops an explainable ML framework for GW band-gap prediction using support vector regression on 18 features. After removing strongly correlated features (correlation coefficient >0.8), permutation importance and SHAP are used to rank features on the training distribution; a compact model retaining only the top five features is reported to achieve in-domain MAE of 0.254 eV (versus 0.247 eV for the full model) and out-of-domain MAE of 0.348 eV (versus 0.460 eV), supporting claims of comparable accuracy with improved generalization and lower feature-acquisition cost.

Significance. If the reported OOD improvement is shown to be robust, the work would demonstrate that XML-guided feature reduction can produce simpler, more trustworthy models that generalize better than the full-feature baseline, directly addressing computational cost in materials discovery. The explicit provision of separate in-domain and out-of-domain MAE numbers is a concrete strength that facilitates direct assessment of generalization.

major comments (2)
  1. [Methods] The Methods section supplies no information on dataset size, the train-test split protocol, the definition or selection criteria for out-of-domain compounds, or the hyperparameter tuning procedure for the SVR model. These details are required to judge whether the 0.112 eV OOD MAE reduction is statistically meaningful or sensitive to the particular split.
  2. [Results (out-of-domain evaluation)] In the out-of-domain results, the top-five features are selected exclusively by permutation importance and SHAP computed on the training distribution. No ablation is presented that recomputes importance rankings on the OOD set itself or compares the XML-selected five features against the five features that would minimize OOD error if chosen directly from the OOD distribution; this test is necessary to establish that the reported generalization gain is attributable to the train-derived ranking rather than to an arbitrary five-feature subset.
minor comments (1)
  1. [Abstract] The abstract states the necessity of removing features with correlation >0.8 before XML but does not indicate how many features were actually discarded or which pairs exceeded the threshold; adding this information would improve reproducibility.
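
One way to probe whether an OOD MAE reduction of this kind is statistically meaningful is a paired bootstrap over compounds. The sketch below is an illustration under stated assumptions: the error values are made up (they are not the paper's data), and the function name and percentile-interval choice are ours.

```python
import random

def bootstrap_mae_difference(abs_err_a, abs_err_b, n_boot=2000, seed=0, alpha=0.05):
    """Paired bootstrap over compounds for MAE(a) - MAE(b).

    Both lists hold per-compound absolute errors on the SAME compounds, in the same order.
    Returns (observed difference, lower bound, upper bound) of a percentile interval.
    """
    n = len(abs_err_a)
    rng = random.Random(seed)
    observed = (sum(abs_err_a) - sum(abs_err_b)) / n
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample compounds with replacement
        diffs.append(sum(abs_err_a[i] - abs_err_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return observed, lower, upper

# Made-up absolute errors (eV) for illustration only; NOT the paper's data.
full_errors = [0.52, 0.31, 0.64, 0.40, 0.47, 0.58, 0.29, 0.45]
compact_errors = [0.40, 0.25, 0.47, 0.33, 0.36, 0.41, 0.24, 0.32]
observed, lower, upper = bootstrap_mae_difference(full_errors, compact_errors)
```

An interval that excludes zero suggests the reduction does not hinge on a few compounds in the OOD set. It says nothing about sensitivity to the train-test split, which would need repeated splits.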

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate planned revisions to the manuscript.

Point-by-point responses
  1. Referee: [Methods] The Methods section supplies no information on dataset size, the train-test split protocol, the definition or selection criteria for out-of-domain compounds, or the hyperparameter tuning procedure for the SVR model. These details are required to judge whether the 0.112 eV OOD MAE reduction is statistically meaningful or sensitive to the particular split.

    Authors: We agree that these methodological details are essential for reproducibility and assessing robustness. The revised manuscript will expand the Methods section to report the dataset size, the train-test split protocol, the definition and selection criteria for out-of-domain compounds, and the hyperparameter tuning procedure for the SVR model. This will allow readers to evaluate the statistical significance of the observed OOD improvement. revision: yes

  2. Referee: [Results (out-of-domain evaluation)] In the out-of-domain results, the top-five features are selected exclusively by permutation importance and SHAP computed on the training distribution. No ablation is presented that recomputes importance rankings on the OOD set itself or compares the XML-selected five features against the five features that would minimize OOD error if chosen directly from the OOD distribution; this test is necessary to establish that the reported generalization gain is attributable to the train-derived ranking rather than to an arbitrary five-feature subset.

    Authors: We maintain that recomputing importance rankings or optimizing feature selection directly on the OOD set is neither necessary nor appropriate for validating the proposed framework. The XML-guided selection is intentionally performed on training data alone to reflect realistic deployment conditions where OOD labels are unavailable. Performing feature selection on OOD data would constitute leakage and would not demonstrate the practical value of train-derived explanations. The reported results already show that the train-selected five-feature model outperforms the full model on OOD data. In the revision we will add an explicit discussion clarifying this rationale and the limitations of alternative ablations. revision: partial
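
The "arbitrary five-feature subset" concern can be probed without selecting features on OOD labels: keep the train-derived subset fixed and ask how often a randomly drawn subset of the same size does better. The sketch below assumes a caller-supplied `evaluate` function (hypothetical; for example, the OOD MAE of a model retrained on the given features) and uses a toy error function in its place.

```python
import math
import random

def fraction_of_random_subsets_better(evaluate, n_features, k, chosen, n_samples=500, seed=0):
    """Share of random k-feature subsets whose error is strictly lower than the chosen subset's.

    evaluate: callable taking a tuple of feature indices and returning an error.
    It is a placeholder supplied by the caller, not part of any library.
    """
    rng = random.Random(seed)
    chosen_error = evaluate(tuple(sorted(chosen)))
    better = 0
    for _ in range(n_samples):
        subset = tuple(sorted(rng.sample(range(n_features), k)))
        if evaluate(subset) < chosen_error:
            better += 1
    return better / n_samples

n_subsets = math.comb(18, 5)  # number of distinct five-feature subsets of 18 features

# Toy error function for illustration only: pretends features 0, 1, 2 are the informative ones.
def toy_error(subset):
    return 1.0 - 0.1 * len(set(subset) & {0, 1, 2})

share = fraction_of_random_subsets_better(toy_error, n_features=18, k=5, chosen=(0, 1, 2, 3, 4))
```

With 8568 possible subsets, sampling a few hundred is usually enough to see whether the selected subset sits in the tail of the distribution or near its middle. The random subsets are only evaluated on OOD data, never chosen by it, so the comparison stays within the rationale the authors give.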

Circularity Check

0 steps flagged

No circularity; standard supervised feature-selection pipeline on external labels

full rationale

The derivation consists of (1) training an SVR regressor on 18 features to predict external GW band-gap values, (2) computing permutation importance and SHAP on the fitted model, (3) selecting the top-five features by those scores, and (4) retraining and evaluating MAE on in-domain and OOD splits. None of these steps defines a target quantity in terms of itself, renames a fitted parameter as a prediction, or relies on a self-citation chain for a uniqueness claim. The reported OOD improvement is an empirical observation on held-out compounds, not a quantity forced by construction from the training distribution.
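
The four steps can be sketched with scikit-learn. This is an illustration under assumptions rather than the authors' code: the data are synthetic, the SVR hyperparameters are arbitrary, only permutation importance is used (SHAP is omitted), and two features are kept instead of five because the toy target depends on only two.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data: 6 features, of which only features 0 and 1 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

def fit_svr(X_fit, y_fit):
    # Hyperparameters are illustrative assumptions, not tuned values from the paper.
    return make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0)).fit(X_fit, y_fit)

# (1) Train the full model.
full_model = fit_svr(X_train, y_train)

# (2) Permutation importance as the increase in RMSE on held-out data.
result = permutation_importance(
    full_model, X_test, y_test,
    scoring="neg_root_mean_squared_error", n_repeats=10, random_state=0,
)

# (3) Keep the top-k features by mean importance.
top_k = 2
top_features = np.argsort(result.importances_mean)[::-1][:top_k]

# (4) Retrain on the reduced feature set and compare MAE.
compact_model = fit_svr(X_train[:, top_features], y_train)
mae_full = mean_absolute_error(y_test, full_model.predict(X_test))
mae_compact = mean_absolute_error(y_test, compact_model.predict(X_test[:, top_features]))
```

An out-of-domain evaluation would repeat step (4) on a separate set of compounds that played no part in training or feature selection.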

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the standard supervised-learning assumptions that the training distribution is representative and that feature importance rankings transfer to new distributions; no additional free parameters, axioms, or invented entities are introduced beyond those implicit in SVR and SHAP.

pith-pipeline@v0.9.0 · 5765 in / 1126 out tokens · 32318 ms · 2026-05-23T01:18:57.809351+00:00 · methodology

