Accurate predictive model of band gap with selected important features based on explainable machine learning
Pith reviewed 2026-05-23 01:18 UTC · model grok-4.3
The pith
An explainable ML approach selects five key features that predict band gaps with accuracy matching an eighteen-feature model while improving performance on unseen materials.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Guided by feature importance derived from explainable ML (XML) applied to a pristine SVR model, a compact model built from the top five features achieves accuracy comparable to the full 18-feature model on in-domain data (MAE 0.254 eV vs. 0.247 eV) and generalizes better to out-of-domain data (MAE 0.348 eV vs. 0.460 eV).
What carries the argument
Permutation feature importance and SHAP values applied to a support vector regression model after eliminating features with correlation above 0.8.
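To make the permutation step concrete, here is a minimal sketch of permutation feature importance written from scratch. It is not the authors' code: `toy_predict` is a hypothetical stand-in for a fitted SVR, the data are synthetic, and the scoring choice (mean increase in MAE over repeated shuffles) is an assumption. SHAP is not shown.

```python
import random

def mae(y_true, y_pred):
    """Mean absolute error of two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MAE when each column of X is shuffled in turn.

    predict: callable mapping a list of rows to a list of predictions.
    X: list of rows (lists of floats); y: list of targets.
    """
    rng = random.Random(seed)
    baseline = mae(y, predict(X))
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mae(y, predict(X_perm)) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Hypothetical stand-in for a fitted regressor: uses columns 0 and 1 only.
def toy_predict(rows):
    return [3.0 * r[0] + 0.5 * r[1] for r in rows]

data_rng = random.Random(42)
X = [[data_rng.random() for _ in range(3)] for _ in range(200)]
y = toy_predict(X)

scores = permutation_importance(toy_predict, X, y)
# Feature indices ordered from most to least important.
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

A feature the model ignores gets a score of zero because shuffling it cannot change the predictions. Two strongly correlated features can share or distort each other's scores, which is the stated reason for filtering at 0.8 first.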
Load-bearing premise
The features identified as most important on the training distribution continue to be the most informative when applied to materials with different feature distributions.
What would settle it
Evaluating both models on a large independent dataset of out-of-domain compounds and checking whether the five-feature model's error remains lower than the full model's.
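One way to run this check is a paired bootstrap over compounds, sketched below. The per-compound errors are synthetic placeholders, not values from the paper, and the function name, the 95% percentile interval, and the number of resamples are all assumptions made for illustration.

```python
import random

def bootstrap_mae_difference(err_full, err_compact, n_boot=2000, seed=0):
    """Point estimate and 95% percentile interval for MAE(full) - MAE(compact).

    err_full, err_compact: per-compound absolute errors (eV) of the two
    models on the same out-of-domain compounds, in the same order.
    """
    rng = random.Random(seed)
    n = len(err_full)
    diffs = []
    for _ in range(n_boot):
        # Resample compounds with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(err_full[i] - err_compact[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int(0.025 * n_boot)]
    upper = diffs[int(0.975 * n_boot) - 1]
    point = sum(f - c for f, c in zip(err_full, err_compact)) / n
    return point, lower, upper

# Synthetic absolute errors, for illustration only.
data_rng = random.Random(1)
err_compact = [abs(data_rng.gauss(0.0, 0.4)) for _ in range(300)]
err_full = [e + abs(data_rng.gauss(0.0, 0.15)) for e in err_compact]

point, lower, upper = bootstrap_mae_difference(err_full, err_compact)
```

If the lower bound of the interval stays above zero on a large independent set, the five-feature model's advantage is unlikely to be an artifact of which compounds happened to be sampled.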
Original abstract
In the rapidly advancing field of materials informatics, nonlinear machine learning models have demonstrated exceptional predictive capabilities for material properties. However, their black-box nature limits interpretability, and they may incorporate features that do not contribute to -- or even deteriorate -- model performance. This study employs explainable ML (XML) techniques, including permutation feature importance and the SHapley Additive exPlanation, applied to a pristine support vector regression model designed to predict band gaps at the GW level using 18 input features. Guided by XML-derived individual feature importance, a simple framework is proposed to construct reduced-feature predictive models. Model evaluations indicate that an XML-guided compact model, consisting of the top five features, achieves comparable accuracy to the pristine model on in-domain datasets (0.254 vs. 0.247 eV) while showing improved generalization with lower prediction errors on out-of-domain data (0.348 vs. 0.460 eV). Additionally, the study underscores the necessity for eliminating strongly correlated features (correlation coefficient greater than 0.8) to prevent misinterpretation and overestimation of feature importance before applying XML. This study highlights XML's effectiveness in developing simplified yet highly accurate machine learning models by clarifying feature roles, thereby reducing computational costs for feature acquisition and enhancing model trustworthiness for materials discovery.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops an explainable ML (XML) framework for GW band-gap prediction using support vector regression on 18 features. After removing features whose pairwise correlation coefficient exceeds 0.8, permutation importance and SHAP are used to rank features on the training distribution. A compact model retaining only the top five features is reported to achieve an in-domain MAE of 0.254 eV (versus 0.247 eV for the full model) and an out-of-domain (OOD) MAE of 0.348 eV (versus 0.460 eV), supporting the claims of comparable accuracy, improved generalization, and lower feature-acquisition cost.
Significance. If the reported OOD improvement is shown to be robust, the work would demonstrate that XML-guided feature reduction can produce simpler, more trustworthy models that generalize better than the full-feature baseline, directly addressing computational cost in materials discovery. The explicit provision of separate in-domain and out-of-domain MAE numbers is a concrete strength that facilitates direct assessment of generalization.
major comments (2)
- [Methods] The Methods section supplies no information on dataset size, the train-test split protocol, the definition or selection criteria for out-of-domain compounds, or the hyperparameter tuning procedure for the SVR model. These details are required to judge whether the 0.112 eV OOD MAE reduction is statistically meaningful or sensitive to the particular split.
- [Results (out-of-domain evaluation)] In the out-of-domain results, the top-five features are selected exclusively by permutation importance and SHAP computed on the training distribution. No ablation is presented that recomputes importance rankings on the OOD set itself or compares the XML-selected five features against the five features that would minimize OOD error if chosen directly from the OOD distribution; this test is necessary to establish that the reported generalization gain is attributable to the train-derived ranking rather than to an arbitrary five-feature subset.
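The "arbitrary five-feature subset" baseline in the second comment can be illustrated by scoring every five-feature subset on the OOD split and locating the selected subset in that ranking. The sketch below rests on loud assumptions: the data are synthetic with eight features rather than 18, ordinary least squares stands in for SVR, and `selected` is a hypothetical placeholder for the XML-ranked top five.

```python
import itertools
import numpy as np

def fit_predict_mae(X_train, y_train, X_test, y_test, cols):
    """Fit least squares (with intercept) on the chosen columns; return test MAE."""
    cols = list(cols)
    A_train = np.column_stack([X_train[:, cols], np.ones(len(X_train))])
    A_test = np.column_stack([X_test[:, cols], np.ones(len(X_test))])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return float(np.mean(np.abs(A_test @ coef - y_test)))

rng = np.random.default_rng(0)
n_features, k = 8, 5
# Synthetic ground truth: only the first five features carry signal.
weights = np.array([1.0, 0.9, 0.8, 0.7, 0.6, 0.0, 0.0, 0.0])

X_train = rng.normal(size=(200, n_features))
y_train = X_train @ weights + 0.05 * rng.normal(size=200)
# The OOD split is drawn from a shifted feature distribution.
X_ood = rng.normal(loc=1.5, size=(100, n_features))
y_ood = X_ood @ weights + 0.05 * rng.normal(size=100)

selected = (0, 1, 2, 3, 4)  # hypothetical stand-in for the XML-selected top five

# OOD MAE of every possible five-feature subset.
subset_mae = {
    cols: fit_predict_mae(X_train, y_train, X_ood, y_ood, cols)
    for cols in itertools.combinations(range(n_features), k)
}
ranked = sorted(subset_mae, key=subset_mae.get)
rank_of_selected = ranked.index(selected) + 1  # 1 = lowest OOD error
```

With 18 features there are 8568 five-feature subsets, so retraining on all of them is feasible for a cheap model, while a random sample of subsets would serve the same purpose for a costlier one. Reporting where the selected subset falls in this distribution uses OOD labels only for evaluation, not for selection.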
minor comments (1)
- [Abstract] The abstract states the necessity of removing features with correlation >0.8 before XML but does not indicate how many features were actually discarded or which pairs exceeded the threshold; adding this information would improve reproducibility.
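The information requested here (which pairs exceed the threshold and how many features are dropped) could be produced by a short routine such as the following. It is a minimal sketch assuming absolute Pearson correlation and a greedy keep-first rule. The feature names and values are hypothetical, and the paper's own rule for choosing which member of a pair to discard is not stated in the text above.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every feature pair with |r| > threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) > threshold:
                pairs.append((a, b, r))
    return pairs

def drop_correlated(columns, threshold=0.8):
    """Greedy filter: keep a feature only if it is not strongly
    correlated with any feature that has already been kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy features.
features = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.1, 3.9, 6.2, 8.0, 9.9],  # nearly 2 * f1
    "f3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
pairs = correlated_pairs(features)
kept = drop_correlated(features)
```

In this toy case only the pair (f1, f2) exceeds the threshold, so f2 is dropped and f1 and f3 are kept. Printing `pairs` and the difference between the original and kept feature lists would give exactly the two numbers the comment asks for.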
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and indicate planned revisions to the manuscript.
Point-by-point responses
- Referee: [Methods] The Methods section supplies no information on dataset size, the train-test split protocol, the definition or selection criteria for out-of-domain compounds, or the hyperparameter tuning procedure for the SVR model. These details are required to judge whether the 0.112 eV OOD MAE reduction is statistically meaningful or sensitive to the particular split.
  Authors: We agree that these methodological details are essential for reproducibility and assessing robustness. The revised manuscript will expand the Methods section to report the dataset size, the train-test split protocol, the definition and selection criteria for out-of-domain compounds, and the hyperparameter tuning procedure for the SVR model. This will allow readers to evaluate the statistical significance of the observed OOD improvement.
  Revision: yes
- Referee: [Results (out-of-domain evaluation)] In the out-of-domain results, the top-five features are selected exclusively by permutation importance and SHAP computed on the training distribution. No ablation is presented that recomputes importance rankings on the OOD set itself or compares the XML-selected five features against the five features that would minimize OOD error if chosen directly from the OOD distribution; this test is necessary to establish that the reported generalization gain is attributable to the train-derived ranking rather than to an arbitrary five-feature subset.
  Authors: We maintain that recomputing importance rankings or optimizing feature selection directly on the OOD set is neither necessary nor appropriate for validating the proposed framework. The XML-guided selection is intentionally performed on training data alone to reflect realistic deployment conditions where OOD labels are unavailable. Performing feature selection on OOD data would constitute leakage and would not demonstrate the practical value of train-derived explanations. The reported results already show that the train-selected five-feature model outperforms the full model on OOD data. In the revision we will add an explicit discussion clarifying this rationale and the limitations of alternative ablations.
  Revision: partial
Circularity Check
No circularity; standard supervised feature-selection pipeline on external labels
Full rationale
The derivation consists of (1) training an SVR regressor on 18 features to predict external GW band-gap values, (2) computing permutation importance and SHAP on the fitted model, (3) selecting the top-five features by those scores, and (4) retraining and evaluating MAE on in-domain and OOD splits. None of these steps defines a target quantity in terms of itself, renames a fitted parameter as a prediction, or relies on a self-citation chain for a uniqueness claim. The reported OOD improvement is an empirical observation on held-out compounds, not a quantity forced by construction from the training distribution.
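The four steps can be sketched end to end with scikit-learn. This is an illustrative outline under stated assumptions, not the authors' implementation. The data are synthetic with 18 features of which five carry signal, the OOD split is a simple mean shift, the SVR hyperparameters are arbitrary, and only permutation importance is used (SHAP is omitted).

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n_features = 18
weights = np.zeros(n_features)
weights[:5] = [2.0, 1.5, 1.0, 0.8, 0.6]  # synthetic: five informative features

def make_data(n, shift=0.0):
    """Draw synthetic features (optionally mean-shifted) and noisy targets."""
    X = rng.normal(loc=shift, size=(n, n_features))
    y = X @ weights + 0.1 * rng.normal(size=n)
    return X, y

X_train, y_train = make_data(300)
X_test, y_test = make_data(100)            # in-domain hold-out
X_ood, y_ood = make_data(100, shift=0.5)   # shifted "out-of-domain" split

def new_model():
    """Scaled RBF support vector regressor; hyperparameters are placeholders."""
    return make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))

# (1) Train the full 18-feature model.
full = new_model().fit(X_train, y_train)

# (2) Permutation importance computed on the training data only.
result = permutation_importance(
    full, X_train, y_train,
    scoring="neg_mean_absolute_error", n_repeats=5, random_state=0,
)

# (3) Indices of the five highest-scoring features.
top5 = np.argsort(result.importances_mean)[::-1][:5]

# (4) Retrain on the reduced feature set and evaluate both models.
compact = new_model().fit(X_train[:, top5], y_train)
report = {
    "full_in": mean_absolute_error(y_test, full.predict(X_test)),
    "compact_in": mean_absolute_error(y_test, compact.predict(X_test[:, top5])),
    "full_ood": mean_absolute_error(y_ood, full.predict(X_ood)),
    "compact_ood": mean_absolute_error(y_ood, compact.predict(X_ood[:, top5])),
}
```

Nothing in this flow feeds test or OOD labels back into feature selection, which is the property the circularity check relies on. The MAE values in `report` depend entirely on the synthetic setup and say nothing about the paper's numbers.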
Data Availability
The in-domain data can be obtained at http://github.com/JoohwiLEE/GWgap_predictor_data. Other raw/processed data ca...