Recognition: unknown
Benchmarking bandgap prediction in semiconductors under experimental and realistic evaluation settings
Pith reviewed 2026-05-07 16:00 UTC · model grok-4.3
The pith
Machine learning models for predicting semiconductor bandgaps show limited generalization from computational data to experimental measurements.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We curate an open-access dataset of experimental bandgaps with aligned crystal structures and introduce the RealMat-BaG benchmark to assess model reliability under experimentally relevant conditions. Performance is compared for graph neural networks and classical machine learning baselines across statistical and domain-based splits, with additional tests of transfer from DFT-computed to experimental bandgaps and interpretability analysis at elemental-property and structural levels. The results demonstrate the fundamental generalization limitations of existing bandgap prediction models.
What carries the argument
The RealMat-BaG benchmark, which evaluates models on a curated experimental bandgap dataset with aligned crystal structures using statistical and domain-based splits plus DFT-to-experimental transfer.
If this is right
- Future models for materials discovery should prioritize training or fine-tuning strategies that align with experimental rather than purely computational data.
- Domain shifts between computational and measured properties must be explicitly addressed to improve reliability in semiconductor applications.
- Interpretability tools operating at both elemental and structural levels can identify specific sources of prediction errors.
- The benchmark supplies a standardized evaluation protocol for comparing new learning approaches under realistic conditions.
Where Pith is reading between the lines
- This suggests that incorporating explicit physics constraints or uncertainty estimates could mitigate some observed transfer failures in practice.
- Extending similar benchmarks to related properties such as carrier mobility would test whether the same generalization issues appear across multiple semiconductor metrics.
- A direct follow-up experiment could measure how much performance improves when models are pre-trained on mixed DFT and experimental data rather than DFT alone.
Load-bearing premise
The curated experimental bandgap dataset with aligned crystal structures is representative of real-world conditions and the chosen statistical and domain-based splits fairly capture generalization challenges without selection bias.
What would settle it
A new model achieving low error rates on domain-based splits when transferring from DFT data to experimental bandgaps would indicate that the reported generalization limitations are not fundamental.
Figures
read the original abstract
Accurate bandgap prediction is crucial for semiconductor applications, yet machine learning models trained on computational data often struggle to generalize to experimental bandgap measurements. Challenges related to data fidelity, domain generalization, and model interpretability remain insufficiently addressed in existing evaluation frameworks. To bridge this gap, we introduce RealMat-BaG, a benchmark for assessing model reliability under experimentally relevant conditions. We curate an open-access dataset of experimental bandgaps with aligned crystal structures and compare graph neural networks as well as classical machine learning baselines. Our framework evaluates performance across statistical and domain-based splits, examines transfer from DFT-computed to experimental bandgaps, and analyzes interpretability at both elemental-property and structural levels. Our results reveal the fundamental generalization limitations of current bandgap prediction models and establish a benchmark aligned with experimental measurements for developing more reliable learning strategies for materials discovery.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RealMat-BaG, an open-access benchmark dataset of experimental semiconductor bandgaps paired with aligned crystal structures. It evaluates graph neural networks and classical ML baselines on statistical and domain-based data splits, assesses transfer performance from DFT-computed to experimental bandgaps, and includes interpretability analyses at elemental and structural levels. The central claim is that these evaluations reveal fundamental generalization limitations of current bandgap prediction models under experimentally relevant conditions.
Significance. If the dataset curation and splits are shown to be free of selection bias and representative of real-world semiconductor distributions, the benchmark could provide a valuable, experimentally aligned evaluation framework for improving ML models in materials discovery. The emphasis on domain generalization and interpretability addresses important gaps in existing DFT-trained models, though the strength depends on verifiable details of the data construction.
major comments (3)
- [§3] §3 (Dataset Curation): The alignment procedure between experimental bandgaps and crystal structures is not described with sufficient detail (e.g., no explicit criteria for structure matching, tolerance thresholds, or handling of multiple measurements per composition), preventing verification that the dataset avoids selection bias and is representative of real-world semiconductor distributions as required for the 'fundamental limitations' claim.
- [§4.2] §4.2 (Domain-based splits): The rules for constructing domain-based splits (e.g., by material class, space group, or elemental composition) are not specified, so it is impossible to confirm that performance drops reflect inherent model generalization failures rather than artifacts of how the splits were defined or potential data leakage.
- [§5] §5 (Results and Transfer Learning): The reported gaps between DFT-to-experimental transfer and in-domain performance are presented without accompanying statistical significance tests, confidence intervals on metrics, or ablation on dataset size/composition, weakening the assertion that these gaps demonstrate fundamental rather than dataset-specific limitations.
minor comments (2)
- The abstract states the dataset is 'open-access' but the manuscript does not provide a direct link, DOI, or repository identifier in the main text or data availability statement.
- Figure captions for performance plots lack explicit definitions of the error metrics used (e.g., MAE vs. RMSE) and do not indicate the number of runs or random seeds for reported averages.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We have revised the manuscript to address each major point by expanding descriptions, adding statistical analyses, and providing explicit rules and criteria. These changes strengthen the transparency and support for our claims on generalization limits without altering the core findings.
read point-by-point responses
-
Referee: [§3] §3 (Dataset Curation): The alignment procedure between experimental bandgaps and crystal structures is not described with sufficient detail (e.g., no explicit criteria for structure matching, tolerance thresholds, or handling of multiple measurements per composition), preventing verification that the dataset avoids selection bias and is representative of real-world semiconductor distributions as required for the 'fundamental limitations' claim.
Authors: We appreciate this feedback highlighting the need for greater transparency. In the revised manuscript, §3 now includes explicit criteria: structures are matched using identical composition and a StructureMatcher tolerance of 0.1 Å for site positions and 1% for lattice parameters. Multiple bandgap measurements per composition are averaged when the range is ≤0.3 eV (with variance reported); otherwise, the most recent measurement is retained after outlier screening. A new supplementary flowchart details the full curation pipeline, and we added a discussion of coverage relative to ICSD and experimental literature to address representativeness and selection bias. These revisions enable verification while preserving the dataset's alignment with real-world conditions. revision: yes
-
Referee: [§4.2] §4.2 (Domain-based splits): The rules for constructing domain-based splits (e.g., by material class, space group, or elemental composition) are not specified, so it is impossible to confirm that performance drops reflect inherent model generalization failures rather than artifacts of how the splits were defined or potential data leakage.
Authors: We thank the referee for noting this omission. The revised §4.2 now specifies the exact construction rules: splits are formed by (1) material class (oxides, halides, chalcogenides, etc., with 70/30 train/test per class), (2) space-group families (grouped by symmetry class), and (3) elemental composition (leave-one-element-out for 10 key elements). All splits enforce zero composition overlap between train and test sets to eliminate leakage. We also report results across three alternative split variants in the supplement, all yielding comparable performance drops, indicating the observed generalization failures are not artifacts of a single split definition. revision: yes
-
Referee: [§5] §5 (Results and Transfer Learning): The reported gaps between DFT-to-experimental transfer and in-domain performance are presented without accompanying statistical significance tests, confidence intervals on metrics, or ablation on dataset size/composition, weakening the assertion that these gaps demonstrate fundamental rather than dataset-specific limitations.
Authors: We agree that additional statistical rigor strengthens the interpretation. The revised §5 now reports 95% bootstrap confidence intervals (1,000 resamples) for all metrics and includes paired Wilcoxon signed-rank tests confirming that DFT-to-experimental transfer gaps are statistically significant (p < 0.01) relative to in-domain performance. We have added an ablation study in the supplementary material that varies training-set size (50–100%) and composition balance; the gaps remain stable across these conditions. These additions support that the limitations are fundamental rather than specific to the current dataset scale or makeup. revision: yes
Circularity Check
No circularity: empirical benchmark with no derivations or self-referential reductions
full rationale
This is a comparative benchmark study that curates an experimental bandgap dataset, applies statistical and domain splits, and reports model performance metrics for GNNs and baselines. No equations, derivations, fitted parameters renamed as predictions, or uniqueness theorems appear in the provided text. The central claims rest on observed generalization gaps between DFT and experimental data, which are presented as empirical findings rather than reductions to inputs by construction. Self-citations, if present, are not load-bearing for any mathematical premise; the work is self-contained as an evaluation framework whose results can be reproduced or falsified externally via the released dataset and splits.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Experimental bandgaps can be reliably paired with corresponding crystal structures from databases without significant alignment errors.
invented entities (1)
-
RealMat-BaG
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Wide bandgap semiconductor materials and devices.IEEE Transactions on Electron Devices43, 1633–1636 (1996)
Yoder, M. Wide bandgap semiconductor materials and devices.IEEE Transactions on Electron Devices43, 1633–1636 (1996). 2.Ueno, K.et al.Field-effect transistor based on KTaO3 perovskite.Appl. Phys. Lett.84, 3726–3728 (2004)
1996
-
[2]
C., Penn, R
Lisensky, G. C., Penn, R. L., Geselbracht, M. J. & Ellis, A. B. Periodic properties in a family of common semiconductors: Experiments with light emitting diodes.J. Chem. Educ.69, 151 (1992)
1992
-
[3]
Multiple band gap semiconductor/electrolyte solar energy conversion.J
Licht, S. Multiple band gap semiconductor/electrolyte solar energy conversion.J. Phys. Chem. B105, 6281–6294 (2001). 5.Dey, P.et al.Informatics-aided bandgap engineering for solar materials.Comput. Mater. Sci.83, 185–195 (2014)
2001
-
[4]
M., Tahir-Kheli, J
Crowley, J. M., Tahir-Kheli, J. & Goddard, W. A. I. Resolution of the band gap prediction problem for materials design. The J. Phys. Chem. Lett.7, 1198–1203 (2016)
2016
-
[5]
Perdew, J. P. & Levy, M. Physical content of the exact Kohn-Sham orbital energies: Band gaps and derivative discontinuities. Phys. Rev. Lett.51, 1884–1887 (1983)
1983
-
[6]
W., Marques, M
Borlido, P., Schmidt, J., Huran, A. W., Marques, M. A. L. & Botti, S. Exchange-correlation functionals for band gaps of solids: benchmark, reparametrization and machine learning.npj Comput. Mater.6, 96 (2020)
2020
-
[7]
& Pasquarello, A
Yang, J., Falletta, S. & Pasquarello, A. Range-separated hybrid functionals for accurate prediction of band gaps of extended systems.npj Comput. Mater.9, 108 (2023). 10.Reiser, P.et al.Graph neural networks for materials science and chemistry.Commun. Mater.3, 93 (2022)
2023
-
[8]
& Grossman, J
Xie, T. & Grossman, J. C. Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties.Phys. Rev. Lett.120, 145301 (2018)
2018
-
[9]
& Ong, S
Chen, C., Ye, W., Zuo, Y ., Zheng, C. & Ong, S. P. Graph networks as a universal machine learning framework for molecules and crystals.Chem. Mater.31, 3564–3572 (2019)
2019
-
[10]
& Ong, S
Chen, C., Zuo, Y ., Ye, W., Li, X. & Ong, S. P. Learning properties of ordered and disordered materials from multi-fidelity data.Nat. Comput. Sci.1, 46–53 (2021)
2021
-
[11]
& DeCost, B
Choudhary, K. & DeCost, B. Atomistic line graph neural network for improved materials property predictions.npj Comput. Mater.7, 185 (2021)
2021
-
[12]
& Rignanese, G.-M
De Breuck, P.-P., Hautier, G. & Rignanese, G.-M. Materials property prediction for limited datasets enabled by feature selection and joint learning with MODNet.npj Comput. Mater.7, 83 (2021)
2021
-
[13]
InThirty-seventh Conference on Neural Information Processing Systems(2023)
Du, W.et al.A new perspective on building efficient and expressive 3D equivariant graph neural networks. InThirty-seventh Conference on Neural Information Processing Systems(2023)
2023
-
[14]
Masood, H.et al.Enhancing prediction accuracy of physical band gaps in semiconductor materials.Cell Reports Phys. Sci. 4, 101555 (2023)
2023
-
[15]
In Proceedings of the AAAI Conference on Artificial Intelligence, vol
Das, K.et al.CrysGNN: Distilling pre-trained knowledge to enhance property prediction for crystalline materials. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, 7323–7331 (2023)
2023
-
[16]
Deng, B.et al.CHGNet as a pretrained universal neural network potential for charge-informed atomistic modelling.Nat. Mach. Intell.5, 1031–1041 (2023)
2023
-
[17]
Discov.4, 694–710 (2025)
Solé, À.et al.A Cartesian encoding graph neural network for crystal structure property prediction: Application to thermal ellipsoid estimation.Digit. Discov.4, 694–710 (2025)
2025
-
[18]
& Tarakanova, A
Madani, M., Lacivita, V ., Shin, Y . & Tarakanova, A. Accelerating materials property prediction via a hybrid transformer graph framework that leverages four body interactions.npj Comput. Mater.11, 15 (2025)
2025
-
[19]
& Jain, A
Dunn, A., Wang, Q., Ganose, A., Dopp, D. & Jain, A. Benchmarking materials property prediction methods: The Matbench test set and Automatminer reference algorithm.npj Comput. Mater.6, 138 (2020)
2020
-
[20]
& Sumpter, B
Fung, V ., Zhang, J., Juarez, E. & Sumpter, B. G. Benchmarking graph neural networks for materials chemistry.npj Comput. Mater.7, 84 (2021)
2021
-
[21]
Mater.10, 93 (2024)
Choudhary, K.et al.JARVIS-Leaderboard: A large scale benchmark of materials design methods.npj Comput. Mater.10, 93 (2024)
2024
-
[22]
S., Fu, N., Dong, R., Hu, M
Omee, S. S., Fu, N., Dong, R., Hu, M. & Hu, J. Structure-based out-of-distribution (OOD) materials property prediction: a benchmark study.npj Comput. Mater.10, 144 (2024). 13/35 26.Li, K.et al.Probing out-of-distribution generalization in machine learning for materials.Commun. Mater.6, 9 (2025)
2024
-
[23]
Commun.10, 5316 (2019)
Jha, D.et al.Enhancing materials property prediction by leveraging computational and experimental data using deep transfer learning.Nat. Commun.10, 5316 (2019)
2019
-
[24]
& Cole, J
Dong, Q. & Cole, J. M. Auto-generated database of semiconductor band gaps using ChemDataExtractor.Sci. Data9, 193 (2022)
2022
-
[25]
Meredig, B.et al.Can machine learning identify the next high-temperature superconductor? Examining extrapolation performance for materials discovery.Mol. Syst. Des. & Eng.3, 819–825 (2018). 30.Zhong, X.et al.Explainable machine learning in materials science.npj Comput. Mater.8, 204 (2022)
2018
-
[26]
& Butler, K
Oviedo, F., Lavista Ferres, J., Buonassisi, T. & Butler, K. T. Interpretable and explainable machine learning for materials science and chemistry.Accounts Mater. Res.3, 597–607 (2022)
2022
-
[27]
Li, X.et al.Interpretable deep learning: interpretation, interpretability, trustworthiness, and beyond.Knowl. Inf. Syst.64, 3197–3234 (2022)
2022
-
[28]
& Kotecha, K
Joshi, G., Walambe, R. & Kotecha, K. A review on explainability in multimodal deep neural nets.IEEE Access9, 59800–59821 (2021)
2021
-
[29]
& Yang, C
Gao, H., Guo, X.-W., Li, G., Li, C. & Yang, C. GCPNet: An interpretable generic crystal pattern graph neural network for predicting material properties.Neural Networks188, 107466 (2025)
2025
-
[30]
& Shan, G
Teng, Y ., Tan, H., Huang, W. & Shan, G. Atomic-level interpretable multimodal graph neural network for predicting carbon dioxide adsorption in metal-organic frameworks.Commun. Phys.8, 491 (2025)
2025
-
[31]
J., Singh, C., Kumbier, K., Abbasi-Asl, R
Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R. & Yu, B. Definitions, methods, and applications in interpretable machine learning.Proc. Natl. Acad. Sci.116, 22071–22080 (2019)
2019
-
[32]
P., Stanek, C
Talapatra, A., Uberuaga, B. P., Stanek, C. R. & Pilania, G. Band gap predictions of double perovskite oxides using machine learning.Commun. Mater.4, 46 (2023)
2023
-
[33]
& Naeimi-Sadigh, A
Sabagh Moeini, A., Shariatmadar Tehrani, F. & Naeimi-Sadigh, A. Machine learning-enhanced band gaps prediction for low-symmetry double and layered perovskites.Sci. Reports14, 26736 (2024)
2024
-
[34]
He, T.et al.Deep neural networks and kernel regression achieve comparable accuracies for functional connectivity prediction of behavior and demographics.NeuroImage206, 116276 (2020)
2020
-
[35]
A.et al.A reusable benchmark of brain-age prediction from M/EEG resting-state signals.NeuroImage262, 119521 (2022)
Engemann, D. A.et al.A reusable benchmark of brain-age prediction from M/EEG resting-state signals.NeuroImage262, 119521 (2022)
2022
-
[36]
& Alesanco, Á
Mehavilla, L., Rodríguez, M., García, J. & Alesanco, Á. Evaluating large language models effectiveness for flow-based intrusion detection: a comparative study with ML and DL baselines.Artif. Intell. Rev.59, 50 (2026)
2026
-
[37]
& Brgoch, J
Zhuo, Y ., Mansouri Tehrani, A. & Brgoch, J. Predicting the band gaps of inorganic solids by machine learning.The J. Phys. Chem. Lett.9, 1668–1673 (2018)
2018
-
[38]
Kumagai, M.et al.Effects of data bias on machine-learning–based material discovery using experimental property data. Sci. Technol. Adv. Materials: Methods2, 302–309 (2022)
2022
-
[39]
APL Mater.1, 011002 (2013)
Jain, A.et al.Commentary: The Materials Project: A materials genome approach to accelerating materials innovation. APL Mater.1, 011002 (2013). 45.Woods-Robinson, R.et al.Wide band gap chalcogenide semiconductors.Chem. Rev.120(2020)
2013
-
[40]
& Hong, J
Tang, G., Ghosez, P. & Hong, J. Band-edge orbital engineering of perovskite semiconductors for optoelectronic applications. The J. Phys. Chem. Lett.12, 4227–4239 (2021)
2021
-
[41]
Ye, K.et al.Low-energy electronic structure of perovskite and Ruddlesden-Popper semiconductors in the Ba-Zr-S system probed by bond-selective polarized x-ray absorption spectroscopy, infrared reflectivity, and raman scattering.Phys. Rev. B 105, 195203 (2022)
2022
-
[42]
N., Li, K., Hattrick-Simpers, J
Rubungo, A. N., Li, K., Hattrick-Simpers, J. & Dieng, A. B. LLM4Mat-Bench: Benchmarking large language models for materials property prediction. InAI for Accelerated Materials Design - NeurIPS 2024(2024)
2024
-
[43]
P.et al.Python materials genomics (pymatgen): A robust, open-source python library for materials analysis
Ong, S. P.et al.Python materials genomics (pymatgen): A robust, open-source python library for materials analysis. Comput. Mater. Sci.68, 314–319 (2013)
2013
-
[44]
Witman, M. D. & Schindler, P. MatFold: systematic insights into materials discovery models’ performance through standardized cross-validation protocols.Digit. Discov.4, 625–635 (2025). 14/35
2025
-
[45]
Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps
Simonyan, K., Vedaldi, A. & Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. InWorkshop Proceedings of the International Conference on Learning Representations (ICLR)(2014). ArXiv preprint arXiv:1312.6034
work page internal anchor Pith review arXiv 2014
-
[46]
E., Kolouri, S., Rostami, M., Martin, C
Pope, P. E., Kolouri, S., Rostami, M., Martin, C. E. & Hoffmann, H. Explainability methods for graph convolutional neural networks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10772–10781 (2019). Acknowledgements X.L. and H.L. disclose support for the research of this work from the EPSRC[UKRI396]. A.J.R discloses s...
2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.