Substitution-Based Analysis of Structural Novelty for Generative Models of Materials
Pith reviewed 2026-06-26 08:51 UTC · model grok-4.3
The pith
A workflow shows that 81-92% of crystals from generative models are either training duplicates or derived by elemental substitution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The developed workflow classifies generated crystals and reveals that 81-92% of chemically valid and metastable generated crystals are either training duplicates or substitution-derived structures. This tendency is particularly strong in high-symmetry crystal systems. Low-symmetry structures beyond duplication or substitution can be interpreted as interpolation in training-data-rich regions, while high-symmetry duplicates appear to result from memorisation in training-sparse regions.
What carries the argument
The substitution-based analysis workflow that identifies training duplicates, substitution-derived structures, or unmatched novel crystals.
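A minimal sketch of the three-way classification, assuming each crystal has already been reduced to a hypothetical (prototype label, element tuple) pair. The paper's workflow matches structures through fingerprints; the `Crystal` type, the `classify` function, and the toy data below are illustrative assumptions, not the authors' implementation.

```python
from typing import NamedTuple

class Crystal(NamedTuple):
    prototype: str   # hypothetical structure-type label, e.g. "rocksalt"
    elements: tuple  # elements occupying the prototype's sites, in site order

def classify(generated, training):
    """Return 'duplicate', 'substitution-derived', or 'unmatched'."""
    # Same prototype and same element assignment: already in the training set.
    if generated in training:
        return "duplicate"
    # Same prototype with different elements: reachable by elemental substitution.
    known_prototypes = {c.prototype for c in training}
    if generated.prototype in known_prototypes:
        return "substitution-derived"
    # Prototype not seen in training: matched by neither criterion.
    return "unmatched"

# Toy data for illustration only.
training = {
    Crystal("rocksalt", ("Na", "Cl")),
    Crystal("perovskite", ("Ca", "Ti", "O")),
}
generated = [
    Crystal("rocksalt", ("Na", "Cl")),
    Crystal("rocksalt", ("K", "Br")),
    Crystal("hypothetical-new", ("Li", "S")),
]
labels = [classify(c, training) for c in generated]
print(labels)
```

The duplicate check runs first so that a training structure is never counted as a substitution of itself.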
If this is right
- Generative models exhibit a bias towards known structural prototypes, especially in high-symmetry regions.
- Models enable wider exploration of low-symmetry structural space through interpolation.
- The current generation of models is limited in its ability to expand beyond conventional strategies.
- Many possible structural prototypes remain unexplored by the models.
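The symmetry dependence listed above could be quantified with a per-crystal-system tally. The sketch below assumes each generated crystal already carries a crystal-system name and one of the three workflow labels; the function name and the records are made up for illustration and are not the paper's data.

```python
from collections import Counter, defaultdict

def known_fraction_by_system(records):
    """records: iterable of (crystal_system, label) pairs.

    Returns, per crystal system, the share of structures labelled
    'duplicate' or 'substitution-derived'.
    """
    counts = defaultdict(Counter)
    for system, label in records:
        counts[system][label] += 1
    return {
        system: (c["duplicate"] + c["substitution-derived"]) / sum(c.values())
        for system, c in counts.items()
    }

# Made-up labels for illustration only.
records = [
    ("cubic", "duplicate"),
    ("cubic", "substitution-derived"),
    ("cubic", "substitution-derived"),
    ("cubic", "duplicate"),
    ("triclinic", "unmatched"),
    ("triclinic", "substitution-derived"),
]
fractions = known_fraction_by_system(records)
print(fractions)
```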
Where Pith is reading between the lines
- Improving novelty would require techniques that penalize substitution patterns or encourage exploration in sparse high-symmetry areas.
- Similar analysis could be applied to other generative tasks such as molecule design to check for analogical novelty.
- The workflow provides a quantitative metric that future model papers could report to demonstrate true novelty.
Load-bearing premise
The rules used to detect substitution-derived structures accurately capture all possible elemental substitutions without misclassifying genuinely new structures.
What would settle it
A generated crystal structure that the workflow labels as unmatched but can be shown to be obtainable by elemental substitution from a training structure would falsify the novelty assessment for that example.
original abstract
There has been rapid progress in generative artificial intelligence (AI) models for inorganic crystal design, which can efficiently generate large numbers of candidate compounds after being trained on databases of known crystals. However, it remains unclear whether they genuinely expand the accessible materials search space beyond conventional strategies such as elemental substitution within known structure types. We address this question by developing a workflow to assess whether AI-generated crystals are duplicates of training structures, reproducible by elemental substitution, or unmatched by either criterion. Applying this workflow to representative generative models reveals that 81-92% of chemically valid and metastable generated crystals are either training duplicates or substitution-derived structures. This tendency is particularly strong in high-symmetry crystal systems, even though many possible structural prototypes remain unexplored. Further analysis of the underlying structural fingerprints shows that low-symmetry structures beyond duplication or substitution can be interpreted as interpolation in training-data-rich regions, while high-symmetry duplicates appear to result from memorisation in training-sparse regions. Our findings highlight a limitation in the current generation of models that exhibit a bias towards known structural prototypes in the high symmetry regions, but enable wider exploration of the low-symmetry structural space.
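One way to picture the data-rich versus data-sparse distinction is to measure how crowded the training set is around a generated structure in fingerprint space. The sketch below is an assumption-laden illustration: the two-dimensional fingerprints, the Euclidean metric, the `radius` value, and the function name `local_context` are all hypothetical, since this summary does not specify the paper's fingerprint or distance measure.

```python
import math

def local_context(fp, training_fps, radius):
    """Return (distance to nearest training fingerprint,
    number of training fingerprints within `radius`)."""
    dists = [math.dist(fp, t) for t in training_fps]
    return min(dists), sum(d <= radius for d in dists)

# Toy fingerprints: a dense cluster near the origin and one isolated point.
training_fps = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]

# A generated point between several training points (interpolation-like).
nearest_dense, n_dense = local_context((0.05, 0.05), training_fps, radius=0.5)

# A generated point that coincides with an isolated training point
# (memorisation-like: zero distance, few neighbours).
nearest_sparse, n_sparse = local_context((5.0, 5.0), training_fps, radius=0.5)

print(nearest_dense, n_dense)
print(nearest_sparse, n_sparse)
```

In this toy picture, a small but non-zero nearest distance combined with many neighbours suggests interpolation, whereas a zero distance with few neighbours suggests memorisation.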
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops a workflow using structure-type matching via fingerprints and elemental substitution enumeration within known prototypes to classify AI-generated crystals as training duplicates, substitution-derived, or unmatched. Application to representative generative models shows that 81-92% of chemically valid and metastable outputs are duplicates or substitution-derived, with the tendency stronger in high-symmetry systems; low-symmetry unmatched structures are interpreted as interpolations in data-rich regions and high-symmetry duplicates as memorization in sparse regions.
Significance. If the workflow holds, the result is significant because it provides a quantitative, falsifiable measure showing that current generative models for inorganic crystals largely reproduce known structural prototypes rather than expanding the search space, particularly in high-symmetry regimes. The >95% agreement on a held-out validation set in §3.2 and the explicit non-circular definition of the tests strengthen the assessment and offer a practical tool for future model evaluation.
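An agreement figure like the one cited here can be computed by comparing workflow labels with manually curated labels. The sketch below is hypothetical: the function name, the choice of "substitution-derived" as the positive class, and the four-item example are assumptions for illustration, not the paper's validation set.

```python
def agreement_report(predicted, curated, positive="substitution-derived"):
    """Compare workflow labels with curated labels of equal length."""
    pairs = list(zip(predicted, curated))
    agreement = sum(p == c for p, c in pairs) / len(pairs)
    # False positive: workflow says substitution-derived, curator disagrees.
    false_pos = sum(p == positive and c != positive for p, c in pairs)
    # False negative: curator says substitution-derived, workflow disagrees.
    false_neg = sum(p != positive and c == positive for p, c in pairs)
    return {
        "agreement": agreement,
        "false_positives": false_pos,
        "false_negatives": false_neg,
    }

# Made-up labels for illustration only.
predicted = ["duplicate", "substitution-derived", "substitution-derived", "unmatched"]
curated = ["duplicate", "substitution-derived", "unmatched", "unmatched"]
report = agreement_report(predicted, curated)
print(report)
```

Reporting the two error counts separately matters because a false positive understates novelty while a false negative overstates it.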
minor comments (3)
- [§3.2] The held-out validation set size, composition, and exact false-positive/false-negative breakdown for the substitution rules should be reported to allow readers to judge whether the >95% agreement is sufficient to bound the uncertainty in the 81-92% range.
- [Methods] The precise value and justification of the metastability energy cutoff (listed as the sole free parameter) should be stated explicitly, together with a sensitivity check showing how the reported percentages change when the cutoff is varied by ±10 meV/atom.
- The abstract and §4 could clarify which specific generative models were tested and how many structures per model entered the 81-92% statistic, to make the central numerical claim immediately reproducible from the text.
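The sensitivity check requested in the comments above could take the following shape. This is a sketch under stated assumptions: the base cutoff of 100 meV/atom is a placeholder (the paper's value is not given in this summary), energies are taken to be energies above the convex hull in meV/atom, and the records are invented for illustration.

```python
def known_fraction(records, cutoff_mev):
    """records: (energy_above_hull_meV_per_atom, label) pairs.

    Returns the share of structures at or below the cutoff that are
    duplicates or substitution-derived, or None if nothing passes.
    """
    kept = [label for e_hull, label in records if e_hull <= cutoff_mev]
    if not kept:
        return None
    known = sum(label != "unmatched" for label in kept)
    return known / len(kept)

# Made-up values for illustration only.
records = [
    (0, "duplicate"),
    (30, "substitution-derived"),
    (80, "unmatched"),
    (95, "substitution-derived"),
    (105, "unmatched"),
]

base_cutoff = 100  # hypothetical placeholder, in meV/atom
sweep = {
    base_cutoff + delta: known_fraction(records, base_cutoff + delta)
    for delta in (-10, 0, 10)
}
print(sweep)
```

If the fraction moves by several percentage points across the sweep, the width of the reported range may partly reflect the cutoff choice rather than differences between models.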
Simulated Author's Rebuttal
We thank the referee for the positive and constructive assessment of our manuscript, including the recognition of the workflow's quantitative value and the >95% validation agreement. We note the recommendation for minor revision and will prepare an updated version accordingly.
Circularity Check
No significant circularity
full rationale
The paper defines an explicit workflow for identifying training duplicates (via structure matching) and substitution-derived crystals (via elemental substitution enumeration within prototypes), validates the workflow on a held-out manually curated set showing >95% agreement, and reports the 81-92% statistic as a direct count from applying this workflow to external generative model outputs. No equations or quantities are defined in terms of the reported percentages; the detection rules are not fitted to the target result; no load-bearing self-citations or uniqueness theorems from the authors are invoked; and the central claim remains an empirical measurement against an independent benchmark rather than a self-referential derivation.
Axiom & Free-Parameter Ledger
free parameters (1)
- metastability energy cutoff
axioms (1)
- domain assumption: Elemental substitution within known structure types does not constitute structural novelty.
Acknowledgements
This work was supported by the AIchemy Hub through EPSRC grants EP/Y028775/1 and EP/Y028759/1, and by an Imperial College President’s PhD Scholarship. We acknowledge the EuroHPC Joint Undertaking for providin...