arxiv: 2606.23166 · v1 · pith:J3VIOZE2new · submitted 2026-06-22 · 💻 cs.LG · cond-mat.mtrl-sci

Substitution-Based Analysis of Structural Novelty for Generative Models of Materials

Pith reviewed 2026-06-26 08:51 UTC · model grok-4.3

classification 💻 cs.LG cond-mat.mtrl-sci
keywords generative models · crystal design · structural novelty · elemental substitution · materials discovery · AI-generated structures · duplication detection

The pith

A workflow shows that 81-92% of crystals from generative models are either training duplicates or derived by elemental substitution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a workflow to determine if AI-generated inorganic crystals are duplicates of training data, can be obtained by elemental substitution within known structure types, or are unmatched by either. When applied to several representative generative models, it finds that the vast majority of chemically valid and metastable outputs fall into the first two categories. This indicates that current models do not expand the materials space much beyond conventional substitution strategies. Analysis of structural fingerprints further suggests that low-symmetry novel structures arise from interpolation in data-rich areas, while high-symmetry duplicates come from memorization in sparse regions.

Core claim

The developed workflow classifies generated crystals and reveals that 81-92% of chemically valid and metastable generated crystals are either training duplicates or substitution-derived structures. This tendency is particularly strong in high-symmetry crystal systems. Low-symmetry structures beyond duplication or substitution can be interpreted as interpolation in training-data-rich regions, while high-symmetry duplicates appear to result from memorisation in training-sparse regions.
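The symmetry dependence of this claim can be checked by grouping labelled outputs by crystal system. The snippet below is a minimal sketch on made-up data: the `labelled` rows, the label names, and the helper `known_fraction_by_system` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Hypothetical labelled outputs: (crystal_system, workflow_label).
# These rows are illustrative only, not results from the paper.
labelled = [
    ("cubic", "duplicate"),
    ("cubic", "substitution"),
    ("cubic", "duplicate"),
    ("hexagonal", "substitution"),
    ("triclinic", "unmatched"),
    ("triclinic", "substitution"),
    ("monoclinic", "unmatched"),
    ("monoclinic", "unmatched"),
]

def known_fraction_by_system(rows):
    """Fraction of crystals per system that are duplicates or substitution-derived."""
    totals = defaultdict(int)
    known = defaultdict(int)
    for system, label in rows:
        totals[system] += 1
        if label in ("duplicate", "substitution"):
            known[system] += 1
    return {system: known[system] / totals[system] for system in totals}

fractions = known_fraction_by_system(labelled)
for system, fraction in sorted(fractions.items()):
    print(f"{system:>10}: {fraction:.2f}")
```

A per-system table of this kind makes it easy to see whether the duplicate-or-substitution share is concentrated in high-symmetry systems, as the paper reports.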

What carries the argument

The substitution-based analysis workflow that identifies training duplicates, substitution-derived structures, or unmatched novel crystals.
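A toy version of the three-way decision may help fix ideas. It assumes each crystal has already been reduced to a prototype label plus an ordered tuple of elements; the paper works with structure matching and fingerprints instead, so the representation, the `TRAINING` set, and the `classify` helper here are all hypothetical simplifications.

```python
from collections import Counter

# Hypothetical toy representation: a crystal is (prototype_label, elements).
# prototype_label stands in for a structure-type match; elements lists the
# species occupying the prototype's sites in a fixed order.
TRAINING = [
    ("rocksalt", ("Na", "Cl")),
    ("rocksalt", ("Mg", "O")),
    ("perovskite", ("Sr", "Ti", "O")),
]

def classify(crystal, training):
    """Return 'duplicate', 'substitution', or 'unmatched' for one crystal."""
    prototype, elements = crystal
    same_prototype = [els for proto, els in training if proto == prototype]
    if elements in same_prototype:
        return "duplicate"      # same structure type and same species
    if same_prototype:
        return "substitution"   # known structure type, different species
    return "unmatched"          # no training structure of this type

generated = [
    ("rocksalt", ("Na", "Cl")),         # already in the training set
    ("rocksalt", ("K", "Br")),          # rocksalt with swapped elements
    ("perovskite", ("Ba", "Zr", "O")),  # perovskite with swapped elements
    ("novel_P1", ("Li", "Fe", "S")),    # no matching prototype
]

labels = [classify(c, TRAINING) for c in generated]
counts = Counter(labels)
known_fraction = (counts["duplicate"] + counts["substitution"]) / len(generated)
print(labels, known_fraction)
```

The quantity `known_fraction` plays the role of the headline 81-92% statistic, computed here on four invented structures.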

If this is right

  • Generative models are biased towards known structural prototypes, especially in high-symmetry regions.
  • The models enable wider exploration of the low-symmetry structural space, which the paper interprets as interpolation in training-data-rich regions.
  • The current generation of models is limited in expanding the search space beyond conventional strategies such as elemental substitution.
  • Many possible structural prototypes remain unexplored by the models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Improving novelty would require techniques that penalize substitution patterns or encourage exploration in sparse high-symmetry areas.
  • Similar analysis could be applied to other generative tasks such as molecule design to check for analogical novelty.
  • The workflow provides a quantitative metric that future model papers could report to demonstrate true novelty.

Load-bearing premise

The rules used to detect substitution-derived structures accurately capture all possible elemental substitutions without misclassifying genuinely new structures.

What would settle it

A generated crystal that the workflow labels as unmatched, but that can be shown to be obtainable by elemental substitution from a training structure, would falsify the novelty assessment for that example.

Original abstract

There has been rapid progress in generative artificial intelligence (AI) models for inorganic crystal design, which can efficiently generate large numbers of candidate compounds after being trained on databases of known crystals. However, it remains unclear whether they genuinely expand the accessible materials search space beyond conventional strategies such as elemental substitution within known structure types. We address this question by developing a workflow to assess whether AI-generated crystals are duplicates of training structures, reproducible by elemental substitution, or unmatched by either criterion. Applying this workflow to representative generative models reveals that 81-92% of chemically valid and metastable generated crystals are either training duplicates or substitution-derived structures. This tendency is particularly strong in high-symmetry crystal systems, even though many possible structural prototypes remain unexplored. Further analysis of the underlying structural fingerprints shows that low-symmetry structures beyond duplication or substitution can be interpreted as interpolation in training-data-rich regions, while high-symmetry duplicates appear to result from memorisation in training-sparse regions. Our findings highlight a limitation in the current generation of models that exhibit a bias towards known structural prototypes in the high symmetry regions, but enable wider exploration of the low-symmetry structural space.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript develops a workflow using structure-type matching via fingerprints and elemental substitution enumeration within known prototypes to classify AI-generated crystals as training duplicates, substitution-derived, or unmatched. Application to representative generative models shows that 81-92% of chemically valid and metastable outputs are duplicates or substitution-derived, with the tendency stronger in high-symmetry systems; low-symmetry unmatched structures are interpreted as interpolations in data-rich regions and high-symmetry duplicates as memorization in sparse regions.

Significance. If the workflow holds, the result is significant because it provides a quantitative, falsifiable measure showing that current generative models for inorganic crystals largely reproduce known structural prototypes rather than expanding the search space, particularly in high-symmetry regimes. The >95% agreement on a held-out validation set in §3.2 and the explicit non-circular definition of the tests strengthen the assessment and offer a practical tool for future model evaluation.

minor comments (3)
  1. [§3.2] The held-out validation set size, composition, and exact false-positive/false-negative breakdown for the substitution rules should be reported to allow readers to judge whether the >95% agreement is sufficient to bound the uncertainty in the 81-92% range.
  2. [Methods] The precise value and justification of the metastability energy cutoff (listed as the sole free parameter) should be stated explicitly, together with a sensitivity check showing how the reported percentages change when the cutoff is varied by ±10 meV/atom.
  3. [Abstract, §4] The abstract and §4 could clarify which specific generative models were tested and how many structures per model entered the 81-92% statistic, to make the central numerical claim immediately reproducible from the text.
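The sensitivity check requested in comment 2 could look roughly like the sketch below. The base cutoff of 100 meV/atom is an assumption made only for illustration (the cutoff value is not stated here), and the `records` are invented energies and labels, not data from the paper.

```python
# Hypothetical records: (energy_above_hull in meV/atom, workflow_label).
records = [
    (0.0, "duplicate"),
    (35.0, "substitution"),
    (85.0, "substitution"),
    (95.0, "unmatched"),
    (105.0, "unmatched"),
    (108.0, "substitution"),
]

def known_fraction(rows, cutoff_mev):
    """Share of crystals with E_hull <= cutoff that are duplicates or substitutions."""
    kept = [label for e_hull, label in rows if e_hull <= cutoff_mev]
    if not kept:
        return float("nan")
    known = sum(label != "unmatched" for label in kept)
    return known / len(kept)

BASE_CUTOFF = 100.0  # assumed value for illustration; not the paper's cutoff
sweep = {
    cutoff: known_fraction(records, cutoff)
    for cutoff in (BASE_CUTOFF - 10, BASE_CUTOFF, BASE_CUTOFF + 10)
}
for cutoff, fraction in sweep.items():
    print(f"cutoff {cutoff:5.1f} meV/atom -> known fraction {fraction:.3f}")
```

If the reported percentages moved by only a point or two across such a sweep, the 81-92% range could be read as robust to the choice of cutoff.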

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and constructive assessment of our manuscript, including the recognition of the workflow's quantitative value and the >95% validation agreement. We note the recommendation for minor revision and will prepare an updated version accordingly.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper defines an explicit workflow for identifying training duplicates (via structure matching) and substitution-derived crystals (via elemental substitution enumeration within prototypes), validates the workflow on a held-out manually curated set showing >95% agreement, and reports the 81-92% statistic as a direct count from applying this workflow to external generative model outputs. No equations or quantities are defined in terms of the reported percentages; the detection rules are not fitted to the target result; no load-bearing self-citations or uniqueness theorems from the authors are invoked; and the central claim remains an empirical measurement against an independent benchmark rather than a self-referential derivation.
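The held-out agreement check described above, together with the false-positive and false-negative breakdown the referee asks for, can be sketched as follows. The `pairs` list and the helper `substitution_errors` are hypothetical; the actual validation set and its size are not given here.

```python
# Hypothetical held-out set: (curated_label, workflow_label) pairs.
pairs = [
    ("substitution", "substitution"),
    ("substitution", "substitution"),
    ("substitution", "unmatched"),   # missed substitution (false negative)
    ("unmatched", "unmatched"),
    ("unmatched", "substitution"),   # spurious substitution (false positive)
    ("duplicate", "duplicate"),
]

def substitution_errors(rows):
    """Overall agreement plus error counts for the 'substitution' label."""
    agree = sum(truth == pred for truth, pred in rows)
    false_pos = sum(
        pred == "substitution" and truth != "substitution" for truth, pred in rows
    )
    false_neg = sum(
        truth == "substitution" and pred != "substitution" for truth, pred in rows
    )
    return {
        "agreement": agree / len(rows),
        "false_positives": false_pos,
        "false_negatives": false_neg,
    }

report = substitution_errors(pairs)
print(report)
```

False negatives matter most for the central claim, because a missed substitution inflates the count of unmatched, apparently novel structures.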

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central percentages rest on the assumption that the substitution workflow is both complete and unbiased, plus the representativeness of the chosen generative models and the definition of chemical validity and metastability.

free parameters (1)
  • metastability energy cutoff
    Used to filter the generated crystals counted in the 81-92% statistic; value not stated in abstract.
axioms (1)
  • domain assumption: Elemental substitution within known structure types does not constitute structural novelty.
    This premise defines the boundary between substitution-derived and unmatched structures.

pith-pipeline@v0.9.1-grok · 5731 in / 1253 out tokens · 33230 ms · 2026-06-26T08:51:27.599827+00:00 · methodology


Acknowledgements

This work was supported by the AIchemy Hub through EPSRC grants EP/Y028775/1 and EP/Y028759/1, and by an Imperial College President’s PhD Scholarship. We acknowledge the EuroHPC Joint Undertaking for providin...