deCIFer: Crystal Structure Prediction from Powder Diffraction Data using Autoregressive Language Models
Pith reviewed 2026-05-23 03:48 UTC · model grok-4.3
The pith
deCIFer, an autoregressive language model, generates crystal structures in CIF format from PXRD data and reaches a 94 percent structural match rate on synthetic test sets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
deCIFer is an autoregressive language model for PXRD-conditioned crystal structure prediction that directly outputs crystal structures in CIF format; after training on nearly 2.3 million structures with PXRD patterns augmented by Gaussian noise and peak broadening, the model reaches a 94 percent structural match rate on diverse synthetic datasets of challenging inorganic materials when assessed by R_wp and match-rate metrics.
What carries the argument
Autoregressive language model that generates CIF token sequences conditioned on PXRD patterns augmented with Gaussian noise and instrumental peak broadening.
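The two augmentations named here can be illustrated with a minimal sketch. It assumes an idealized "stick" pattern (peak positions and intensities) that is broadened with Gaussian profiles and then perturbed with additive Gaussian noise. The function name `augment_pxrd`, its parameters, and the default values are hypothetical and are not taken from the paper; the authors' actual peak shapes and parameter ranges may differ.

```python
import numpy as np

def augment_pxrd(two_theta, intensities, grid, fwhm=0.1, noise_std=0.01, rng=None):
    """Broaden stick peaks with Gaussians, then add Gaussian noise.

    All names and defaults are illustrative assumptions, not the paper's settings.
    """
    rng = np.random.default_rng() if rng is None else rng
    two_theta = np.asarray(two_theta, dtype=float)
    intensities = np.asarray(intensities, dtype=float)

    # Convert the full width at half maximum to a Gaussian standard deviation.
    sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))

    # Sum one Gaussian per reflection on the 2-theta grid (peak broadening).
    diff = grid[:, None] - two_theta[None, :]
    profile = (intensities[None, :] * np.exp(-0.5 * (diff / sigma) ** 2)).sum(axis=1)

    # Normalize so the strongest point of the clean profile equals 1.
    peak = profile.max()
    if peak > 0:
        profile = profile / peak

    # Additive Gaussian noise, clipped so intensities stay non-negative.
    noisy = profile + rng.normal(0.0, noise_std, size=profile.shape)
    return np.clip(noisy, 0.0, None)

grid = np.linspace(10.0, 80.0, 701)  # 0.1 degree steps
pattern = augment_pxrd([20.0, 45.0], [1.0, 0.5], grid, fwhm=0.5, noise_std=0.02,
                       rng=np.random.default_rng(0))
```

Sampling `fwhm` and `noise_std` from a range for each training example, rather than fixing them, would be one way to expose a model to varied conditions; whether and how deCIFer does this is not stated in this summary.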
If this is right
- Provides an alternative to composition- or symmetry-driven CSP methods by conditioning directly on diffraction data.
- Produces structures in the standard CIF format that can be used immediately in downstream materials workflows.
- Achieves its reported match rate using metrics chosen for practical relevance in the underdetermined PXRD problem.
- Sets an explicit baseline that future work can extend to more complex experimental conditions.
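For readers unfamiliar with the R_wp metric mentioned above, the sketch below uses the standard weighted-profile residual, R_wp = sqrt( sum_i w_i (y_obs,i - y_calc,i)^2 / sum_i w_i y_obs,i^2 ). The function name `r_wp` and the choice of unit weights as the default are assumptions made for illustration; the paper's exact weighting scheme is not given in this summary.

```python
import numpy as np

def r_wp(y_obs, y_calc, weights=None):
    """Weighted-profile residual between two patterns on the same grid.

    Unit weights are an assumed default; Rietveld practice often uses
    weights derived from counting statistics instead.
    """
    y_obs = np.asarray(y_obs, dtype=float)
    y_calc = np.asarray(y_calc, dtype=float)
    w = np.ones_like(y_obs) if weights is None else np.asarray(weights, dtype=float)

    denom = np.sum(w * y_obs ** 2)
    if denom == 0:
        raise ValueError("reference pattern has zero weighted intensity")
    return float(np.sqrt(np.sum(w * (y_obs - y_calc) ** 2) / denom))
```

A value of 0 means the two profiles coincide point by point, and larger values indicate a worse profile fit.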
Where Pith is reading between the lines
- Testing the trained model on actual laboratory PXRD measurements would show whether the limited synthetic augmentations suffice for real-world generalization.
- Extending the conditioning to include additional experimental effects such as preferred orientation or sample displacement could improve robustness without changing the core architecture.
- Pairing deCIFer outputs with rapid Rietveld refinement might reduce the number of candidate structures that need manual inspection.
Load-bearing premise
Augmenting PXRD conditioning with only Gaussian noise and instrumental peak broadening produces training signals representative enough for the model to generalize to real experimental powder diffraction data.
What would settle it
Apply deCIFer to a collection of real experimental PXRD patterns from structures whose ground-truth CIFs are already known and measure whether the structural match rate remains near 94 percent.
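The proposed check reduces to computing a match rate over (generated, ground-truth) pairs. The sketch below is a minimal version under stated assumptions: `match_rate` and the toy comparator `cells_close` are hypothetical names, and comparing only lattice parameters is a stand-in for a real structure comparison. In practice a comparator such as pymatgen's `StructureMatcher.fit` could be passed in instead; the tolerances the paper uses are not given here.

```python
from typing import Callable, Iterable, Tuple, TypeVar

S = TypeVar("S")

def match_rate(pairs: Iterable[Tuple[S, S]], is_match: Callable[[S, S], bool]) -> float:
    """Fraction of (generated, reference) pairs that the comparator accepts."""
    results = [bool(is_match(generated, reference)) for generated, reference in pairs]
    if not results:
        raise ValueError("no structure pairs were provided")
    return sum(results) / len(results)

def cells_close(generated, reference, rel_tol=0.05):
    """Toy comparator (assumption): lattice lengths agree within rel_tol."""
    return all(abs(g - r) <= rel_tol * r for g, r in zip(generated, reference))

# Hypothetical lattice lengths (a, b, c) in angstroms, for illustration only.
pairs = [
    ((4.00, 4.00, 4.00), (4.01, 4.01, 4.01)),
    ((5.00, 5.00, 5.00), (5.60, 5.60, 5.60)),
    ((3.90, 3.90, 6.20), (3.92, 3.92, 6.25)),
    ((7.10, 7.10, 7.10), (7.10, 7.10, 7.10)),
]
rate = match_rate(pairs, cells_close)
```

Running the same function on synthetic and on experimental patterns would make any drop in match rate directly comparable.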
Original abstract
Novel materials drive advancements in fields ranging from energy storage to electronics, with crystal structure characterization forming a crucial yet challenging step in materials discovery. In this work, we introduce deCIFer, an autoregressive language model designed for powder X-ray diffraction (PXRD)-conditioned crystal structure prediction (PXRD-CSP). Unlike traditional CSP methods that rely primarily on composition or symmetry constraints, deCIFer explicitly incorporates PXRD data, directly generating crystal structures in the widely adopted Crystallographic Information File (CIF) format. The model is trained on nearly 2.3 million crystal structures, with PXRD conditioning augmented by basic forms of synthetic experimental artifacts, specifically Gaussian noise and instrumental peak broadening, to reflect fundamental real-world conditions. Validated across diverse synthetic datasets representative of challenging inorganic materials, deCIFer achieves a 94% structural match rate. The evaluation is based on metrics such as the residual weighted profile (R_wp) and structural match rate (MR), chosen explicitly for their practical relevance in this inherently underdetermined problem. deCIFer establishes a robust baseline for future expansion toward more complex experimental scenarios, bridging the gap between computational predictions and experimental crystal structure determination.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces deCIFer, an autoregressive language model for PXRD-conditioned crystal structure prediction that directly generates CIF files. Trained on ~2.3 million structures with PXRD patterns augmented by Gaussian noise and peak broadening, it reports a 94% structural match rate (MR) on diverse synthetic test sets of inorganic materials, using R_wp and MR as primary metrics. The work positions the model as a baseline for future experimental PXRD scenarios.
Significance. A working autoregressive CIF generator conditioned on PXRD would address an important inverse problem in materials science. The training scale and choice of practically relevant metrics (R_wp, MR) are positive features. However, because all reported results use only synthetic data with limited augmentations, the practical significance for real experimental data remains unproven.
major comments (2)
- [Abstract] Abstract: the central performance claim of a 94% structural match rate is shown exclusively on synthetic PXRD patterns that incorporate only Gaussian noise and instrumental peak broadening. No experiments on measured laboratory diffractograms are reported, which directly affects the claim that the method 'bridges the gap between computational predictions and experimental crystal structure determination.'
- [Abstract] Abstract (training description paragraph): the assumption that the chosen augmentations suffice to simulate the underdetermined inverse problem of real PXRD is load-bearing for generalization claims, yet the manuscript provides no ablation or sensitivity analysis on additional common experimental effects such as preferred orientation, sample displacement, or impurity phases.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below, with revisions where appropriate to better reflect the scope of the work as a synthetic-data baseline.
- Referee: [Abstract] Abstract: the central performance claim of a 94% structural match rate is shown exclusively on synthetic PXRD patterns that incorporate only Gaussian noise and instrumental peak broadening. No experiments on measured laboratory diffractograms are reported, which directly affects the claim that the method 'bridges the gap between computational predictions and experimental crystal structure determination.'
Authors: We agree that all quantitative results, including the 94% structural match rate, are obtained exclusively on synthetic PXRD patterns with Gaussian noise and peak broadening. The manuscript already describes deCIFer as establishing a baseline for future expansion toward experimental scenarios. To address the concern, we will revise the abstract to explicitly state that current performance is demonstrated on synthetic data and that bridging to experimental crystal structure determination remains a direction for future work rather than a completed achievement. revision: yes
- Referee: [Abstract] Abstract (training description paragraph): the assumption that the chosen augmentations suffice to simulate the underdetermined inverse problem of real PXRD is load-bearing for generalization claims, yet the manuscript provides no ablation or sensitivity analysis on additional common experimental effects such as preferred orientation, sample displacement, or impurity phases.
Authors: The chosen augmentations capture core aspects of experimental PXRD (noise and broadening) to create a practical baseline. We acknowledge that no ablation studies on additional effects such as preferred orientation, sample displacement, or impurity phases are included. As the work is framed as an initial baseline rather than a comprehensive simulation of all experimental artifacts, we do not view exhaustive ablations as necessary for the present contribution. We will add a short clarifying statement in the abstract and discussion sections noting these unmodeled effects as important avenues for future research. revision: partial
- Results on measured laboratory diffractograms cannot be provided, as the current study contains only synthetic data experiments.
Circularity Check
No significant circularity detected
The paper trains an autoregressive LM on 2.3M crystal structures with synthetic PXRD augmentations (Gaussian noise + peak broadening) and reports an empirical 94% structural match rate on held-out synthetic validation sets using standard metrics (R_wp, MR). No derivation chain, equation, or claim reduces the reported performance to a fitted input, self-citation, or definition by construction. The result is an independent empirical evaluation on generated vs. ground-truth structures and remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- autoregressive LM hyperparameters
- synthetic artifact parameters
axioms (1)
- domain assumption: The collection of 2.3 million crystal structures is representative of the distribution of inorganic materials of interest.