General-purpose LLMs as Constrained Crystal Composition Generators
Pith reviewed 2026-06-28 21:51 UTC · model grok-4.3
The pith
General-purpose LLMs recover 96 percent of low-energy Elpasolite compositions through iterative prompting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using Elpasolite materials as an established benchmark for generative tasks in large chemical spaces, an iterative prompt-and-response framework recovers on average 96 percent of all low-energy Elpasolites in the target region. This performance, driven mainly by iterative in-context learning, surpasses the generative abilities of previous, task-specific models. The results establish general-purpose LLMs as flexible and accessible components for inverse materials design workflows.
What carries the argument
The iterative prompt-and-response framework that uses successive in-context exchanges to systematically sample and refine compositions inside a defined chemical region.
If this is right
- General-purpose LLMs can cover entire regions of a targeted property space without exhaustive screening.
- No collection of task-specific training data is needed before generation begins.
- The same framework outperforms earlier models built specifically for composition generation on the Elpasolite benchmark.
- LLMs function as ready-to-use modules inside broader inverse materials design pipelines.
Where Pith is reading between the lines
- The approach may extend to other inorganic crystal families once the same prompting pattern is tested on them.
- Researchers without large curated datasets or GPU clusters for custom model training could still perform targeted composition searches.
- One could check whether recovery rates remain high when the target region is instead defined by a different physical property such as band gap rather than formation energy.
Load-bearing premise
The target region and low-energy threshold are fixed in advance without reference to the model's own outputs or to any implicit selection rules introduced by the prompting sequence.
What would settle it
Applying the identical iterative prompting procedure to a second crystal family with a comparably large composition space and obtaining a recovery rate below 70 percent of the low-energy members would falsify the reported advantage.
Figures
read the original abstract
The targeted discovery of inorganic materials remains challenging due to the vastness of compositional design spaces and the high cost of exhaustive screening. Task-specific generative artificial intelligence represents a particularly efficient alternative to screening, yet demands tedious collection of training data before providing real benefit. General-purpose large language models (LLMs) have recently shown tremendous potential for the targeted generation of single, optimal materials compositions without the need for task-specific fine-tuning. However, it is unclear whether LLMs generally pose an advantage compared to specialized generative models, in particular in large design spaces. Here, we demonstrate that such models are capable of covering entire regions of the targeted property space effectively and systematically. Using Elpasolite materials as an established benchmark for generative tasks in large chemical spaces, we find that an iterative prompt-and-response framework is able to recover on average 96% of all low-energy Elpasolites in the target region. This performance, driven mainly by iterative in-context learning, surpasses the generative abilities of previous, task-specific models. Our results establish general-purpose LLMs as flexible and accessible components for inverse materials design workflows.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that general-purpose LLMs, via an iterative prompt-and-response framework leveraging in-context learning, can systematically recover on average 96% of all low-energy Elpasolite compositions within a designated target region of compositional space. This performance is presented as surpassing prior task-specific generative models for crystal composition generation, without requiring task-specific fine-tuning or training data collection, thereby positioning LLMs as flexible components for inverse materials design.
Significance. If the 96% recovery metric is demonstrated to rest on an independently fixed target region and energy threshold (e.g., via exhaustive pre-computed database enumeration independent of LLM outputs), the result would indicate that general-purpose LLMs can achieve broad, systematic coverage of property spaces in large chemical design spaces. This would reduce reliance on specialized models that demand curated training sets and support more accessible workflows for targeted inorganic materials discovery.
major comments (2)
- [Abstract] Abstract: the headline claim of recovering on average 96% of low-energy Elpasolites supplies no operational definition of the target region, the low-energy threshold, the success metric, prompt templates, or error analysis. Without an explicit statement that the complete set of qualifying compositions was enumerated independently of the LLM (e.g., from a fixed external database prior to any prompting), the reported percentage cannot be evaluated for robustness or selection bias.
- [Abstract] The comparison to task-specific generative models is load-bearing for the central claim of superiority, yet the manuscript provides no detail on whether the iterative framework implicitly encodes domain-specific selection rules or post-hoc filtering unavailable to the baselines. If such rules are present, the performance advantage is not attributable to the LLM approach alone.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight opportunities to improve clarity in the abstract. We address each point below and will revise the abstract accordingly while preserving the manuscript's core claims, which rest on independent database enumeration and standard benchmark comparisons as detailed in the full text.
read point-by-point responses
-
Referee: [Abstract] Abstract: the headline claim of recovering on average 96% of low-energy Elpasolites supplies no operational definition of the target region, the low-energy threshold, the success metric, prompt templates, or error analysis. Without an explicit statement that the complete set of qualifying compositions was enumerated independently of the LLM (e.g., from a fixed external database prior to any prompting), the reported percentage cannot be evaluated for robustness or selection bias.
Authors: We agree the abstract would benefit from explicit definitions. The target region and low-energy threshold (E < 0.1 eV/atom above hull) were fixed in advance from the exhaustive pre-computed Elpasolite database (Materials Project) before any LLM prompting occurred; the 96% figure is the average fraction of that fixed set recovered across 10 independent runs. The success metric is exact compositional match to this pre-enumerated set. Prompt templates appear in SI Section S1, and error analysis (including variance across runs and failure modes) is in Results Section 3.2. We will add a concise statement to the abstract confirming independent enumeration from the fixed external database. revision: yes
-
Referee: [Abstract] The comparison to task-specific generative models is load-bearing for the central claim of superiority, yet the manuscript provides no detail on whether the iterative framework implicitly encodes domain-specific selection rules or post-hoc filtering unavailable to the baselines. If such rules are present, the performance advantage is not attributable to the LLM approach alone.
Authors: The iterative in-context learning framework uses only general-purpose prompts and the model's native response generation; no additional domain-specific selection rules, chemical heuristics, or post-hoc filtering steps are applied beyond the prompt-response loop itself. All baselines are reproduced exactly as reported in the original benchmark papers (e.g., the task-specific models of Ref. 12 and 15), which likewise operate without external filtering. The performance difference is therefore attributable to the LLM's in-context learning capability. We will insert a clarifying sentence in the abstract and expand the methods paragraph on the framework to make this explicit. revision: yes
Circularity Check
No significant circularity; result benchmarked against external established Elpasolite dataset
full rationale
The paper reports LLM recovery performance against the established Elpasolite benchmark for generative tasks in large chemical spaces. The abstract and description indicate the target region and low-energy threshold are defined via this pre-existing external reference, with no equations, fitted parameters, or self-citations that reduce the 96% recovery metric to a self-referential definition or input. The central claim compares against prior task-specific models using independent benchmarks, making the derivation self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Elpasolite materials constitute an established benchmark for generative tasks in large chemical spaces
Reference graph
Works this paper leans on
-
[1]
T., Davies, D
Butler, K. T., Davies, D. W., Cartwright, H., Isayev, O. & Walsh, A. Machine learning for molecular and materials science.Nature559,547–555 (2018)
2018
-
[2]
& Aspuru-Guzik, A
Sanchez-Lengeling, B. & Aspuru-Guzik, A. Inverse molecular design using machine learning: Generative models for matter engineering.Science361,360–365 (2018)
2018
-
[3]
& Walsh, A
Park, H., Li, Z. & Walsh, A. Has generative artificial intelligence solved inverse materials design?Matter 7,2355–2367 (2024)
2024
-
[4]
Kneiding, H., Morán-González, L., Kuriakose, N., Nova, A. & Balcells, D. Inverse Design of Inorganic Compounds with Generative AI.arXiv preprint arXiv:2604.11827(2026)
Pith/arXiv arXiv 2026
-
[5]
& Ong, S
Hautier, G., Jain, A. & Ong, S. P. From the computer to the laboratory: materials discovery and design using first-principles calculations.Journal of Materials Science47,7317–7340 (2012)
2012
-
[6]
& Shi, S
Liu, Y., Zhao, T., Ju, W. & Shi, S. Materials discovery and design using machine learning.Journal of Materiomics3,159–177 (2017)
2017
-
[7]
& Marques, M
Wang, H.-C., Schmidt, J., Botti, S. & Marques, M. A. A high-throughput study of oxynitride, oxyfluoride and nitrofluoride perovskites.Journal of Materials Chemistry A9,8501–8513 (2021)
2021
-
[8]
& Marques, M
Borlido, P., Schmidt, J., Wang, H.-C., Botti, S. & Marques, M. A. Computational screening of materials with extreme gap deformation potentials.npj Computational Materials8,156 (2022)
2022
-
[9]
Cheng, M.et al.AI-driven materials design: a mini-review.arXiv preprint arXiv:2502.02905(2025)
arXiv 2025
-
[10]
& Grossman, J
Xie, T. & Grossman, J. C. Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties.Physical review letters120,145301 (2018)
2018
-
[11]
Batatia, I.et al.A foundation model for atomistic materials chemistry.The Journal of chemical physics 163(2025). 15
2025
-
[12]
Türk,H.,Landini,E.,Kunkel,C.,Margraf,J.T.&Reuter,K.Assessingdeepgenerativemodelsinchemical composition space.Chemistry of Materials34,9455–9467 (2022)
2022
-
[13]
Anstine, D. M. & Isayev, O. Generative models as an emerging paradigm in the chemical sciences.Journal of the American Chemical Society145,8736–8750 (2023)
2023
-
[14]
Du, Y.et al.Machine learning-aided generative molecular design.Nature Machine Intelligence6,589–604 (2024)
2024
-
[15]
Gómez-Bombarelli, R.et al.Automatic chemical design using a data-driven continuous representation of molecules.ACS central science4,268–276 (2018)
2018
-
[16]
Chen, Z.et al.Crystal structure prediction meets artificial intelligence.The Journal of Physical Chemistry Letters16,2581–2591 (2025)
2025
-
[17]
Zeni, C.et al.A generative model for inorganic materials design.Nature639,624–632 (2025)
2025
-
[18]
S., Walsh, A., Reuter, K
Jakob, K. S., Walsh, A., Reuter, K. & Margraf, J. T. Learning crystallographic disorder: bridging prediction and experiment in materials discovery.Advanced Materials38,e14226 (2026)
2026
-
[19]
Van, M.-H., Verma, P., Zhao, C. & Wu, X. A survey of AI for materials science: foundation models, LLM agents, datasets, and tools.arXiv preprint arXiv:2506.20743(2025)
arXiv 2025
-
[20]
& Wang, Q
Zhang, L., Liu, Z., Ni, B. & Wang, Q. Large Language Models (LLMs) for Materials Design.Advanced Functional Materials,e25897 (2025)
2025
-
[21]
M., Schwaller, P., Ortega-Guerrero, A
Jablonka, K. M., Schwaller, P., Ortega-Guerrero, A. & Smit, B. Leveraging large language models for predictive chemistry.Nature Machine Intelligence6,161–169 (2024)
2024
-
[22]
Guo, T.et al.What can large language models do in chemistry? a comprehensive benchmark on eight tasks.Advances in neural information processing systems36,59662–59688 (2023)
2023
-
[23]
Tang, Y.et al.Matterchat: A multi-modal llm for material science.arXiv preprint arXiv:2502.13107 (2025)
arXiv 2025
-
[24]
Bran, A.et al.Augmenting large language models with chemistry tools.Nature machine intelligence 6,525–535 (2024)
M. Bran, A.et al.Augmenting large language models with chemistry tools.Nature machine intelligence 6,525–535 (2024)
2024
-
[25]
& Barati Farimani, A
Ock, J., Guntuboina, C. & Barati Farimani, A. Catalyst energy prediction with CatBERTa: unveiling feature exploration strategies through large language models.ACS Catalysis13,16032–16044 (2023)
2023
-
[26]
Lv, S.et al.Bridging language models and computational materials science: A prompt-driven framework for material property prediction.Materials Genome Engineering Advances3,e70013 (2025)
2025
-
[27]
Mitchener, L.et al.Kosmos: An AI Scientist for Autonomous Discovery.arXiv preprint arXiv:2511.02824 (2025)
Pith/arXiv arXiv 2025
-
[28]
A., MacKnight, R., Kline, B
Boiko, D. A., MacKnight, R., Kline, B. & Gomes, G. Autonomous chemical research with large language models.Nature624,570–578 (2023). 16
2023
-
[29]
Nduma, R., Park, H. & Walsh, A. Crystalyse: a multi-tool agent for materials design.arXiv preprint arXiv:2512.00977(2025)
arXiv 2025
-
[30]
E.et al.A multi-agent system for automating scientific discovery.Nature,1–3 (2026)
Ghareeb, A. E.et al.A multi-agent system for automating scientific discovery.Nature,1–3 (2026)
2026
-
[31]
Aygün, E.et al.An AI system to help scientists write expert-level empirical software.Nature,1–3 (2026)
2026
-
[32]
Gottweis, J.et al.Accelerating scientific discovery with Co-Scientist.Nature,1–3 (2026)
2026
-
[33]
Gruver, N.et al.Fine-tuned language models generate stable inorganic materials as text.arXiv preprint arXiv:2402.04379(2024)
arXiv 2024
-
[34]
Flam-Shepherd, D. & Aspuru-Guzik, A. Language models can generate molecules, materials, and protein binding sites directly in three dimensions as xyz, cif, and pdb files.arXiv preprint arXiv:2305.05708(2023)
arXiv 2023
-
[35]
M., Butler, K
Antunes, L. M., Butler, K. T. & Grau-Crespo, R. Crystal structure generation with autoregressive large language modeling.Nature Communications15,10570 (2024)
2024
-
[36]
Wang, H.et al.Efficient evolutionary search over chemical space with large language models.arXiv preprint arXiv:2406.16976(2024)
arXiv 2024
-
[37]
ACS Central Science11,2108–2120 (2025)
Sun, K.et al.SynLlama: generating synthesizable molecules and their analogs with large language models. ACS Central Science11,2108–2120 (2025)
2025
-
[38]
arXiv preprint arXiv:2502.20933(2025)
Gan, J.et al.MatLLMSearch: Crystal Structure Discovery with Evolution-Guided Large Language Models. arXiv preprint arXiv:2502.20933(2025)
arXiv 2025
- [39]
-
[40]
Sprueill, H. W.et al.ChemReasoner: Heuristic search over a large language model’s knowledge space using quantum-chemical feedback.arXiv preprint arXiv:2402.10980(2024)
arXiv 2024
-
[41]
& Liu, B
Takahara, I., Mizoguchi, T. & Liu, B. Accelerated inorganic materials design with generative AI agents. Cell Reports Physical Science6(2025)
2025
-
[42]
A., Lindmaa, A., Von Lilienfeld, O
Faber, F. A., Lindmaa, A., Von Lilienfeld, O. A. & Armiento, R. Machine learning energies of 2 million elpasolite (ABC 2 D 6) crystals.Physical review letters117,135502 (2016)
2016
-
[43]
N., Walsh, A
Savory, C. N., Walsh, A. & Scanlon, D. O. Can Pb-free halide double perovskites support high-efficiency solar cells?ACS energy letters1,949–955 (2016)
2016
-
[44]
Landini, E., Reuter, K. & Oberhofer, H. Machine-learning based screening of lead-free halide double perovskites for photovoltaic applications.arXiv preprint arXiv:2208.12736(2022)
arXiv 2022
-
[45]
Singh, A.et al.Openai gpt-5 system card.arXiv preprint arXiv:2601.03267(2025)
Pith/arXiv arXiv 2025
-
[46]
Wei, J.et al.Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems35,24824–24837 (2022)
2022
-
[47]
Brown, T.et al.Language models are few-shot learners.Advances in neural information processing systems 33,1877–1901 (2020). 17
1901
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.