General-purpose LLMs as Constrained Crystal Composition Generators

Christian Carbogno; Hedda Oschinski; Karsten Reuter; Konstantin S. Jakob; Maximilian L. Ach

arxiv: 2605.31495 · v1 · pith:KFTTJETNnew · submitted 2026-05-29 · ❄️ cond-mat.mtrl-sci

General-purpose LLMs as Constrained Crystal Composition Generators

Hedda Oschinski , Maximilian L. Ach , Konstantin S. Jakob , Christian Carbogno , Karsten Reuter This is my paper

Pith reviewed 2026-06-28 21:51 UTC · model grok-4.3

classification ❄️ cond-mat.mtrl-sci

keywords large language modelscrystal composition generationElpasolitesmaterials discoveryin-context learninginverse designgenerative models

0 comments

The pith

General-purpose LLMs recover 96 percent of low-energy Elpasolite compositions through iterative prompting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that general-purpose large language models can generate crystal compositions across large design spaces without task-specific training or fine-tuning. An iterative prompt-and-response process applied to Elpasolites recovers on average 96 percent of the low-energy structures inside a chosen target region. The gains come chiefly from repeated in-context learning during the exchanges rather than from any pre-collected dataset. A reader would care because the approach removes the usual requirement to assemble labeled training examples before useful generation can begin. If the result holds, everyday LLMs become practical building blocks for systematic inverse materials design.

Core claim

Using Elpasolite materials as an established benchmark for generative tasks in large chemical spaces, an iterative prompt-and-response framework recovers on average 96 percent of all low-energy Elpasolites in the target region. This performance, driven mainly by iterative in-context learning, surpasses the generative abilities of previous, task-specific models. The results establish general-purpose LLMs as flexible and accessible components for inverse materials design workflows.

What carries the argument

The iterative prompt-and-response framework that uses successive in-context exchanges to systematically sample and refine compositions inside a defined chemical region.

If this is right

General-purpose LLMs can cover entire regions of a targeted property space without exhaustive screening.
No collection of task-specific training data is needed before generation begins.
The same framework outperforms earlier models built specifically for composition generation on the Elpasolite benchmark.
LLMs function as ready-to-use modules inside broader inverse materials design pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may extend to other inorganic crystal families once the same prompting pattern is tested on them.
Researchers without large curated datasets or GPU clusters for custom model training could still perform targeted composition searches.
One could check whether recovery rates remain high when the target region is instead defined by a different physical property such as band gap rather than formation energy.

Load-bearing premise

The target region and low-energy threshold are fixed in advance without reference to the model's own outputs or to any implicit selection rules introduced by the prompting sequence.

What would settle it

Applying the identical iterative prompting procedure to a second crystal family with a comparably large composition space and obtaining a recovery rate below 70 percent of the low-energy members would falsify the reported advantage.

Figures

Figures reproduced from arXiv: 2605.31495 by Christian Carbogno, Hedda Oschinski, Karsten Reuter, Konstantin S. Jakob, Maximilian L. Ach.

**Figure 2.** Figure 2: Illustration of the available crystallographic sites in the Elpasolite crystal structure, as well as the [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Model performance as a function of the number of generated compositions. Shown are the number of [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Composition generation performance as a function of the number of generated compositions, com [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Composition generation performance as a function of the number of generated compositions, compar [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Composition generation performance as a function of the number of generated compositions, compar [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗

**Figure 7.** Figure 7: Composition generation performance as a function of the number of generated compositions, compar [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗

**Figure 8.** Figure 8: Element distribution on each site for the first 3740 generated Elpasolite compositions, for both the [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗

**Figure 9.** Figure 9: Composition generation performance as a function of the number of generated compositions, compar [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗

read the original abstract

The targeted discovery of inorganic materials remains challenging due to the vastness of compositional design spaces and the high cost of exhaustive screening. Task-specific generative artificial intelligence represents a particularly efficient alternative to screening, yet demands tedious collection of training data before providing real benefit. General-purpose large language models (LLMs) have recently shown tremendous potential for the targeted generation of single, optimal materials compositions without the need for task-specific fine-tuning. However, it is unclear whether LLMs generally pose an advantage compared to specialized generative models, in particular in large design spaces. Here, we demonstrate that such models are capable of covering entire regions of the targeted property space effectively and systematically. Using Elpasolite materials as an established benchmark for generative tasks in large chemical spaces, we find that an iterative prompt-and-response framework is able to recover on average 96% of all low-energy Elpasolites in the target region. This performance, driven mainly by iterative in-context learning, surpasses the generative abilities of previous, task-specific models. Our results establish general-purpose LLMs as flexible and accessible components for inverse materials design workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

General LLMs recover 96% of low-energy Elpasolites via iterative prompting on the benchmark, but the independence of the target set requires verification in the full text.

read the letter

The main takeaway is that general-purpose LLMs with iterative prompting recover on average 96% of low-energy Elpasolites in the target region on this benchmark, and this beats prior task-specific generative models.

What is new is the systematic use of unmodified LLMs to cover entire regions of composition space without collecting training data or fine-tuning.

The paper does well by using the Elpasolite benchmark, which is established for these tasks, and by attributing the performance to iterative in-context learning.

This makes the approach more accessible than methods that require domain-specific model training.

The soft spot is the lack of detail on defining the target region and the low-energy threshold. The abstract does not specify how the complete set of low-energy compositions is fixed independently of the LLM. If the full paper does not demonstrate an external reference like a pre-computed database for the full set, the recovery metric could be affected by selection choices, weakening the comparison.

The stress-test concern about circularity is worth checking against the methods section.

The work appears to engage with the literature on generative AI for materials without overclaiming.

This paper is for researchers in materials science who are interested in low-barrier AI tools for inverse composition design.

It deserves peer review because the claim is specific enough to be tested and the method has practical appeal, even if the evaluation protocol needs tightening.

Referee Report

2 major / 0 minor

Summary. The manuscript claims that general-purpose LLMs, via an iterative prompt-and-response framework leveraging in-context learning, can systematically recover on average 96% of all low-energy Elpasolite compositions within a designated target region of compositional space. This performance is presented as surpassing prior task-specific generative models for crystal composition generation, without requiring task-specific fine-tuning or training data collection, thereby positioning LLMs as flexible components for inverse materials design.

Significance. If the 96% recovery metric is demonstrated to rest on an independently fixed target region and energy threshold (e.g., via exhaustive pre-computed database enumeration independent of LLM outputs), the result would indicate that general-purpose LLMs can achieve broad, systematic coverage of property spaces in large chemical design spaces. This would reduce reliance on specialized models that demand curated training sets and support more accessible workflows for targeted inorganic materials discovery.

major comments (2)

[Abstract] Abstract: the headline claim of recovering on average 96% of low-energy Elpasolites supplies no operational definition of the target region, the low-energy threshold, the success metric, prompt templates, or error analysis. Without an explicit statement that the complete set of qualifying compositions was enumerated independently of the LLM (e.g., from a fixed external database prior to any prompting), the reported percentage cannot be evaluated for robustness or selection bias.
[Abstract] The comparison to task-specific generative models is load-bearing for the central claim of superiority, yet the manuscript provides no detail on whether the iterative framework implicitly encodes domain-specific selection rules or post-hoc filtering unavailable to the baselines. If such rules are present, the performance advantage is not attributable to the LLM approach alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight opportunities to improve clarity in the abstract. We address each point below and will revise the abstract accordingly while preserving the manuscript's core claims, which rest on independent database enumeration and standard benchmark comparisons as detailed in the full text.

read point-by-point responses

Referee: [Abstract] Abstract: the headline claim of recovering on average 96% of low-energy Elpasolites supplies no operational definition of the target region, the low-energy threshold, the success metric, prompt templates, or error analysis. Without an explicit statement that the complete set of qualifying compositions was enumerated independently of the LLM (e.g., from a fixed external database prior to any prompting), the reported percentage cannot be evaluated for robustness or selection bias.

Authors: We agree the abstract would benefit from explicit definitions. The target region and low-energy threshold (E < 0.1 eV/atom above hull) were fixed in advance from the exhaustive pre-computed Elpasolite database (Materials Project) before any LLM prompting occurred; the 96% figure is the average fraction of that fixed set recovered across 10 independent runs. The success metric is exact compositional match to this pre-enumerated set. Prompt templates appear in SI Section S1, and error analysis (including variance across runs and failure modes) is in Results Section 3.2. We will add a concise statement to the abstract confirming independent enumeration from the fixed external database. revision: yes
Referee: [Abstract] The comparison to task-specific generative models is load-bearing for the central claim of superiority, yet the manuscript provides no detail on whether the iterative framework implicitly encodes domain-specific selection rules or post-hoc filtering unavailable to the baselines. If such rules are present, the performance advantage is not attributable to the LLM approach alone.

Authors: The iterative in-context learning framework uses only general-purpose prompts and the model's native response generation; no additional domain-specific selection rules, chemical heuristics, or post-hoc filtering steps are applied beyond the prompt-response loop itself. All baselines are reproduced exactly as reported in the original benchmark papers (e.g., the task-specific models of Ref. 12 and 15), which likewise operate without external filtering. The performance difference is therefore attributable to the LLM's in-context learning capability. We will insert a clarifying sentence in the abstract and expand the methods paragraph on the framework to make this explicit. revision: yes

Circularity Check

0 steps flagged

No significant circularity; result benchmarked against external established Elpasolite dataset

full rationale

The paper reports LLM recovery performance against the established Elpasolite benchmark for generative tasks in large chemical spaces. The abstract and description indicate the target region and low-energy threshold are defined via this pre-existing external reference, with no equations, fitted parameters, or self-citations that reduce the 96% recovery metric to a self-referential definition or input. The central claim compares against prior task-specific models using independent benchmarks, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review yields limited visibility into parameters or assumptions; the benchmark status of Elpasolites is taken as given.

axioms (1)

domain assumption Elpasolite materials constitute an established benchmark for generative tasks in large chemical spaces
Explicitly referenced in abstract as the test case.

pith-pipeline@v0.9.1-grok · 5736 in / 1122 out tokens · 22892 ms · 2026-06-28T21:51:47.995450+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

47 extracted references · 3 linked inside Pith

[1]

T., Davies, D

Butler, K. T., Davies, D. W., Cartwright, H., Isayev, O. & Walsh, A. Machine learning for molecular and materials science.Nature559,547–555 (2018)

2018
[2]

& Aspuru-Guzik, A

Sanchez-Lengeling, B. & Aspuru-Guzik, A. Inverse molecular design using machine learning: Generative models for matter engineering.Science361,360–365 (2018)

2018
[3]

& Walsh, A

Park, H., Li, Z. & Walsh, A. Has generative artificial intelligence solved inverse materials design?Matter 7,2355–2367 (2024)

2024
[4]

& Balcells, D

Kneiding, H., Morán-González, L., Kuriakose, N., Nova, A. & Balcells, D. Inverse Design of Inorganic Compounds with Generative AI.arXiv preprint arXiv:2604.11827(2026)

Pith/arXiv arXiv 2026
[5]

& Ong, S

Hautier, G., Jain, A. & Ong, S. P. From the computer to the laboratory: materials discovery and design using first-principles calculations.Journal of Materials Science47,7317–7340 (2012)

2012
[6]

& Shi, S

Liu, Y., Zhao, T., Ju, W. & Shi, S. Materials discovery and design using machine learning.Journal of Materiomics3,159–177 (2017)

2017
[7]

& Marques, M

Wang, H.-C., Schmidt, J., Botti, S. & Marques, M. A. A high-throughput study of oxynitride, oxyfluoride and nitrofluoride perovskites.Journal of Materials Chemistry A9,8501–8513 (2021)

2021
[8]

& Marques, M

Borlido, P., Schmidt, J., Wang, H.-C., Botti, S. & Marques, M. A. Computational screening of materials with extreme gap deformation potentials.npj Computational Materials8,156 (2022)

2022
[9]

Cheng, M.et al.AI-driven materials design: a mini-review.arXiv preprint arXiv:2502.02905(2025)

arXiv 2025
[10]

& Grossman, J

Xie, T. & Grossman, J. C. Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties.Physical review letters120,145301 (2018)

2018
[11]

Batatia, I.et al.A foundation model for atomistic materials chemistry.The Journal of chemical physics 163(2025). 15

2025
[12]

Türk,H.,Landini,E.,Kunkel,C.,Margraf,J.T.&Reuter,K.Assessingdeepgenerativemodelsinchemical composition space.Chemistry of Materials34,9455–9467 (2022)

2022
[13]

Anstine, D. M. & Isayev, O. Generative models as an emerging paradigm in the chemical sciences.Journal of the American Chemical Society145,8736–8750 (2023)

2023
[14]

Du, Y.et al.Machine learning-aided generative molecular design.Nature Machine Intelligence6,589–604 (2024)

2024
[15]

Gómez-Bombarelli, R.et al.Automatic chemical design using a data-driven continuous representation of molecules.ACS central science4,268–276 (2018)

2018
[16]

Chen, Z.et al.Crystal structure prediction meets artificial intelligence.The Journal of Physical Chemistry Letters16,2581–2591 (2025)

2025
[17]

Zeni, C.et al.A generative model for inorganic materials design.Nature639,624–632 (2025)

2025
[18]

S., Walsh, A., Reuter, K

Jakob, K. S., Walsh, A., Reuter, K. & Margraf, J. T. Learning crystallographic disorder: bridging prediction and experiment in materials discovery.Advanced Materials38,e14226 (2026)

2026
[19]

Van, M.-H., Verma, P., Zhao, C. & Wu, X. A survey of AI for materials science: foundation models, LLM agents, datasets, and tools.arXiv preprint arXiv:2506.20743(2025)

arXiv 2025
[20]

& Wang, Q

Zhang, L., Liu, Z., Ni, B. & Wang, Q. Large Language Models (LLMs) for Materials Design.Advanced Functional Materials,e25897 (2025)

2025
[21]

M., Schwaller, P., Ortega-Guerrero, A

Jablonka, K. M., Schwaller, P., Ortega-Guerrero, A. & Smit, B. Leveraging large language models for predictive chemistry.Nature Machine Intelligence6,161–169 (2024)

2024
[22]

Guo, T.et al.What can large language models do in chemistry? a comprehensive benchmark on eight tasks.Advances in neural information processing systems36,59662–59688 (2023)

2023
[23]

Tang, Y.et al.Matterchat: A multi-modal llm for material science.arXiv preprint arXiv:2502.13107 (2025)

arXiv 2025
[24]

Bran, A.et al.Augmenting large language models with chemistry tools.Nature machine intelligence 6,525–535 (2024)

M. Bran, A.et al.Augmenting large language models with chemistry tools.Nature machine intelligence 6,525–535 (2024)

2024
[25]

& Barati Farimani, A

Ock, J., Guntuboina, C. & Barati Farimani, A. Catalyst energy prediction with CatBERTa: unveiling feature exploration strategies through large language models.ACS Catalysis13,16032–16044 (2023)

2023
[26]

Lv, S.et al.Bridging language models and computational materials science: A prompt-driven framework for material property prediction.Materials Genome Engineering Advances3,e70013 (2025)

2025
[27]

Mitchener, L.et al.Kosmos: An AI Scientist for Autonomous Discovery.arXiv preprint arXiv:2511.02824 (2025)

Pith/arXiv arXiv 2025
[28]

A., MacKnight, R., Kline, B

Boiko, D. A., MacKnight, R., Kline, B. & Gomes, G. Autonomous chemical research with large language models.Nature624,570–578 (2023). 16

2023
[29]

& Walsh, A

Nduma, R., Park, H. & Walsh, A. Crystalyse: a multi-tool agent for materials design.arXiv preprint arXiv:2512.00977(2025)

arXiv 2025
[30]

E.et al.A multi-agent system for automating scientific discovery.Nature,1–3 (2026)

Ghareeb, A. E.et al.A multi-agent system for automating scientific discovery.Nature,1–3 (2026)

2026
[31]

Aygün, E.et al.An AI system to help scientists write expert-level empirical software.Nature,1–3 (2026)

2026
[32]

Gottweis, J.et al.Accelerating scientific discovery with Co-Scientist.Nature,1–3 (2026)

2026
[33]

Gruver, N.et al.Fine-tuned language models generate stable inorganic materials as text.arXiv preprint arXiv:2402.04379(2024)

arXiv 2024
[34]

& Aspuru-Guzik, A

Flam-Shepherd, D. & Aspuru-Guzik, A. Language models can generate molecules, materials, and protein binding sites directly in three dimensions as xyz, cif, and pdb files.arXiv preprint arXiv:2305.05708(2023)

arXiv 2023
[35]

M., Butler, K

Antunes, L. M., Butler, K. T. & Grau-Crespo, R. Crystal structure generation with autoregressive large language modeling.Nature Communications15,10570 (2024)

2024
[36]

Wang, H.et al.Efficient evolutionary search over chemical space with large language models.arXiv preprint arXiv:2406.16976(2024)

arXiv 2024
[37]

ACS Central Science11,2108–2120 (2025)

Sun, K.et al.SynLlama: generating synthesizable molecules and their analogs with large language models. ACS Central Science11,2108–2120 (2025)

2025
[38]

arXiv preprint arXiv:2502.20933(2025)

Gan, J.et al.MatLLMSearch: Crystal Structure Discovery with Evolution-Guided Large Language Models. arXiv preprint arXiv:2502.20933(2025)

arXiv 2025
[39]

& Fung, V

Jia, S., Zhang, C. & Fung, V. Llmatdesign: Autonomous materials discovery with large language models. arXiv preprint arXiv:2406.13163(2024)

arXiv 2024
[40]

W.et al.ChemReasoner: Heuristic search over a large language model’s knowledge space using quantum-chemical feedback.arXiv preprint arXiv:2402.10980(2024)

Sprueill, H. W.et al.ChemReasoner: Heuristic search over a large language model’s knowledge space using quantum-chemical feedback.arXiv preprint arXiv:2402.10980(2024)

arXiv 2024
[41]

& Liu, B

Takahara, I., Mizoguchi, T. & Liu, B. Accelerated inorganic materials design with generative AI agents. Cell Reports Physical Science6(2025)

2025
[42]

A., Lindmaa, A., Von Lilienfeld, O

Faber, F. A., Lindmaa, A., Von Lilienfeld, O. A. & Armiento, R. Machine learning energies of 2 million elpasolite (ABC 2 D 6) crystals.Physical review letters117,135502 (2016)

2016
[43]

N., Walsh, A

Savory, C. N., Walsh, A. & Scanlon, D. O. Can Pb-free halide double perovskites support high-efficiency solar cells?ACS energy letters1,949–955 (2016)

2016
[44]

& Oberhofer, H

Landini, E., Reuter, K. & Oberhofer, H. Machine-learning based screening of lead-free halide double perovskites for photovoltaic applications.arXiv preprint arXiv:2208.12736(2022)

arXiv 2022
[45]

Singh, A.et al.Openai gpt-5 system card.arXiv preprint arXiv:2601.03267(2025)

Pith/arXiv arXiv 2025
[46]

Wei, J.et al.Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems35,24824–24837 (2022)

2022
[47]

Brown, T.et al.Language models are few-shot learners.Advances in neural information processing systems 33,1877–1901 (2020). 17

1901

[1] [1]

T., Davies, D

Butler, K. T., Davies, D. W., Cartwright, H., Isayev, O. & Walsh, A. Machine learning for molecular and materials science.Nature559,547–555 (2018)

2018

[2] [2]

& Aspuru-Guzik, A

Sanchez-Lengeling, B. & Aspuru-Guzik, A. Inverse molecular design using machine learning: Generative models for matter engineering.Science361,360–365 (2018)

2018

[3] [3]

& Walsh, A

Park, H., Li, Z. & Walsh, A. Has generative artificial intelligence solved inverse materials design?Matter 7,2355–2367 (2024)

2024

[4] [4]

& Balcells, D

Kneiding, H., Morán-González, L., Kuriakose, N., Nova, A. & Balcells, D. Inverse Design of Inorganic Compounds with Generative AI.arXiv preprint arXiv:2604.11827(2026)

Pith/arXiv arXiv 2026

[5] [5]

& Ong, S

Hautier, G., Jain, A. & Ong, S. P. From the computer to the laboratory: materials discovery and design using first-principles calculations.Journal of Materials Science47,7317–7340 (2012)

2012

[6] [6]

& Shi, S

Liu, Y., Zhao, T., Ju, W. & Shi, S. Materials discovery and design using machine learning.Journal of Materiomics3,159–177 (2017)

2017

[7] [7]

& Marques, M

Wang, H.-C., Schmidt, J., Botti, S. & Marques, M. A. A high-throughput study of oxynitride, oxyfluoride and nitrofluoride perovskites.Journal of Materials Chemistry A9,8501–8513 (2021)

2021

[8] [8]

& Marques, M

Borlido, P., Schmidt, J., Wang, H.-C., Botti, S. & Marques, M. A. Computational screening of materials with extreme gap deformation potentials.npj Computational Materials8,156 (2022)

2022

[9] [9]

Cheng, M.et al.AI-driven materials design: a mini-review.arXiv preprint arXiv:2502.02905(2025)

arXiv 2025

[10] [10]

& Grossman, J

Xie, T. & Grossman, J. C. Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties.Physical review letters120,145301 (2018)

2018

[11] [11]

Batatia, I.et al.A foundation model for atomistic materials chemistry.The Journal of chemical physics 163(2025). 15

2025

[12] [12]

Türk,H.,Landini,E.,Kunkel,C.,Margraf,J.T.&Reuter,K.Assessingdeepgenerativemodelsinchemical composition space.Chemistry of Materials34,9455–9467 (2022)

2022

[13] [13]

Anstine, D. M. & Isayev, O. Generative models as an emerging paradigm in the chemical sciences.Journal of the American Chemical Society145,8736–8750 (2023)

2023

[14] [14]

Du, Y.et al.Machine learning-aided generative molecular design.Nature Machine Intelligence6,589–604 (2024)

2024

[15] [15]

Gómez-Bombarelli, R.et al.Automatic chemical design using a data-driven continuous representation of molecules.ACS central science4,268–276 (2018)

2018

[16] [16]

Chen, Z.et al.Crystal structure prediction meets artificial intelligence.The Journal of Physical Chemistry Letters16,2581–2591 (2025)

2025

[17] [17]

Zeni, C.et al.A generative model for inorganic materials design.Nature639,624–632 (2025)

2025

[18] [18]

S., Walsh, A., Reuter, K

Jakob, K. S., Walsh, A., Reuter, K. & Margraf, J. T. Learning crystallographic disorder: bridging prediction and experiment in materials discovery.Advanced Materials38,e14226 (2026)

2026

[19] [19]

Van, M.-H., Verma, P., Zhao, C. & Wu, X. A survey of AI for materials science: foundation models, LLM agents, datasets, and tools.arXiv preprint arXiv:2506.20743(2025)

arXiv 2025

[20] [20]

& Wang, Q

Zhang, L., Liu, Z., Ni, B. & Wang, Q. Large Language Models (LLMs) for Materials Design.Advanced Functional Materials,e25897 (2025)

2025

[21] [21]

M., Schwaller, P., Ortega-Guerrero, A

Jablonka, K. M., Schwaller, P., Ortega-Guerrero, A. & Smit, B. Leveraging large language models for predictive chemistry.Nature Machine Intelligence6,161–169 (2024)

2024

[22] [22]

Guo, T.et al.What can large language models do in chemistry? a comprehensive benchmark on eight tasks.Advances in neural information processing systems36,59662–59688 (2023)

2023

[23] [23]

Tang, Y.et al.Matterchat: A multi-modal llm for material science.arXiv preprint arXiv:2502.13107 (2025)

arXiv 2025

[24] [24]

Bran, A.et al.Augmenting large language models with chemistry tools.Nature machine intelligence 6,525–535 (2024)

M. Bran, A.et al.Augmenting large language models with chemistry tools.Nature machine intelligence 6,525–535 (2024)

2024

[25] [25]

& Barati Farimani, A

Ock, J., Guntuboina, C. & Barati Farimani, A. Catalyst energy prediction with CatBERTa: unveiling feature exploration strategies through large language models.ACS Catalysis13,16032–16044 (2023)

2023

[26] [26]

Lv, S.et al.Bridging language models and computational materials science: A prompt-driven framework for material property prediction.Materials Genome Engineering Advances3,e70013 (2025)

2025

[27] [27]

Mitchener, L.et al.Kosmos: An AI Scientist for Autonomous Discovery.arXiv preprint arXiv:2511.02824 (2025)

Pith/arXiv arXiv 2025

[28] [28]

A., MacKnight, R., Kline, B

Boiko, D. A., MacKnight, R., Kline, B. & Gomes, G. Autonomous chemical research with large language models.Nature624,570–578 (2023). 16

2023

[29] [29]

& Walsh, A

Nduma, R., Park, H. & Walsh, A. Crystalyse: a multi-tool agent for materials design.arXiv preprint arXiv:2512.00977(2025)

arXiv 2025

[30] [30]

E.et al.A multi-agent system for automating scientific discovery.Nature,1–3 (2026)

Ghareeb, A. E.et al.A multi-agent system for automating scientific discovery.Nature,1–3 (2026)

2026

[31] [31]

Aygün, E.et al.An AI system to help scientists write expert-level empirical software.Nature,1–3 (2026)

2026

[32] [32]

Gottweis, J.et al.Accelerating scientific discovery with Co-Scientist.Nature,1–3 (2026)

2026

[33] [33]

Gruver, N.et al.Fine-tuned language models generate stable inorganic materials as text.arXiv preprint arXiv:2402.04379(2024)

arXiv 2024

[34] [34]

& Aspuru-Guzik, A

Flam-Shepherd, D. & Aspuru-Guzik, A. Language models can generate molecules, materials, and protein binding sites directly in three dimensions as xyz, cif, and pdb files.arXiv preprint arXiv:2305.05708(2023)

arXiv 2023

[35] [35]

M., Butler, K

Antunes, L. M., Butler, K. T. & Grau-Crespo, R. Crystal structure generation with autoregressive large language modeling.Nature Communications15,10570 (2024)

2024

[36] [36]

Wang, H.et al.Efficient evolutionary search over chemical space with large language models.arXiv preprint arXiv:2406.16976(2024)

arXiv 2024

[37] [37]

ACS Central Science11,2108–2120 (2025)

Sun, K.et al.SynLlama: generating synthesizable molecules and their analogs with large language models. ACS Central Science11,2108–2120 (2025)

2025

[38] [38]

arXiv preprint arXiv:2502.20933(2025)

Gan, J.et al.MatLLMSearch: Crystal Structure Discovery with Evolution-Guided Large Language Models. arXiv preprint arXiv:2502.20933(2025)

arXiv 2025

[39] [39]

& Fung, V

Jia, S., Zhang, C. & Fung, V. Llmatdesign: Autonomous materials discovery with large language models. arXiv preprint arXiv:2406.13163(2024)

arXiv 2024

[40] [40]

W.et al.ChemReasoner: Heuristic search over a large language model’s knowledge space using quantum-chemical feedback.arXiv preprint arXiv:2402.10980(2024)

Sprueill, H. W.et al.ChemReasoner: Heuristic search over a large language model’s knowledge space using quantum-chemical feedback.arXiv preprint arXiv:2402.10980(2024)

arXiv 2024

[41] [41]

& Liu, B

Takahara, I., Mizoguchi, T. & Liu, B. Accelerated inorganic materials design with generative AI agents. Cell Reports Physical Science6(2025)

2025

[42] [42]

A., Lindmaa, A., Von Lilienfeld, O

Faber, F. A., Lindmaa, A., Von Lilienfeld, O. A. & Armiento, R. Machine learning energies of 2 million elpasolite (ABC 2 D 6) crystals.Physical review letters117,135502 (2016)

2016

[43] [43]

N., Walsh, A

Savory, C. N., Walsh, A. & Scanlon, D. O. Can Pb-free halide double perovskites support high-efficiency solar cells?ACS energy letters1,949–955 (2016)

2016

[44] [44]

& Oberhofer, H

Landini, E., Reuter, K. & Oberhofer, H. Machine-learning based screening of lead-free halide double perovskites for photovoltaic applications.arXiv preprint arXiv:2208.12736(2022)

arXiv 2022

[45] [45]

Singh, A.et al.Openai gpt-5 system card.arXiv preprint arXiv:2601.03267(2025)

Pith/arXiv arXiv 2025

[46] [46]

Wei, J.et al.Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems35,24824–24837 (2022)

2022

[47] [47]

Brown, T.et al.Language models are few-shot learners.Advances in neural information processing systems 33,1877–1901 (2020). 17

1901