
arxiv: 2606.21395 · v1 · pith:SQULBE3Unew · submitted 2026-06-19 · 💻 cs.LG · cond-mat.mtrl-sci

Atomistic Language Models Understand and Generate Materials

Pith reviewed 2026-06-26 14:53 UTC · model grok-4.3

classification 💻 cs.LG cond-mat.mtrl-sci
keywords atomistic language models · crystal structure generation · denoising diffusion models · text-conditioned materials design · multimodal models · continuous projectors · Feynman-Kac sampling

The pith

Atomistic Language Models map language embeddings continuously into diffusion space to generate and optimize crystals from text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a single model backbone can natively handle both natural language instructions and 3D atomistic structures by connecting a pretrained atomistic encoder, a large language model, and a denoising diffusion model through continuous projectors and staged training. This approach avoids calling separate tools or using lossy text encodings of structures. A sympathetic reader would care because it enables direct text-driven crystal structure prediction, de novo generation, and optimization in one system. The work introduces a continuous bridge between language embeddings and the diffusion steering space plus a particle-based sampler called Text-to-Crystal Feynman-Kac to enforce stoichiometry during inference.

Core claim

By unifying a pretrained atomistic encoder, large language model, and denoising diffusion model through purely continuous projectors and staged training, ALMs achieve state-of-the-art results on crystal structure prediction and de novo generation. ALMs are enabled by a continuous bridge that maps language model embeddings directly into the steering space of atomistic diffusion, and are assisted by Text-to-Crystal Feynman-Kac (T2C-FK), a particle-based sampler that scores partial denoising trajectories to enforce stoichiometric targets at inference time.
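To make the particle-based idea concrete, here is a toy sketch of Feynman-Kac style steering. It is not the paper's T2C-FK implementation: `toy_denoise_step` is a stand-in that randomly re-draws species in place of a real diffusion model, and `stoichiometry_reward`, `resample`, `fk_sample`, the element vocabulary, and the `temperature` parameter are all hypothetical names chosen for illustration. The only feature carried over from the description above is the loop structure: several particles are advanced in parallel, partial trajectories are scored against a stoichiometric target, and well-scoring particles are resampled more often.

```python
import math
import random
from collections import Counter


def stoichiometry_reward(atoms, target):
    """Negative L1 distance between element fractions (0 means a perfect match)."""
    counts = Counter(atoms)
    n_atoms = len(atoms)
    target_total = sum(target.values())
    elements = set(counts) | set(target)
    return -sum(
        abs(counts.get(e, 0) / n_atoms - target.get(e, 0) / target_total)
        for e in elements
    )


def resample(particles, rewards, temperature, rng):
    """Draw particles with probability proportional to exp(reward / temperature)."""
    best = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp((r - best) / temperature) for r in rewards]
    return rng.choices(particles, weights=weights, k=len(particles))


def toy_denoise_step(atoms, vocab, flip_prob, rng):
    """Stand-in for one reverse-diffusion step (assumption, not a real model)."""
    return [rng.choice(vocab) if rng.random() < flip_prob else a for a in atoms]


def fk_sample(target, n_atoms=6, n_particles=32, n_steps=20, seed=0):
    rng = random.Random(seed)
    vocab = ["Li", "O", "Fe", "Na"]  # illustrative element set
    particles = [
        [rng.choice(vocab) for _ in range(n_atoms)] for _ in range(n_particles)
    ]
    for step in range(n_steps):
        flip_prob = 0.5 * (1 - step / n_steps)  # less noise as sampling proceeds
        particles = [toy_denoise_step(p, vocab, flip_prob, rng) for p in particles]
        rewards = [stoichiometry_reward(p, target) for p in particles]
        particles = resample(particles, rewards, temperature=0.1, rng=rng)
    return max(particles, key=lambda p: stoichiometry_reward(p, target))


if __name__ == "__main__":
    print(fk_sample({"Li": 2, "O": 1}))
```

A lower `temperature` concentrates resampling on the best-matching particles at the cost of diversity; the value used here is arbitrary.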

What carries the argument

The continuous bridge that maps language model embeddings directly into the steering space of atomistic diffusion, together with staged training of the unified components.

If this is right

  • ALMs can take 3D atom coordinates or natural-language prompts as input and output optimized crystal structures.
  • The Text-to-Crystal Feynman-Kac sampler enforces exact stoichiometry during the denoising process at inference time.
  • A new benchmark, ALM Bench, provides standardized evaluation for text-conditioned crystal generation and optimization tasks.
  • The architecture supports native multimodality without fine-tuning the language model on textual encodings of structures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the continuous projectors generalize, the same bridging technique could connect language models to other continuous generative models beyond diffusion, such as flow-matching frameworks for molecules.
  • Success on crystal tasks suggests the method might extend to non-periodic systems like molecules or surfaces once equivalent atomistic encoders are swapped in.
  • The staged training schedule implies that freezing the atomistic encoder early prevents catastrophic forgetting of structural priors when the language component is added.

Load-bearing premise

Mapping language-model embeddings directly into the steering space of atomistic diffusion via continuous projectors preserves enough atomistic detail to beat prior separate-tool or lossy-text methods without creating new interface failures.

What would settle it

A head-to-head test on the ALM Bench where text-prompted structures generated by the unified ALM model show lower success rates or worse property matches than an equivalent pipeline that keeps the language model and diffusion model as separate tools.

Figures

Figures reproduced from arXiv: 2606.21395 by Ju Li, Krithik Ramesh, Rafael Gómez-Bombarelli, Sathya Edamadaka.

Figure 1: Atomistic Language Models bridge natural language and 3D atomic coordinates to understand, generate, and optimize materials. This new paradigm allows a single autoregressive backbone to characterize the structure, properties, and applications of a material, as well as guide the discovery of new ones, all without lossy text representations.
Figure 2: Atomistic Language Models understand atoms as soft tokens from a machine learning interatomic potential and generate inorganic crystals by steering diffusion models with classifier-free guidance. A. ALMs are comprised of an MLIP encoder, LLM, and diffusion decoder, unified by continuous projectors. B. Staged curriculum training which progressively unfreezes the model and instruction-tunes it, enabling prop…
Figure 3: A language-to-atomistic bridge enables the steering of crystal generation. ALM Edit uses all components above. ALM Gen swaps the Q-Former-style [23] producer for a lightweight per-token MLP (no learned queries or prompt context) feeding the same consumer, and does not emit composition embeddings. The decoder D observes atomic-number assignment A, initializing each node accordingly and never changing any at…
Figure 4: Text-to-Crystal Feynman–Kac (T2C-FK) enables ALM Gen, a de novo model, to generate structures with desired element sets and stoichiometry ratios. A. Unphysical structures are removed throughout sampling, and any differences from the reference stoichiometry are fixed at the last step via Hungarian scoring. ALM Edit is designed to output a material with the desired element set and stoichiometry ratio. ALM Ge…
Figure 5: Atomistic Language Models can accurately predict physical properties of materials. Spider performance plots for selected materials property prediction tasks from A. LLM4Mat-Bench [26] (MAD/MAE, with a performance threshold of ≥ 5) and B. MatterChat (MAE, baseline from and normalized to [19]). Parity plots are shown for formation energy per atom (E_f) using C. Materials Project data [36] and D. GNoME data […
Figure 6: Strong scaling laws emerge under fixed training and evaluation for several property prediction tasks. A. For increasing Qwen3 model size, property prediction performance on several tasks, including JARVIS-QETB potential energy per atom above, improves monotonically in MAD/MAE on LLM4Mat-Bench. B. Representational analysis of embeddings extracted from each continuous latent space throughout ALM Edit for 2,0…
Figure 7: CSP M@K=64 is flat in denoising timesteps T. (ALM Edit, MP-20). The bridge is trained with classifier-free-guidance dropout: with probability p_drop=0.2 per step the alm_embedding conditioning is replaced by a learned zeros vector, so the network jointly learns the conditional and unconditional scores. At inference we apply the standard CFG extrapolation, $\tilde{s}_\theta(u_t, A_t, t \mid C, g) = s_\theta(\cdot \mid \varnothing) + g \cdot$ …
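The conditioning dropout and guidance step described in this caption can be sketched in a few lines. This is a minimal illustration, not the paper's code: the caption's formula is cut off after `g ·`, so the bracketed term is assumed to be the usual classifier-free guidance difference between conditional and unconditional scores, which matches the caption's statement elsewhere that g = 0 gives the unconditional output. The function names `cfg_dropout` and `cfg_score` are hypothetical, and scores are plain Python lists standing in for model outputs.

```python
import random


def cfg_dropout(conditioning, null_vector, p_drop, rng):
    """Training-time dropout: swap the conditioning for a null vector with prob. p_drop."""
    return null_vector if rng.random() < p_drop else conditioning


def cfg_score(score_cond, score_uncond, g):
    """Assumed standard CFG form: s_uncond + g * (s_cond - s_uncond), elementwise."""
    return [u + g * (c - u) for c, u in zip(score_cond, score_uncond)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Roughly 20% of training steps see the null conditioning when p_drop = 0.2.
    seen = [cfg_dropout("prompt", "null", 0.2, rng) for _ in range(10)]
    print(seen)
    # g = 0 recovers the unconditional score, g = 1 the conditional one.
    print(cfg_score([1.0, 2.0], [0.0, 1.0], g=0.5))
```

Under this assumed form, values of g above 1 extrapolate past the conditional score, which is the regime where guidance strength trades sample fidelity against prompt adherence.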
Figure 8: Auxiliary supervision target comparison. Composition BCE (the optimum) vs aux-off across MSUN, perovskite any-of-K, and the LM judge; aux-off keeps MSUN but loses prompt-following entirely. The contrastive loss over atomistic token hidden states Z is also essential; without it, Z collapses to a cosine distance of 0.12 between different prompts. With λ_aux=1, the average cosine distance between Z across …
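The collapse diagnostic mentioned in this caption, an average cosine distance between hidden states for different prompts, can be computed as below. This is a sketch under stated assumptions: the caption does not say how Z is pooled, so each prompt is assumed to yield a single vector, and the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mean_pairwise_cosine_distance(vectors):
    """Average cosine distance over all unordered pairs of prompt vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Nearly parallel vectors give a distance near 0 (collapsed representations).
    collapsed = [[1.0, 0.0], [0.99, 0.05], [0.98, 0.02]]
    # Spread-out vectors give a larger average distance.
    spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    print(mean_pairwise_cosine_distance(collapsed))
    print(mean_pairwise_cosine_distance(spread))
```

A value near 0 means different prompts map to almost the same direction, which is the failure the caption attributes to removing the contrastive loss.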
Figure 9: SMACT charge-validity of the training compositions. Left, middle: MatterSim E_h versus DFT formation energy and density, coloured by SMACT charge-validity. Most training compositions are charge-invalid and high-E_h (lower-right quadrant). Right: SMACT charge-valid fraction by training source. In addition, the aggregate distribution of materials that we post-train both models on has a large amount of volume a…
Figure 10: Right-tailed energy distribution of the generation training corpus. Main shows per-bucket MatterSim energy above hull (E_h, eV/atom) on a log-count axis over the full range. The mass of the distribution sits far above the hull with a tail extending to ∼8 eV/atom (the 92 percentile are OQMD/AFLOW structures). The metastable (E_h≤0.10, dashed) and stable (E_h≤0.016, dotted) thresholds are marked, and the inset…
Figure 11: ALM Edit CFG guidance scale tradeoff between CSP and inverse design. Stronger g hurts crystal structure prediction (left, Match@K and RMSE for matches) but helps ALM Bench inverse design performance. The guidance score produces very different behavior for ALM Gen. Here, g controls how much a global conditioning vector steers MatterGen Base away from its strongly performing frozen base. There is a positive…
Figure 12: CFG guidance scale sweep for ALM Gen. SUN (E_h ≤ 0.016, left axis, circles) and MSUN (E_h ≤ 0.1, right axis, squares) depend heavily on g. The producer is a shallow learnable-query transformer [70, 23] that compresses S into a fixed-shape conditioning sequence C = f_P
Figure 13: Generation LoRA learning-rate sweep. MSUN as a function of fresh-rank-8 LoRA learning rate (log scale). lr=0 leaves the atomistic tokens out-of-vocabulary; lr=2e-4 collapses cross-prompt cosine distance to 0.12 (triangles on the right axis). lr=1e-5 is the selected setting. An ablation over full- versus LoRA-finetuning ALM Core to develop Gen is shown in
Figure 14: Alternate bridge architectures on DNG SUN and CSP M@K. Left: MSUN (E_hull < 0.1) at each bridge's optimal g. Right: CSP M@K at K = 64 on the CSP-mode backbone. N = 500 rows were drawn for each task from the test set, explaining the higher-than-reported CSP and MSUN than ALMs achieve in the main text. [Plot residue from an adjacent figure; recoverable axis names: Mean E_hull (eV/atom, lower is better); Metastable fracti…]
Figure 15: De-novo generative metrics across condition-token count M ∈ {8, 16, 32, 64}. A.3 Text-to-Crystal Feynman-Kac algorithm details. Text-to-Crystal Feynman-Kac steering (T2C-FK; Section 2.4) is an inference-time mechanism that makes ALM Gen generate crystals with a requested element set and stoichiometry. ALM Gen is a strong but deliberately weakly-conditioned de-novo sampler: its backbone produces stable cell…
Figure 16: Q-Former producer source-length (context-window) ablation (g=0.5, honest denominator over all 768 atomtxt attempts; rest of the final recipe held fixed). Both ways of feeding the producer more source, widening the window (N=512) or prepending an explicit L_in=32 input-<atoms> segment, collapse raise-E_f direction-correctness below chance (the dashed line at 0.5) and lift lower-E_f, landing at overall dire…
Figure 17: Output-token ordering determines whether ALM Edit follows directional instructions. Direction-correct rate (fraction of generations that moved formation energy E_f the requested way relative to the input) versus classifier-free-guidance strength g, for two output-token orderings: the ALM Edit ordering (composition JSON before the A_i atom tokens, blue, highlighted) and an ablation that teacher-forces the A_i…
Figure 18: Unconditional and conditional outputs are similar for ALM Edit. Left: relative magnitudes (log scale) of setting g to be 0.5 (conditional) or 0 (unconditional). Right: direction-correctness is flat at 0.62–0.64 over a 16× range of g, indicating that the ceiling is a representation-quality limit, not a magnitude limit. begins only once t < T(1−τ_start); with τ_start=0.5 this halves the scoring compute at no …
Figure 19: Model-size scaling across seven property-prediction metrics. Each panel uses the available Qwen3 ALM Core evaluation suite for 0.6B, 1.7B, 4B, 8B, and 14B-sized LMs. The top row reports MAD/MAE skill ratios on LLM4Mat-Bench [26] JARVIS-QETB properties (higher is better). The bottom reports raw MAE on JARVIS-QETB, MatText, and Cantor-HEA properties (lower is better). is parsed from the prompt into the rewa…
Figure 20: Warm-start alignment loss trajectory (K=8). Causal LM loss on ∼1.35M structure-description pairs drops below 0.2 in ∼300 optimizer steps and plateaus near 0.10. B.1.1 Five-bucket training mixture. Stage 1: Alignment data and optimizer. Only P_in is trained, under the standard causal language modeling loss on nearly 1.35M structure–description pairs drawn from LLM4Mat-Bench [26] and the four GPT-Narratives …
Figure 21: Target-property distributions over the training data (formation energy, energy above hull, band gap, density). These set the support of the values ALM is asked to read off and, in the editing phase, to move. size is 256. Validation buckets are partitioned with the same seed via split_seed=42: arXiv and CAMEL hold out 500 rows each, MaScQA holds out 20% stratified by topic (131/650). B.1.2 The URL leak fai…
Figure 22: Representative understanding-phase interactions, one per bucket type (user prompt on the left, ALM response on the right): captioning (describe), property and applications VQA (property_apps), and the text-only science Q&A buckets (arxiv/camel/mascqa). The <atoms> placeholder is expanded inline to OrbV3 node features for the structure-conditioned turns.
Figure 23
Figure 24: URL-leak rate is set by how the arXiv bucket is formatted. is therefore structural rather than data-scale-bound; a different bridge architecture and finetuning method would lead to higher performance. C ALM Bench. We introduce ALM Bench, a benchmark for conditional crystal generation in which the conditioning carries both a structure and a natural-language instruction. It is the first benchmark to score th…
Figure 25: Per-dataset atom-count distributions for the structural training data. The structural buckets follow the LLM4Mat/GPT-Narratives distribution; the generation/editing pairs of Appendix B.2 are capped at ≤20 atoms per cell. (#moved the requested way)/(all scored candidates), with degenerate / NaN-property generations kept in the denominator. Sub-categories. There are three types of directional editing tasks (within…
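The direction-correct rate quoted in this caption, with failed generations kept in the denominator, can be written out as follows. This is a minimal sketch: the function name, the `"lower"`/`"raise"` labels, and the use of `None` or NaN to mark a degenerate generation are illustrative assumptions, not the benchmark's actual interface.

```python
import math


def direction_correct_rate(ef_input, ef_output, directions):
    """Fraction of candidates whose formation energy moved the requested way.

    Candidates with a missing or NaN output property count as failures:
    they stay in the denominator but can never add to the numerator.
    """
    correct = 0
    for e_in, e_out, direction in zip(ef_input, ef_output, directions):
        if e_out is None or math.isnan(e_out):
            continue  # degenerate generation: denominator only
        if direction == "lower" and e_out < e_in:
            correct += 1
        elif direction == "raise" and e_out > e_in:
            correct += 1
    return correct / len(ef_input)


if __name__ == "__main__":
    rate = direction_correct_rate(
        ef_input=[0.0, 0.0, 0.0, 0.0],
        ef_output=[-0.1, 0.2, float("nan"), 0.3],
        directions=["lower", "lower", "raise", "raise"],
    )
    print(rate)  # 2 of 4 candidates moved the requested way
```

Dropping the NaN case from the denominator instead would report 2 of 3 here, which is why the caption's choice of denominator matters when comparing systems with different failure rates.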
Figure 26: ALM-Bench chat examples. D.1 Property prediction metric definitions. MAD / MAE. For a regression property we report the ratio of the test-set Mean Absolute Deviation to the model's Mean Absolute Error, $\mathrm{MAD}/\mathrm{MAE} = \frac{\frac{1}{n}\sum_i |y_i - \bar{y}|}{\frac{1}{n}\sum_i |\hat{y}_i - y_i|}$ (28), where $\bar{y}$ is the test-set mean. This is the scale-free skill score of LLM4Mat-Bench [26]: higher is better, MAD/MAE=1 is no better than the mean predicto…
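The MAD/MAE skill score defined in this caption is short enough to write out directly. The sketch below follows the stated definition (test-set mean absolute deviation divided by the model's mean absolute error); the function name is a hypothetical choice, and a model with zero error would raise `ZeroDivisionError` in this simple version.

```python
def mad_over_mae(y_true, y_pred):
    """Ratio of test-set mean absolute deviation to model mean absolute error."""
    n = len(y_true)
    mean = sum(y_true) / n
    mad = sum(abs(y - mean) for y in y_true) / n
    mae = sum(abs(p - y) for p, y in zip(y_pred, y_true)) / n
    return mad / mae


if __name__ == "__main__":
    y_true = [1.0, 2.0, 3.0]
    # Predicting the test-set mean everywhere scores exactly 1.
    print(mad_over_mae(y_true, [2.0, 2.0, 2.0]))
    # Halving every error relative to the mean predictor doubles the score.
    print(mad_over_mae(y_true, [1.5, 2.0, 2.5]))
```

Because both numerator and denominator carry the units of the property, the ratio is comparable across properties with different scales.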
Figure 27: ALM-Bench chat examples (continued)
Figure 28: Materials-knowledge retention judge: verbatim graded exchanges.
Figure 29: Materials-knowledge retention judge (continued).
Original abstract

Atomistic structure and natural language have long been modeled separately, with language models either calling atomistic models as tools or being fine-tuned on lossy textual encodings that discard atomistic information. We introduce Atomistic Language Models (ALMs) to pursue native multimodality, in which a single language backbone understands atomistic structures, generates materials from natural language, and optimizes crystal structures as instructed by text. By unifying a pretrained atomistic encoder, large language model, and denoising diffusion model through purely continuous projectors and staged training, ALMs achieve state-of-the-art results on crystal structure prediction and de novo generation. ALMs are enabled by a continuous bridge that maps language model embeddings directly into the steering space of atomistic diffusion, and are assisted by Text-to-Crystal Feynman-Kac (T2C-FK), a particle-based sampler that scores partial denoising trajectories to enforce stoichiometric targets at inference time. To evaluate the ability of ALMs to optimize and generate materials from natural-language prompts and 3D atom-coordinate inputs, we introduce ALM Bench, the first benchmark for text-conditioned crystal generation and optimization. Code, training data, and model weights will be released soon.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces Atomistic Language Models (ALMs) that integrate a pretrained atomistic encoder, a large language model, and a denoising diffusion model using purely continuous projectors and staged training. This enables the model to understand atomistic structures from 3D inputs, generate materials from natural language prompts, and optimize crystal structures as instructed by text. The authors claim state-of-the-art results on crystal structure prediction and de novo generation, introduce the Text-to-Crystal Feynman-Kac (T2C-FK) sampler for enforcing stoichiometric targets, and propose the ALM Bench benchmark for evaluating text-conditioned crystal generation and optimization.

Significance. If the results hold and the continuous bridge preserves atomistic information effectively without introducing interface failure modes, this work could represent a significant advance in creating native multimodal models for materials science, moving beyond tool-calling or lossy text encodings. The introduction of a new benchmark and the planned release of code and models would contribute to reproducibility and further research in the field.

major comments (1)
  1. [Abstract] The abstract asserts state-of-the-art results on crystal structure prediction and de novo generation but provides no quantitative metrics, baselines, error bars, or dataset details. This absence makes it impossible to evaluate whether the central claim is supported by the data or experiments.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback. We address the single major comment below and will revise the abstract accordingly to improve clarity and support for our claims.

  1. Referee: [Abstract] The abstract asserts state-of-the-art results on crystal structure prediction and de novo generation but provides no quantitative metrics, baselines, error bars, or dataset details. This absence makes it impossible to evaluate whether the central claim is supported by the data or experiments.

    Authors: We agree that the abstract should include key quantitative results to substantiate the SOTA claims. The full manuscript reports these metrics (including baselines, error bars, and dataset details) in the Experiments section, but we acknowledge the abstract must be self-contained. In the revision we will add concise quantitative highlights, such as the specific performance gains on crystal structure prediction and de novo generation tasks, along with brief dataset references.
    revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The abstract and available description outline an architecture that unifies pretrained components via continuous projectors and staged training, with a new sampler (T2C-FK) and benchmark (ALM Bench). No equations, parameter-fitting steps presented as predictions, or load-bearing self-citations are supplied that would allow any reduction of a claimed result to its own inputs by construction. The derivation chain therefore remains self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the continuous projector and T2C-FK sampler are introduced but their internal details and any fitted hyperparameters are not visible.

pith-pipeline@v0.9.1-grok · 5746 in / 1091 out tokens · 13990 ms · 2026-06-26T14:53:05.411597+00:00 · methodology


Reference graph

Works this paper leans on

79 extracted references · 15 linked inside Pith

  1. [1]

    Brandon M. Wood, Misko Dzamba, Xiang Fu, Meng Gao, Muhammed Shuaibi, Luis Barroso-Luque, Kareem Abdelmaqsoud, Vahe Gharakhanyan, John R. Kitchin, Daniel S. Levine, Kyle Michel, Anuroop Sriram, Taco Cohen, Abhishek Das, Ammar Rizvi, Sushree Jagriti Sahoo, Zachary W. Ulissi, and C. Lawrence Zitnick. Uma: A family of universal models for atoms, 2026

  2. [2]

    Ilyes Batatia, Philipp Benner, Yuan Chiang, Alin M. Elena, Dávid P. Kovács, Janosh Riebesell, Xavier R. Advincula, Mark Asta, Matthew Avaylon, William J. Baldwin, Fabian Berger, Noam Bernstein, Arghya Bhowmik, Filippo Bigi, Samuel M. Blau, Vlad Cărare, Michele Ceriotti, Sanggyu Chong, James P. Darby, Sandip De, Flaviano Della Pia, Volker L. Deringer, R...

  3. [3]

    Orb-v3: atomistic simulation at scale, 2025

    Benjamin Rhodes, Sander Vandenhaute, Vaidotas Šimkus, James Gin, Jonathan Godwin, Tim Duignan, and Mark Neumann. Orb-v3: atomistic simulation at scale, 2025

  4. [4]

    Luis M. Antunes, Keith T. Butler, and Ricardo Grau-Crespo. Crystal structure generation with autoregressive large language modeling. Nature Communications, 15(1):10570, December 2024

  5. [5]

    Plaid++: A preference aligned language model for targeted inorganic materials design, 2026

    Andy Xu, Rohan Desai, Larry Wang, Ethan Ritz, and Gabriel Hope. Plaid++: A preference aligned language model for targeted inorganic materials design, 2026

  6. [6]

    Nate Gruver, Anuroop Sriram, Andrea Madotto, Andrew Gordon Wilson, C. Lawrence Zitnick, and Zachary Ulissi. Fine-tuned language models generate stable inorganic materials as text. In International Conference on Learning Representations (ICLR), 2024. arXiv:2402.04379.

  7. [7]

    Less can be more for predicting properties with large language models

    Nawaf Alampara, Santiago Miret, and Kevin Maik Jablonka. Mattext: Do language models need more than text & scale for materials modeling?, 2024. arXiv:2406.17295; v3 (2025) retitled "Less can be more for predicting properties with large language models"

  8. [8]

    Tanishq Gupta, Mohd Zaki, N. M. Anoop Krishnan, and Mausam. Matscibert: A materials domain language model for text mining and information extraction. npj Computational Materials, 8(1):102, 2022. arXiv:2109.15290

  9. [9]

    Universally converging representations of matter across scientific foundation models, 2025

    Sathya Edamadaka, Soojung Yang, Ju Li, and Rafael Gómez-Bombarelli. Universally converging representations of matter across scientific foundation models, 2025

  10. [10]

    Keisuke Ozawa, Teppei Suzuki, Shunsuke Tonogai, and Tomoya Itakura. Graph-text contrastive learning of inorganic crystal structure toward a foundation model of inorganic materials. Science and Technology of Advanced Materials: Methods, 4(1):2406219, December 2024

  11. [11]

    Towards end-to-end automation of ai research. Nature, 651(8107):914–919, March 2026

    Chris Lu, Cong Lu, Robert Tjarko Lange, Yutaro Yamada, Shengran Hu, Jakob Foerster, David Ha, and Jeff Clune. Towards end-to-end automation of ai research. Nature, 651(8107):914–919, March 2026

  12. [12]

    Crystal diffusion variational autoencoder for periodic material generation

    Tian Xie, Xiang Fu, Octavian-Eugen Ganea, Regina Barzilay, and Tommi Jaakkola. Crystal diffusion variational autoencoder for periodic material generation. In International Conference on Learning Representations (ICLR), 2022. arXiv:2110.06197

  13. [13]

    Crystal structure prediction by joint equivariant diffusion

    Rui Jiao, Wenbing Huang, Peijia Lin, Jiaqi Han, Pin Chen, Yutong Lu, and Yang Liu. Crystal structure prediction by joint equivariant diffusion. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023. arXiv:2309.04475

  14. [14]

    A generative model for inorganic materials design. Nature, 639(8055):624–632, March 2025

    Claudio Zeni, Robert Pinsler, Daniel Zügner, Andrew Fowler, Matthew Horton, Xiang Fu, Zilong Wang, Aliaksandra Shysheya, Jonathan Crabbé, Shoko Ueda, Roberto Sordillo, Lixin Sun, Jake Smith, Bichlien Nguyen, Hannes Schulz, Sarah Lewis, Chin-Wei Huang, Ziheng Lu, Yichi Zhou, Han Yang, Hongxia Hao, Jielan Li, Chunlei Yang, Wenjie Li, Ryota Tomioka, and Tian...

  15. [15]

    Sherry Yang, Simon Batzner, Ruiqi Gao, Muratahan Aykol, Alexander L. Gaunt, Brendan McMorrow, Danilo J. Rezende, Dale Schuurmans, Igor Mordatch, and Ekin D. Cubuk. Generative hierarchical materials search. In Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 2024. arXiv:2409.06762

  16. [16]

    Bowen Deng, Bohan Li, Matthew Cox, Hoje Chun, Juno Nam, Artur Lyssenko, Sathya Edamadaka, Jurgis Ruza, Xiaochen Du, Nofit Segal, Jesus Diaz Sanchez, Mingrou Xie, Ty Perez, Yu Yao, Miguel Steiner, Sauradeep Majumdar, Charles B. Musgrave III, Anirban Chandra, Abhirup Patra, Detlef Hohl, Connor W. Coley, Ju Li, and Rafael Gómez-Bombarelli. Harnessing atomist...

  17. [17]

    Viggo Moro, Charlotte Loh, Rumen Dangovski, Ali Ghorashi, Andrew Ma, Zhuo Chen, Samuel Kim, Peter Y. Lu, Thomas Christensen, and Marin Soljačić. Multimodal foundation models for material property prediction and discovery. Newton, 1(1):100016, March 2025

  18. [18]

    Bridging text and crystal structures: Literature-driven contrastive learning for materials science. Machine Learning: Science and Technology, 6(3):035006, September 2025

    Yuta Suzuki, Tatsunori Taniai, Ryo Igarashi, Kotaro Saito, Naoya Chiba, Yoshitaka Ushiku, and Kanta Ono. Bridging text and crystal structures: Literature-driven contrastive learning for materials science. Machine Learning: Science and Technology, 6(3):035006, September 2025. arXiv:2501.12919

  19. [19]

    Yingheng Tang, Wenbin Xu, Jie Cao, Weilu Gao, Steven Farrell, Benjamin Erichson, Michael W. Mahoney, Andy Nonaka, and Zhi Jackie Yao. A multimodal large language model for materials science. Nature Machine Intelligence, 8(4):588–601, April 2026

  20. [20]

    Jiyu Cui, Fang Wu, Haokai Zhao, Minggao Feng, Xenophon Evangelopoulos, Andrew I. Cooper, and Yejin Choi. L2m3of: A large language multimodal model for metal-organic frameworks, 2025.

  21. [21]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023. arXiv:2304.08485

  22. [22]

    Janus-pro: Unified multimodal understanding and generation with data and model scaling, 2025

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling, 2025

  23. [23]

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 19730–19742. PMLR, 2023. arXiv:...

  24. [24]

    Qwen3 technical report, 2025

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  25. [25]

    Classifier-free diffusion guidance, 2022

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022

  26. [26]

    Llm4mat-bench: benchmarking large language models for materials property prediction

    Andre Niyongabo Rubungo, Kangming Li, Jason Hattrick-Simpers, and Adji Bousso Dieng. Llm4mat-bench: benchmarking large language models for materials property prediction. Machine Learning: Science and Technology, 6(2):020501, 2025. arXiv:2411.00177

  27. [27]

    Siddharth Betala, Samuel P. Gleason, Ali Ramlaoui, Andy Xu, Georgia Channing, Daniel Levy, Clémentine Fourrier, Nikita Kazeev, Chaitanya K. Joshi, Sékou-Oumar Kaba, Félix Therrien, Alex Hernandez-Garcia, Rocío Mercado, N. M. Anoop Krishnan, and Alexandre Duval. Lemat-genbench: A unified evaluation framework for crystal generative models, 2026

  28. [28]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski,...

  29. [29]

    Neural discrete representation learning

    Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 6306–6315, 2017. arXiv:1711.00937

  30. [30]

    Atomic cluster expansion for accurate and transferable interatomic potentials

    Ralf Drautz. Atomic cluster expansion for accurate and transferable interatomic potentials. Physical Review B, 99(1):014104, January 2019

  31. [31]

    Alex M. Ganose and Anubhav Jain. Robocrystallographer: automated crystal structure text descriptions and analysis. MRS Communications, 9(3):874–881, September 2019

  32. [32]

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. arXiv:2106.09685

  33. [33]

    Alexander H. Liu, Andy Ehrenberg, Andy Lo, Clément Denoix, Corentin Barreau, Guillaume Lample, Jean-Malo Delignon, Khyathi Raghavi Chandu, Patrick von Platen, Pavankumar Reddy Muddireddy, Sanchit Gandhi, Soham Ghosh, Srijan Mishra, Thomas Foubert, Abhinav Rastogi, Adam Yang, Albert Q. Jiang, Alexandre Sablayrolles, Amélie Héliou, Amélie Martin, Anmol A...

  34. [34]

    Practical and asymptotically exact conditional sampling in diffusion models

    Luhuan Wu, Brian L. Trippe, Christian A. Naesseth, David M. Blei, and John P. Cunningham. Practical and asymptotically exact conditional sampling in diffusion models. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10-16, 2023, 2023. arXiv...

  35. [35]

    A general framework for inference-time scaling and steering of diffusion models

    Raghav Singhal, Zachary Horvitz, Ryan Teehan, Mengye Ren, Zhou Yu, Kathleen McKeown, and Rajesh Ranganath. A general framework for inference-time scaling and steering of diffusion models. In Proceedings of the 42nd International Conference on Machine Learning (ICML 2025), PMLR 267, pages 55810–55827, 2025. arXiv:2501.06848

  36. [36]

    Anubhav Jain, Shyue Ping Ong, Geoffroy Hautier, Wei Chen, William Davidson Richards, Stephen Dacek, Shreyas Cholia, Dan Gunter, David Skinner, Gerbrand Ceder, and Kristin A. Persson. Commentary: The materials project: A materials genome approach to accelerating materials innovation. APL Materials, 1(1):011002, 2013

  37. [37]

    Scaling deep learning for materials discovery

    Amil Merchant, Simon Batzner, Samuel S. Schoenholz, Muratahan Aykol, Gowoon Cheon, and Ekin Dogus Cubuk. Scaling deep learning for materials discovery. Nature, 624(7990):80–85, December 2023

  38. [38]

    1.5 million materials narratives generated by chatbots

    Yang Jeong Park, Sung Eun Jerng, Sungroh Yoon, and Ju Li. 1.5 million materials narratives generated by chatbots. Scientific Data, 11(1):1060, September 2024

  39. [39]

    Crystalreasoner: Reasoning and rl for property-conditioned crystal structure generation

    Yuyang Wu, Stefano Falletta, Delia McGrath, and Sherry Yang. Crystalreasoner: Reasoning and rl for property-conditioned crystal structure generation, 2026

  40. [40]

    Camel: Communicative agents for "mind" exploration of large language model society

    Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large language model society. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023. arXiv:2303.17760

  41. [41]

    Kamal Choudhary, Kevin F. Garrity, Andrew C. E. Reid, Brian DeCost, Adam J. Biacchi, Angela R. Hight Walker, Zachary Trautt, Jason Hattrick-Simpers, A. Gilad Kusne, Andrea Centrone, Albert Davydov, Jie Jiang, Ruth Pachter, Gowoon Cheon, Evan Reed, Ankit Agrawal, Xiaofeng Qian, Vinit Sharma, Houlong Zhuang, Sergei V. Kalinin, Bobby G. Sumpter, Ghanshyam P...

  42. [42]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling (COLM), 2024. arXiv:2311.12022

  43. [43]

    Benjamin Kurt Miller, Ricky T. Q. Chen, Anuroop Sriram, and Brandon M. Wood. Flowmm: Generating materials with riemannian flow matching. In Proceedings of the 41st International Conference on Machine Learning (ICML), pages 35664–35686, 2024. arXiv:2406.04713

  44. [44]

    Crystalflow: a flow-based generative model for crystalline materials

    Xiaoshan Luo, Zhenyu Wang, Qingchang Wang, Xuechen Shao, Jian Lv, Lei Wang, Yanchao Wang, and Yanming Ma. Crystalflow: a flow-based generative model for crystalline materials. Nature Communications, 16(1):9267, 2025. arXiv:2412.11693

  45. [45]

    Open materials generation with stochastic interpolants

    Philipp Höllmer, Thomas Egg, Maya M. Martirossyan, Eric Fuemmeler, Zeren Shui, Amit Gupta, Pawan Prakash, Adrian Roitberg, Mingjie Liu, George Karypis, Mark Transtrum, Richard G. Hennig, Ellad B. Tadmor, and Stefano Martiniani. Open materials generation with stochastic interpolants. In Proceedings of the 42nd International Conference on Machine Learning (I...

  46. [46]

    Multimodal crystal flow: Any-to-any modality generation for unified crystal modeling

    Kiyoung Seong, Sungsoo Ahn, Sehui Han, and Changyoung Park. Multimodal crystal flow: Any-to-any modality generation for unified crystal modeling, 2026

  47. [47]

    All that structure matches does not glitter

    Maya M. Martirossyan, Thomas Egg, Philipp Hoellmer, George Karypis, Mark Transtrum, Adrian Roitberg, Mingjie Liu, Richard G. Hennig, Ellad B. Tadmor, and Stefano Martiniani. All that structure matches does not glitter. In Advances in Neural Information Processing Systems 39 (NeurIPS 2025) Datasets and Benchmarks Track, 2025. arXiv:2509.12178

  48. [48]

    Wyckoff transformer: Generation of symmetric crystals

    Nikita Kazeev, Wei Nong, Ignat Romanov, Ruiming Zhu, Andrey Ustyuzhanin, Shuya Yamazaki, and Kedar Hippalgaonkar. Wyckoff transformer: Generation of symmetric crystals. In Proceedings of the 42nd International Conference on Machine Learning (ICML), pages 29495–29526, 2025. arXiv:2503.02407

  49. [49]

    Emergent abilities of large language models

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models, 2022

  50. [50]

    Ranking the information content of distance measures

    Aldo Glielmo, Claudio Zeni, Bingqing Cheng, Gábor Csányi, and Alessandro Laio. Ranking the information content of distance measures. PNAS Nexus, 1(2):pgac039, 05 2022

  51. [51]

    Transformers discover molecular structure without graph priors

    Tobias Kreiman, Yutong Bai, Fadi Atieh, Elizabeth Weaver, Eric Qu, and Aditi S. Krishnapriyan. Transformers discover molecular structure without graph priors, 2025

  52. [52]

    Position: The platonic representation hypothesis

    Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. Position: The platonic representation hypothesis. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024, volume 235 of Proceedings of Machine Learning Research, pages 20617–20642. PMLR, 2024. arXiv:2405.07987

  53. [53]

    Smiles, a chemical language and information system

    David Weininger. Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1):31–36, February 1988

  54. [54]

    Self-referencing embedded strings (selfies): A 100% robust molecular string representation

    Mario Krenn, Florian Häse, AkshatKumar Nigam, Pascal Friederich, and Alán Aspuru-Guzik. Self-referencing embedded strings (selfies): A 100% robust molecular string representation. Machine Learning: Science and Technology, 1(4):045024, December 2020. arXiv:1905.13741

  55. [55]

    Multi-modal molecule structure–text model for text-based retrieval and editing

    Shengchao Liu, Weili Nie, Chengpeng Wang, Jiarui Lu, Zhuoran Qiao, Ling Liu, Jian Tang, Chaowei Xiao, and Animashree Anandkumar. Multi-modal molecule structure–text model for text-based retrieval and editing. Nature Machine Intelligence, 5(12):1447–1457, 2023. arXiv:2212.10789

  56. [56]

    Llm-fusion: A novel multimodal fusion model for accelerated material discovery

    Onur Boyar, Indra Priyadarsini, Seiji Takeda, and Lisa Hamada. Llm-fusion: A novel multimodal fusion model for accelerated material discovery, 2025

  57. [57]

    Towards 3d molecule-text interpretation in language models

    Sihang Li, Zhiyuan Liu, Yanchen Luo, Xiang Wang, Xiangnan He, Kenji Kawaguchi, Tat-Seng Chua, and Qi Tian. Towards 3d molecule-text interpretation in language models. In International Conference on Learning Representations (ICLR), 2024. arXiv:2401.13923

  58. [58]

    Translation between molecules and natural language

    Carl Edwards, Tuan Lai, Kevin Ros, Garrett Honke, Kyunghyun Cho, and Heng Ji. Translation between molecules and natural language. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 375–413, 2022. arXiv:2204.11817

  59. [59]

    A merged molecular representation learning for molecular properties prediction with a web-based service

    Hyunseob Kim, Jeongcheol Lee, Sunil Ahn, and Jongsuk Ruth Lee. A merged molecular representation learning for molecular properties prediction with a web-based service. Scientific Reports, 11(1):11028, May 2021

  60. [60]

    Can large language models empower molecular property prediction?

    Chen Qian, Huayi Tang, Zhirui Yang, Hong Liang, and Yong Liu. Can large language models empower molecular property prediction?, 2023

  61. [61]

    Multimodal fusion with relational learning for molecular property prediction

    Zhengyang Zhou, Yunrui Li, Pengyu Hong, and Hao Xu. Multimodal fusion with relational learning for molecular property prediction. Communications Chemistry, 8(1):200, July 2025

  62. [62]

    Llm-prop: predicting the properties of crystalline materials using large language models

    Andre Niyongabo Rubungo, Craig Arnold, Barry P. Rand, and Adji Bousso Dieng. Llm-prop: predicting the properties of crystalline materials using large language models. npj Computational Materials, 11(1):186, June 2025

  63. [63]

    Diffusion models beat gans on image synthesis

    Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat gans on image synthesis. In Advances in Neural Information Processing Systems 34 (NeurIPS 2021), pages 8780–8794, 2021

  64. [64]

    Perception encoder: The best visual embeddings are not at the output of the network

    Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, Junke Wang, Marco Monteiro, Hu Xu, Shiyu Dong, Nikhila Ravi, Daniel Li, Piotr Dollár, and Christoph Feichtenhofer. Perception encoder: The best visual embeddings are not at the output of the network. In Advances in N...

  65. [65]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26296–26306, 2024. arXiv:2310.03744

  66. [66]

    Perceiver io: A general architecture for structured inputs & outputs

    Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier J. Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, and João Carreira. Perceiver io: A general architecture for structured inputs & outputs. In International Conference on Lea...

  67. [67]

    Finite scalar quantization: Vq-vae made simple

    Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: Vq-vae made simple. In International Conference on Learning Representations (ICLR), 2024. arXiv:2309.15505

  68. [68]

    Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models

    Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models, 2023

  69. [69]

    Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg

    Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. InAdvances in Neural Information Processing Systems (NeurIPS), volume 34, pages 17981–17993, 2021. arXiv:2107.03006

  70. [70]

    Generating images with multimodal language models

    Jing Yu Koh, Daniel Fried, and Ruslan Salakhutdinov. Generating images with multimodal language models. In Advances in Neural Information Processing Systems (NeurIPS), volume 36,

  71. [71]

    Ms-diffusion: Multi-subject zero-shot image personalization with layout guidance

    Xierui Wang, Siming Fu, Qihan Huang, Wanggui He, and Hao Jiang. Ms-diffusion: Multi-subject zero-shot image personalization with layout guidance. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. arXiv:2406.07209

  73. [73]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 3813–3824, 2023. arXiv:2302.05543

  74. [74]

    Film: Visual reasoning with a general conditioning layer

    Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018

  75. [75]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019. arXiv:1711.05101

  76. [76]

    Crys-jepa: Accelerating crystal discovery via embedding screening and generative refinement

    Nian Liu, Nikita Kazeev, Stephen Gregory Dale, Artem Maevskiy, Yuwei Zeng, Ryoji Kubo, Pengru Huang, Thomas Laurent, Yann LeCun, Kostya S. Novoselov, and Xavier Bresson. Crys-jepa: Accelerating crystal discovery via embedding screening and generative refinement, 2026

  77. [77]

    n=K generations per row

    Alongside the mean score (normalized to [0,1] in Table 1) we track an independent degeneracy diagnostic, loop-rate: the fraction of answers whose most-frequent 4-gram repeats ≥4 times. Figures 28 and 29 demonstrate examples of the LLM judge.

    D.2 Crystal structure prediction metric details

    This section fixes the exact, reproducible definitions of the CSP me...

  78. [78]

    partially correct; second oxide is incomplete and lacks an application

    1. Barium titanate (BaTiO3): used as a dielectric in multilayer ceramic capacitors (MLCCs) for electronic devices, thanks to its high dielectric constant and low loss.
    2. Lead zirconate titanate (Pb(Zr,Ti)O3 or PZT): another well-known perovskite-structured oxide, lead zirconate titanate,
    ALM Edit (full-FT) - judge: “partially correct; second oxide is incomp...

  79. [79]

    second example is repetitive, incoherent

    1. Barium Titanate (BaTiO3): used as a dielectric material in capacitors due to its high permittivity.
    2. Perovskite Solar Cells (e.g., MAPbI MAPbI MAPbI MAPbI MAPbI MAPbI MAPbI MAPbI MAPb MAPb MAPb MAPb MAPb MAPb MAPb MAPb MAPb MAPb
    ALM Gen (bridge-LoRA) - loop detected (4-gram repeats >= 4x); judge: “second example is repetitive, incoherent”
    LLM judge: 1/2...