ReadMOF: Structure-Free Semantic Embeddings from Systematic MOF Nomenclature for Machine Learning

Ashleigh M. Chester; Bartosz Mazur; Cameron Wilson; Kewei Zhu; Peyman Z. Moghadam; Yi Li

arXiv: 2604.10568 · v1 · submitted 2026-04-12 · cs.LG · cond-mat.mtrl-sci


Kewei Zhu, Cameron Wilson, Bartosz Mazur, Yi Li, Ashleigh M. Chester, Peyman Z. Moghadam

Pith reviewed 2026-05-10 16:34 UTC · model grok-4.3


keywords metal-organic frameworks · MOF nomenclature · language model embeddings · structure-property relationships · materials informatics · IUPAC names · geometry-independent modeling · machine learning


The pith

Language models turn systematic MOF names into vector embeddings that perform comparably to structure-based descriptors on property tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

ReadMOF converts the systematic IUPAC-style names of metal-organic frameworks into numerical vector embeddings using pretrained language models. These embeddings capture enough structural and compositional detail to support machine learning applications such as property prediction, material similarity searches, and clustering, reaching performance levels close to methods that require full atomic coordinates and connectivity. A sympathetic reader would care because this removes dependence on crystallographic data, which is frequently unavailable or costly to compute, and allows direct use of existing name databases for larger-scale materials analysis. The method further enables chemically grounded reasoning when the embeddings feed into large language models using only text input.

Core claim

ReadMOF is a nomenclature-based framework that employs pretrained language models to transform systematic MOF names from the Cambridge Structural Database into vector embeddings. These embeddings closely approximate traditional structure-based descriptors and support geometry-independent tasks including property prediction, similarity retrieval, and clustering at comparable accuracy. Integration with large language models further permits chemically meaningful reasoning from textual input alone, establishing structured chemical language as a scalable alternative to conventional molecular representations.

What carries the argument

Pretrained language models that map systematic MOF nomenclature directly to semantic vector embeddings serving as proxies for geometry-dependent descriptors.
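
The name-to-vector interface described above can be illustrated with a minimal sketch. It is not the paper's method: the pretrained language model is replaced here by a toy hashed character n-gram encoder so that the example runs without any model download. The function names (`embed_name`, `cosine`), the dimension, and the name strings are all hypothetical and chosen only for illustration.

```python
import hashlib
import math


def embed_name(name: str, dim: int = 256, n: int = 3) -> list[float]:
    """Toy stand-in for a language-model encoder: hashed character n-grams.

    NOT the paper's encoder; it only shows the name -> unit vector interface.
    """
    text = name.lower()
    vec = [0.0] * dim
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        # Use a stable hash; Python's built-in hash() is salted per process.
        bucket = int(hashlib.md5(gram.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; inputs are unit-length, so this is a dot product."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical, simplified name strings written for illustration only.
names = [
    "catena-(bis(mu-benzene-1,4-dicarboxylato)-zinc)",
    "catena-(bis(mu-benzene-1,3-dicarboxylato)-zinc)",
    "catena-(tris(mu-imidazolato)-cobalt)",
]
vectors = [embed_name(s) for s in names]

# Similarity of the first name to a near-identical name and to a different one.
print(round(cosine(vectors[0], vectors[1]), 3))
print(round(cosine(vectors[0], vectors[2]), 3))
```

In this toy setting, names that differ by a single locant share almost all of their character n-grams and therefore land close together. A pretrained language model would be expected to capture richer chemical regularities than this surface-level overlap.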

If this is right

Property prediction for MOFs becomes feasible using only name strings from existing databases.
Similarity retrieval and clustering of MOFs can proceed without computing or storing 3D structures.
Large language models gain chemically coherent reasoning capabilities when supplied with name embeddings instead of coordinate data.
Materials informatics gains a scalable, text-only representation that avoids geometry computations for large libraries.
Language-driven discovery pipelines can now operate directly on standardized chemical nomenclature.
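
As a rough illustration of the first two consequences, the sketch below runs similarity retrieval and a nearest-neighbour property estimate on precomputed name embeddings. The vectors, identifiers, and property values are made up, and the similarity-weighted k-nearest-neighbour estimator is an assumption chosen for brevity, not a model taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query, library, k=2):
    """Return the k (similarity, identifier) pairs most similar to the query."""
    scored = [(cosine(query, vec), ident) for ident, vec in library.items()]
    return sorted(scored, reverse=True)[:k]


def knn_predict(query, library, targets, k=2):
    """Similarity-weighted mean of the targets of the k nearest neighbours."""
    neighbours = top_k(query, library, k)
    total = sum(max(sim, 0.0) for sim, _ in neighbours)
    if total == 0.0:
        # Fall back to a plain mean when no neighbour has positive similarity.
        return sum(targets[i] for _, i in neighbours) / len(neighbours)
    return sum(max(sim, 0.0) * targets[i] for sim, i in neighbours) / total


# Made-up 3-dimensional embeddings and property values, for illustration only.
library = {
    "mof_a": [1.0, 0.0, 0.0],
    "mof_b": [0.9, 0.1, 0.0],
    "mof_c": [0.0, 1.0, 0.0],
}
targets = {"mof_a": 10.0, "mof_b": 12.0, "mof_c": 30.0}
query = [1.0, 0.05, 0.0]

nearest = [ident for _, ident in top_k(query, library)]
estimate = knn_predict(query, library, targets)
print(nearest, estimate)
```

No 3D structure is touched at any point: both retrieval and the estimate depend only on the stored vectors, which is the property the list above relies on.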

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same name-to-embedding route could apply to other systematically named material families such as zeolites or covalent organic frameworks where full structures are sparse.
Name embeddings might serve as cheap priors in hybrid models that later incorporate partial experimental data.
Direct tests on newly synthesized MOFs with measured properties not present in the CSD would check whether the embeddings generalize beyond database-internal correlations.
Integration with reaction or synthesis text could extend the approach from static structures to process-aware predictions.

Load-bearing premise

Systematic IUPAC-style names for MOFs already encode enough structural and compositional information to model key property relationships without any atomic coordinates or connectivity information.

What would settle it

A benchmark comparison in which name-derived embeddings produce materially lower accuracy than geometry-based descriptors on a held-out set of MOF property predictions such as gas adsorption capacity or surface area.
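
One way to frame such a comparison is sketched below: score name-based and geometry-based predictions against the same held-out values and report the gap. The numbers are invented for illustration, and the choice of MAE and R² as metrics and the helper name `accuracy_gap` are assumptions, not the paper's protocol.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def accuracy_gap(y_true, pred_name, pred_geom):
    """A positive mae_gap means the name-based model is worse on this set."""
    mae_name = mae(y_true, pred_name)
    mae_geom = mae(y_true, pred_geom)
    return {
        "mae_name": mae_name,
        "mae_geom": mae_geom,
        "mae_gap": mae_name - mae_geom,
        "r2_name": r2(y_true, pred_name),
        "r2_geom": r2(y_true, pred_geom),
    }


# Made-up held-out property values and predictions, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
pred_name = [1.5, 2.5, 2.5, 3.5]
pred_geom = [1.0, 2.0, 3.0, 4.5]

report = accuracy_gap(y_true, pred_name, pred_geom)
print(report)
```

What counts as a "materially lower" accuracy is a judgment the benchmark designer has to fix in advance, for example as a threshold on `mae_gap` relative to the spread of the target property.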

Original abstract

Systematic chemical names, such as IUPAC-style nomenclature for metal-organic frameworks (MOFs), contain rich structural and compositional information in a standardized textual format. Here we introduce ReadMOF, which is, to our knowledge, the first nomenclature-based machine learning framework that leverages these names to model structure-property relationships without requiring atomic coordinates or connectivity graphs. By employing pretrained language models, ReadMOF converts systematic MOF names from the Cambridge Structural Database (CSD) into vector embeddings that closely represent traditional structure-based descriptors. These embeddings enable applications in materials informatics, including property prediction, similarity retrieval, and clustering, with performance comparable to geometry-dependent methods. When combined with large language models, ReadMOF also establishes chemically meaningful reasoning ability with textual input only. Our results show that structured chemical language, interpreted through modern natural language processing techniques, can provide a scalable, interpretable, and geometry-independent alternative to conventional molecular representations. This approach opens new opportunities for language-driven discovery in materials science.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

ReadMOF's name-based embeddings offer a practical text-only route for MOF work but probably cannot fully substitute for geometry on properties that depend on pore metrics or bond details.


The core claim is that pretrained language models can turn systematic MOF names from the CSD into embeddings that perform comparably to structure-based descriptors for prediction, retrieval, and clustering. The paper positions this as the first such nomenclature-driven, structure-free framework for MOFs. That positioning looks accurate based on the abstract and the absence of prior citations to equivalent methods. The approach is genuinely new in its direct use of IUPAC-style names without any coordinate input or graph construction.

It also does something useful by highlighting how existing database names can support quick, scalable tasks like similarity search or initial screening when full structures are unavailable or expensive to process. The combination with large language models for textual reasoning is a reasonable extension that could appeal to users who prefer to stay in text space. Credit is due for keeping the method simple and for grounding it in real CSD nomenclature rather than inventing new labels. The experiments, if they show reasonable numbers on standard benchmarks, would make the work worth testing in practice.

The main soft spot is that systematic names specify composition, metal nodes, linkers, and basic topology but routinely omit quantitative geometric parameters such as bond lengths, angles, unit-cell dimensions, and pore-size distributions. Many MOF properties, especially adsorption and diffusion, are sensitive to those details. If the reported comparability holds mainly on composition-driven tasks while weakening on geometry-sensitive ones, the substitution claim does not generalize as stated. The abstract gives no datasets, baselines, error bars, or validation protocols, so it is impossible to judge how large the gap actually is. A reader would need to see ablations that isolate geometry effects and direct comparisons against simple composition-only vectors.

This paper is aimed at materials-informatics groups that already work with MOF databases and want lower-cost alternatives to full structure featurization. Anyone building text-based pipelines for chemical reasoning would also find the setup worth examining. It deserves a serious referee because the idea is fresh, the data source is public, and the practical payoff is clear even if the performance claims require tighter evidence. I would send it for review once the authors supply the missing experimental controls and property breakdowns.

Referee Report

2 major / 0 minor

Summary. The paper introduces ReadMOF, a framework that uses pretrained language models to generate vector embeddings from systematic IUPAC-style names of metal-organic frameworks (MOFs) sourced from the Cambridge Structural Database (CSD). These embeddings are claimed to closely approximate traditional structure-based descriptors, enabling machine learning tasks such as property prediction, similarity retrieval, and clustering without the need for atomic coordinates or connectivity graphs. Additionally, when integrated with large language models, it allows for chemically meaningful reasoning using only textual input. The approach is presented as a scalable and geometry-independent alternative to conventional molecular representations in materials informatics.

Significance. If the central claims are substantiated with rigorous validation, this work could meaningfully advance materials informatics by showing that standardized chemical nomenclature encodes sufficient information for effective ML representations of MOFs. It would reduce reliance on computationally expensive structure computations and enable text-only workflows, particularly valuable for large databases like the CSD. The use of off-the-shelf pretrained models is a practical strength that could facilitate adoption.

major comments (2)

Abstract: The core claim that name-derived embeddings 'closely represent traditional structure-based descriptors' and deliver 'performance comparable to geometry-dependent methods' is load-bearing but unsupported by any reported datasets, baselines, metrics (e.g., R², MAE with error bars), or validation protocols in the provided text. This prevents assessment of whether the substitution holds beyond composition-driven tasks.
Results/evaluation section: Systematic MOF nomenclature encodes metal nodes, linkers, and topology but omits quantitative geometric parameters (bond lengths, angles, unit-cell metrics, pore-size distributions) that govern many properties. The manuscript must explicitly test and report whether prediction performance remains comparable on geometry-sensitive tasks (e.g., gas adsorption isotherms) versus purely compositional ones; without this disaggregation, the geometry-independent claim cannot be evaluated.
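
The disaggregation requested in the second comment could be reported along the following lines. This is a hedged sketch with invented records: the category labels, the record layout, and the helper name `mae_by_category` are assumptions made for illustration, and the numbers do not come from the manuscript.

```python
from collections import defaultdict


def mae_by_category(records):
    """Per-category MAE for name-based and geometry-based predictions.

    Each record is (category, true value, name-based prediction,
    geometry-based prediction).
    """
    errors = defaultdict(lambda: {"name": [], "geom": []})
    for category, y, p_name, p_geom in records:
        errors[category]["name"].append(abs(y - p_name))
        errors[category]["geom"].append(abs(y - p_geom))
    report = {}
    for category, errs in errors.items():
        mae_name = sum(errs["name"]) / len(errs["name"])
        mae_geom = sum(errs["geom"]) / len(errs["geom"])
        report[category] = {
            "mae_name": mae_name,
            "mae_geom": mae_geom,
            "gap": mae_name - mae_geom,  # positive: name-based model is worse
        }
    return report


# Made-up records, for illustration only.
records = [
    ("compositional", 2.0, 2.25, 2.0),
    ("compositional", 4.0, 3.75, 4.25),
    ("geometry_sensitive", 10.0, 12.0, 10.5),
    ("geometry_sensitive", 20.0, 17.0, 20.5),
]

report = mae_by_category(records)
for category, row in report.items():
    print(category, row)
```

A single pooled metric would hide exactly the pattern the referee is asking about, which is why the gap is kept separate per category rather than averaged across tasks.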

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We have carefully reviewed each major comment and provide point-by-point responses below. We agree with the main observations and will revise the manuscript to strengthen the presentation of our claims and evaluations.


Referee: Abstract: The core claim that name-derived embeddings 'closely represent traditional structure-based descriptors' and deliver 'performance comparable to geometry-dependent methods' is load-bearing but unsupported by any reported datasets, baselines, metrics (e.g., R², MAE with error bars), or validation protocols in the provided text. This prevents assessment of whether the substitution holds beyond composition-driven tasks.

Authors: We acknowledge that the abstract, as written, summarizes the high-level claims without embedding specific quantitative support. The full manuscript contains a Results section with CSD-derived datasets, direct comparisons to structure-based descriptors (e.g., via graph neural networks and geometric fingerprints), and reported metrics including R², MAE, and error bars across multiple tasks. To make these claims immediately verifiable from the abstract, we will revise it to include concise references to key performance numbers and the validation protocol used.

Revision: yes

Referee: Results/evaluation section: Systematic MOF nomenclature encodes metal nodes, linkers, and topology but omits quantitative geometric parameters (bond lengths, angles, unit-cell metrics, pore-size distributions) that govern many properties. The manuscript must explicitly test and report whether prediction performance remains comparable on geometry-sensitive tasks (e.g., gas adsorption isotherms) versus purely compositional ones; without this disaggregation, the geometry-independent claim cannot be evaluated.

Authors: We agree that IUPAC-style names do not encode explicit geometric quantities and that this is a central consideration for the geometry-independent claim. Our existing evaluations focus on tasks where name-encoded compositional and topological information is sufficient (property prediction, similarity search, and clustering). To address the request for disaggregation, we will expand the evaluation section with new comparisons on geometry-sensitive tasks such as gas adsorption, reporting performance separately for compositional versus geometry-dependent properties and discussing the observed differences.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; method uses external pretrained language models on CSD nomenclature without self-referential fitting or load-bearing self-citations.


The paper's central approach applies off-the-shelf pretrained language models to systematic MOF names drawn from the external Cambridge Structural Database, producing embeddings for downstream tasks. No equations, parameter-fitting procedures, or derivation steps are described that reduce the claimed equivalence between name-based embeddings and structure-based descriptors to a tautology or to a fit performed on the target data itself. The abstract and available text invoke no uniqueness theorems, ansatzes smuggled via self-citation, or renamings of known results; performance comparability is presented as an empirical outcome rather than a definitional necessity. This leaves the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that nomenclature text is information-rich enough for LM embeddings to substitute for geometry; no free parameters, invented entities, or additional axioms are stated.

axioms (1)

domain assumption: Pretrained language models extract chemically meaningful representations from systematic MOF nomenclature.
Invoked as the mechanism converting names to structure-representative embeddings.


