Interpretable Inverse Design of Metal-Organic Frameworks with Large Language Model Agents

Jihan Kim; Kyungmin Nam; Seunghee Han

arXiv: 2606.29459 · v1 · pith:PJVDI7YDnew · submitted 2026-06-28 · cs.LG · cond-mat.mtrl-sci · cs.AI · cs.CL

Pith reviewed 2026-06-30 07:58 UTC · model grok-4.3

keywords: metal-organic frameworks · inverse design · large language models · interpretable design · closed-loop optimization · simulation-guided search · adsorption and separation

The pith

Language-model agents design top metal-organic frameworks by iterating on chemistry hypotheses and validating them in simulation within 400 evaluations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that language-model agents can carry out inverse design of MOFs by generating interpretable hypotheses about metal nodes, linkers, pore geometry, and functional groups, then translating those into candidate structures for simulation testing. Two agents work in a closed loop across ten iterations, and each hypothesis is tested through four diagnostic beams that apply different constraint subsets, so performance differences between beams reveal which design element matters. This concentrates effort on high-performing structures for adsorption, separation, and electronic-structure tasks without any per-objective model training, and it also produces new MOFs de novo that adapt to requested conditions. A reader would care because the approach stays transparent while outperforming random search and a genetic algorithm within a budget of 400 expensive property evaluations.

Core claim

LLM4MOF shows that language-model agents can run interpretable, simulation-grounded inverse design without training a model per objective. One agent proposes hypotheses over metal nodes, linkers, pore geometry, and functional chemistry; a second turns them into constraints that select MOFs; each hypothesis is tested through four diagnostic beams that apply different constraint subsets so comparing beams isolates whether geometry, chemistry, or metal choice drives performance. Even blind to the global property landscape, the loop concentrates on top structures across six tasks within 400 evaluations and generates new MOFs de novo that adapt geometry to each requested condition.
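The shape of the closed loop can be sketched as follows. This is a toy illustration, not the paper's implementation: the ten iterations and four beams come from the description above, but the even split of ten evaluations per beam (so that 10 × 4 × 10 = 400), the `propose_hypothesis` and `simulate` functions, and the single pore-size variable are all assumptions made for the example. The LLM agents and the molecular simulation are replaced by simple stand-ins.

```python
import random

ITERATIONS = 10            # stated in the paper's description
BEAMS_PER_HYPOTHESIS = 4   # stated in the paper's description
PER_BEAM = 10              # assumption: even split giving 10 * 4 * 10 = 400


def propose_hypothesis(history, rng):
    """Stand-in for the hypothesis agent: returns a target pore size."""
    if not history:
        return rng.uniform(4.0, 20.0)
    best_candidate, _ = max(history, key=lambda item: item[1])
    # Refine around the best structure seen so far.
    return best_candidate + rng.uniform(-1.0, 1.0)


def simulate(pore_size):
    """Toy stand-in for a molecular simulation; the score peaks at 8.0."""
    return -(pore_size - 8.0) ** 2


def run_campaign(seed=0):
    rng = random.Random(seed)
    history = []  # (candidate pore size, simulated score)
    evaluations = 0
    for _ in range(ITERATIONS):
        target = propose_hypothesis(history, rng)
        # In the real framework each beam applies a different constraint
        # subset; here the beams are identical placeholders.
        for _beam in range(BEAMS_PER_HYPOTHESIS):
            for _ in range(PER_BEAM):
                candidate = target + rng.gauss(0.0, 0.5)
                history.append((candidate, simulate(candidate)))
                evaluations += 1
    best_score = max(score for _, score in history)
    return evaluations, best_score


evaluations, best_score = run_campaign(seed=0)
```

The point of the sketch is only the control flow: propose, evaluate under a fixed budget, and feed the results back into the next proposal.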

What carries the argument

The LLM4MOF closed-loop framework with a hypothesis-proposing agent, a constraint-translating agent, and four diagnostic beams that test constraint subsets to isolate performance drivers.
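A minimal sketch of how diagnostic beams might select candidates from one hypothesis. The paper states only that four beams apply different constraint subsets; the specific beam layout below (`full`, `geometry_only`, `chemistry_only`, `metal_only`), the `MOF` record, the example constraints, and the four-structure pool are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class MOF:
    name: str
    metal: str
    linker_group: str     # functional group on the linker
    pore_diameter: float  # angstroms


Constraint = Callable[[MOF], bool]

# Hypothetical constraints derived from a single design hypothesis.
CONSTRAINTS: Dict[str, Constraint] = {
    "geometry": lambda m: 6.0 <= m.pore_diameter <= 10.0,
    "chemistry": lambda m: m.linker_group == "NH2",
    "metal": lambda m: m.metal in {"Cu", "Zn"},
}

# Assumed beam layout: one beam with every constraint, three with one each.
BEAMS: Dict[str, Tuple[str, ...]] = {
    "full": ("geometry", "chemistry", "metal"),
    "geometry_only": ("geometry",),
    "chemistry_only": ("chemistry",),
    "metal_only": ("metal",),
}


def select(pool: List[MOF], beam: str) -> List[MOF]:
    """Keep the structures that satisfy every constraint active in a beam."""
    active = [CONSTRAINTS[key] for key in BEAMS[beam]]
    return [m for m in pool if all(check(m) for check in active)]


pool = [
    MOF("A", "Cu", "NH2", 8.0),
    MOF("B", "Zn", "H", 7.0),
    MOF("C", "Zr", "NH2", 12.0),
    MOF("D", "Cu", "H", 15.0),
]

selected = {beam: [m.name for m in select(pool, beam)] for beam in BEAMS}
```

Comparing the simulated performance of the structures chosen by each beam is then what would indicate which constraint family carries the hypothesis.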

If this is right

The framework locates top-performing structures for adsorption, separation, and electronic-structure properties across six tasks within 400 evaluations even without access to the full property landscape.
It generates and validates entirely new MOFs in live simulation, adapting their geometry to match each requested condition.
The loop outperforms random search and a genetic algorithm at a cost of roughly one dollar per campaign.
Comparing results across the four diagnostic beams reveals whether geometry, chemistry, or metal choice is responsible for success on a given task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same agent-plus-beam structure could be tested on other classes of porous materials such as covalent organic frameworks where similar combinatorial spaces exist.
Replacing the simulation backend with experimental measurements would turn the loop into a physical discovery system, though the paper does not demonstrate that step.
Because the hypotheses remain human-readable, the method could supply starting points for human chemists to refine before committing to synthesis.

Load-bearing premise

Language-model agents can produce chemically valid, non-trivial design hypotheses and constraint sets whose performance differences can be isolated by the four diagnostic beams and whose simulation results reliably reflect real material behavior.

What would settle it

Running the framework on a known small MOF database, then checking whether the structures it ranks highest after 400 evaluations match the actual top performers found by exhaustive enumeration of the same database.
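One way to score such a check is the overlap between the top-k structures the search found and the true top-k from exhaustive enumeration. The sketch below assumes higher property values are better; the function names, the five-structure database, and the choice of k are illustrative and do not come from the paper.

```python
from typing import Dict, List


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Names of the k highest-scoring structures (higher assumed better)."""
    return sorted(scores, key=lambda name: scores[name], reverse=True)[:k]


def top_k_overlap(exhaustive: Dict[str, float],
                  searched: Dict[str, float],
                  k: int) -> float:
    """Fraction of the true top-k that also appears in the search's top-k."""
    true_top = set(top_k(exhaustive, k))
    found_top = set(top_k(searched, k))
    return len(true_top & found_top) / k


# Hypothetical database with exhaustively computed property values.
exhaustive = {"m1": 0.9, "m2": 0.8, "m3": 0.5, "m4": 0.1, "m5": 0.7}

# Structures the search actually evaluated within its budget.
searched = {name: exhaustive[name] for name in ("m1", "m5", "m4", "m3")}

overlap = top_k_overlap(exhaustive, searched, k=3)  # 2 of the true top 3 found
```

An overlap near 1.0 would support the claim that the loop concentrates on the true top performers; a value near the random-search baseline would undercut it.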

Original abstract

Inverse design of metal-organic frameworks (MOFs) requires searching a combinatorially vast space where property labels are expensive and most machine-learning models reveal little about why a structure succeeds. We introduce LLM4MOF, a closed-loop framework in which language-model agents reason about chemistry, build candidate MOFs, and test them in simulation, refining hypotheses over ten autonomous iterations. One agent proposes interpretable design hypotheses over metal nodes, linkers, pore geometry, and functional chemistry, and a second translates them into constraints that select candidate MOFs, each made of a metal node, organic linker, and matching topology. Each hypothesis is tested through four diagnostic beams that apply different subsets of its constraints, so comparing them shows whether geometry, chemistry, or metal choice drives performance. Even when blind to the global property landscape of databases, LLM4MOF concentrates its search on top-performing structures across six adsorption, separation, and electronic-structure tasks within 400 property evaluations. The same loop also generates new MOFs de novo and validates them in live simulation, where it adapts the geometry to each requested condition, outperforming random search and a genetic algorithm at roughly $1 per campaign. LLM4MOF shows that language-model agents can run interpretable, simulation-grounded inverse design without training a model per objective.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LLM4MOF shows agentic hypothesis generation plus diagnostic beams can steer MOF inverse design in simulation with built-in checks on what drives performance.

The core contribution is a closed-loop system where one LLM agent generates interpretable hypotheses about metal nodes, linkers, and geometry, a second turns them into constraint sets, and four diagnostic beams test subsets of those constraints to isolate which factor actually matters. The loop runs for ten iterations, stays within roughly 400 property evaluations, and also produces new MOFs de novo that are validated live. It reports better concentration on top structures than random search or a genetic algorithm across adsorption, separation, and electronic-structure tasks.

The diagnostic beams are the part that stands out. Most optimization work in this space either fits a black-box model or just reports the final structure; here the comparison across beams gives a direct way to attribute success to geometry versus chemistry without extra post-hoc analysis. Using external molecular simulation as the evaluator keeps the method from fitting to its own outputs, and the lack of per-task surrogate training is practical when labels are costly.

The abstract supplies no error bars, trial counts, or baseline implementation details, so the size of the reported advantage is difficult to judge from the summary alone. If the full results show stable gains with proper controls on the six tasks, the claim holds; otherwise the work reads more as a demonstration than a definitive benchmark. The de-novo generation step also needs to show that the generated structures remain chemically reasonable and that the adaptation to requested conditions is reproducible.

This paper is aimed at researchers combining LLMs with materials simulation for inverse design, especially those who want interpretability without heavy model training. A reader already working on agentic workflows or MOF property prediction would find the constraint-beam idea worth testing.

I would send it to peer review so the quantitative claims and baseline details can be examined directly.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces LLM4MOF, a closed-loop framework in which language-model agents generate interpretable design hypotheses over metal nodes, linkers, pore geometry, and functional chemistry for metal-organic frameworks (MOFs), translate them into constraint sets, evaluate candidates via four diagnostic beams that isolate the contribution of each constraint subset, and refine hypotheses over ten autonomous iterations. The central claims are that the method concentrates search on top-performing structures across six adsorption, separation, and electronic-structure tasks within a 400-evaluation budget, generates and validates new de-novo MOFs in live simulation while adapting geometry to requested conditions, and outperforms random search and a genetic algorithm at roughly $1 per campaign, all without training a per-objective model.

Significance. If the quantitative results hold, the work would demonstrate that LLM agents can autonomously perform interpretable, simulation-grounded inverse design in a combinatorially large materials space. Strengths include the use of external molecular simulation as independent ground truth, the absence of per-task model training, the explicit isolation of design drivers via diagnostic beams, and the low per-campaign cost. These elements address a recognized need for explainable methods in expensive-label inverse design.

major comments (2)

[Abstract] Abstract and results presentation: the manuscript states that LLM4MOF 'concentrates its search on top-performing structures' and 'outperforming random search and a genetic algorithm' but supplies no quantitative metrics (e.g., mean rank, success rate, or property values), error bars, number of independent trials, or implementation details for the baselines. This absence prevents evaluation of the central empirical claim that the agent loop is superior within the 400-evaluation budget.
[Method] The description of the four diagnostic beams (which apply different subsets of constraints to isolate geometry, chemistry, or metal effects) is load-bearing for the interpretability claim, yet no concrete example of beam outputs, constraint subsets, or statistical comparison across beams is referenced in the provided text.

minor comments (1)

The cost estimate of roughly $1 per campaign would be clearer if the breakdown (LLM API calls versus simulation time) were stated explicitly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the clarity of our empirical claims and the supporting evidence for interpretability. We address each major comment below and will revise the manuscript to incorporate the requested details.

Referee: [Abstract] Abstract and results presentation: the manuscript states that LLM4MOF 'concentrates its search on top-performing structures' and 'outperforming random search and a genetic algorithm' but supplies no quantitative metrics (e.g., mean rank, success rate, or property values), error bars, number of independent trials, or implementation details for the baselines. This absence prevents evaluation of the central empirical claim that the agent loop is superior within the 400-evaluation budget.

Authors: We agree that the abstract and results lack the quantitative metrics needed to substantiate the central claims. In the revised manuscript we will update the abstract and add a results subsection that reports mean ranks (with standard deviations) of the top structures identified, success rates for recovering top-percentile performers, achieved property values, error bars from multiple independent trials (minimum of five), and full implementation details for the random-search and genetic-algorithm baselines, including their hyperparameters and execution within the identical 400-evaluation budget. Revision: yes.

Referee: [Method] The description of the four diagnostic beams (which apply different subsets of constraints to isolate geometry, chemistry, or metal effects) is load-bearing for the interpretability claim, yet no concrete example of beam outputs, constraint subsets, or statistical comparison across beams is referenced in the provided text.

Authors: We acknowledge that a concrete example is required to make the diagnostic-beam analysis fully transparent. The revised manuscript will include an explicit worked example of one design hypothesis together with the four constraint subsets, the resulting simulation outputs from each beam, and a statistical comparison (e.g., pairwise tests or effect-size measures) across beams that isolates the contribution of geometry, chemistry, and metal choice. Revision: yes.
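As an illustration of what an effect-size comparison across beams could look like, the sketch below computes Cohen's d with a pooled standard deviation between two beams. Cohen's d is one common effect-size measure chosen here for the example; the rebuttal does not say which measure the authors will use, and the beam names and property values are made up.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Cohen's d between two samples, using the pooled sample variance."""
    n_a, n_b = len(a), len(b)
    pooled_var = (
        (n_a - 1) * variance(a) + (n_b - 1) * variance(b)
    ) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Hypothetical simulated property values for structures from two beams.
full_beam = [2.0, 4.0, 6.0]
geometry_only = [1.0, 3.0, 5.0]

d = cohens_d(full_beam, geometry_only)  # means 4 and 3, pooled std 2
```

A large positive d between the full beam and a single-constraint beam would suggest that the constraints dropped in the smaller beam contribute to performance; with samples this small, any real analysis would also need the multiple independent trials the referee asks for.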

Circularity Check

0 steps flagged

No significant circularity detected

The paper's workflow consists of LLM agents generating hypotheses, translating them into constraints, and evaluating candidate MOFs via four diagnostic beams against external molecular simulation results. Performance is benchmarked by direct comparison to random search and a genetic algorithm within a fixed 400-evaluation budget, with de-novo generation also validated in live simulation. No equations, parameter fits, or derivations are described that reduce to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. The central claims rest on simulation-grounded empirical outcomes rather than internal re-derivation of quantities from the method's own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that current molecular simulation packages produce sufficiently accurate property rankings for the diagnostic comparisons to be meaningful and that LLM reasoning in the chemistry domain is reliable enough to generate useful hypotheses.

axioms (2)

Domain assumption: Molecular simulation packages produce sufficiently accurate property rankings for diagnostic beam comparisons. The method treats simulation outputs as the ground truth for hypothesis testing.
Domain assumption: LLMs can generate chemically valid and non-trivial design hypotheses from natural-language reasoning. The framework depends on the agents producing usable constraints without task-specific fine-tuning.

pith-pipeline@v0.9.1-grok · 5774 in / 1418 out tokens · 36029 ms · 2026-06-30T07:58:58.169646+00:00 · methodology

Reference graph

Works this paper leans on

2 extracted references · 1 canonical work pages · 1 internal anchor

[1]

EGMOF: Efficient Generation of Metal-Organic Frameworks Using a Hybrid Diffusion-Transformer Architecture

1 Kim, B., Lee, S. & Kim, J. Inverse design of porous materials using artificial neural networks. Science Advances 6, eaax9324 (2020). 2 Park, H., Li, Z. & Walsh, A. Has generative artificial intelligence solved inverse materials design? Matter 7, 2355–2367 (2024). 3 Sanchez-Lengeling, B. & Aspuru-Guzik, A. Inverse molecular design using machine learning...

work page internal anchor Pith review Pith/arXiv arXiv 2020
[2]

The Journal of Physical Chemistry B 102, 2569–2577 (1998)

United-atom description of n-alkanes. The Journal of Physical Chemistry B 102, 2569–2577 (1998)

1998
