AtomComposer: Discovering Chemical Space from First Principles with Reinforcement Learning

Arghya Bhowmik; Bjarke Hastrup; Francois Cornet; Tejs Vegge

arxiv: 2605.28287 · v1 · pith:FDND5ZDXnew · submitted 2026-05-27 · 💻 cs.LG · cond-mat.mtrl-sci


Pith reviewed 2026-06-29 14:29 UTC · model grok-4.3


keywords: reinforcement learning · molecular isomers · chemical space exploration · online learning · 3D molecular generation · generalization · stoichiometric constraints


The pith

A multi-composition reinforcement learning agent constructs valid 3D isomers from scratch and, on unseen chemical formulas, finds up to ten times more of them than single-composition baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AtomComposer as an autonomous agent that builds 3D molecular isomers under given stoichiometric constraints using only online reinforcement learning and no pretraining data. It trains the agent across many different chemical formulas at once rather than fixing it to one composition. This scheme produces substantially more valid isomers when the agent is later tested on formulas it has never encountered. If the result holds, molecular discovery can shift from models that require large curated datasets to agents that explore chemical space directly through interaction.

Core claim

AtomComposer is a self-guided reinforcement learning agent that autonomously assembles valid three-dimensional isomers while respecting stoichiometric constraints. It is trained exclusively online with energy- and validity-based rewards under a multi-composition scheme that exposes the agent to many formulas simultaneously. This yields up to an order of magnitude more valid isomers on unseen test formulas than existing single-composition reinforcement-learning baselines that use per-step energy rewards.

What carries the argument

The multi-composition training scheme, which trains the agent across diverse chemical formulas at once so that learned policies generalize instead of overfitting to any single stoichiometry.
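
A minimal sketch of what per-episode formula sampling could look like, assuming the agent receives a "bag" of atoms drawn uniformly at random from a pool of training formulas. The function names, the uniform sampling rule, and the example formulas are all hypothetical; the abstract does not describe the paper's actual sampling procedure.

```python
import random
import re
from collections import Counter


def parse_formula(formula: str) -> Counter:
    """Convert a formula such as 'C3H4O3' into a bag of atoms."""
    bag = Counter()
    # Each match is (element symbol, optional count); a missing count means 1.
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        bag[element] += int(count) if count else 1
    return bag


def sample_episode_bag(train_formulas, rng):
    """Draw the formula for the next episode uniformly at random (an assumption)."""
    formula = rng.choice(train_formulas)
    return formula, parse_formula(formula)


# Illustrative formulas only; not the paper's training or test split.
TRAIN_FORMULAS = ["C2H6O", "C3H8", "C2H5NO2"]
TEST_FORMULAS = ["C3H4O3"]  # held out: never sampled during training

rng = random.Random(0)
for _ in range(3):
    formula, bag = sample_episode_bag(TRAIN_FORMULAS, rng)
    # The agent would now place sum(bag.values()) atoms one at a time.
    print(formula, dict(bag))
```

A single-composition baseline corresponds to a pool with exactly one formula, which is why such a policy has no pressure to learn rules that carry over to other stoichiometries.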

If this is right

Molecular generation no longer requires large pre-curated datasets that introduce bias.
Exploration of chemical configuration space can proceed from scratch via online interaction.
A single trained agent can address many different stoichiometric targets without retraining.
The same online reinforcement learning loop scales to larger or more complex composition spaces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same multi-composition principle could be tested on discovering molecules with targeted properties beyond validity and energy.
If the agent learns reusable construction rules, it may transfer to related tasks such as crystal structure prediction under different constraints.
The approach suggests that constraint-satisfaction problems in other discrete configuration spaces could benefit from simultaneous training on varied instances.

Load-bearing premise

Training the agent on multiple compositions at once is sufficient to produce broad generalization that works on entirely new chemical formulas without overfitting.

What would settle it

Run the multi-composition agent on a held-out set of formulas never seen during training and measure whether the count of valid isomers remains within a factor of two of the single-composition baselines rather than reaching an order-of-magnitude improvement.
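
The test above reduces to a ratio of valid-isomer counts. A minimal sketch, assuming per-formula counts are already available for both agents; the function names and the numbers in the usage lines are made-up placeholders, not results from the paper.

```python
def improvement_factor(multi_counts, single_counts):
    """Total valid isomers found by the multi-composition agent divided by the baseline total."""
    total_multi = sum(multi_counts.values())
    total_single = sum(single_counts.values())
    if total_single == 0:
        raise ValueError("baseline found no valid isomers; the ratio is undefined")
    return total_multi / total_single


def verdict(factor, refute_below=2.0, support_at=10.0):
    """Map the ratio onto the two thresholds named in the text."""
    if factor < refute_below:
        return "undermined"    # within a factor of two of the baseline
    if factor >= support_at:
        return "consistent"    # order-of-magnitude improvement
    return "inconclusive"      # better than the baseline, but short of 10x


# Placeholder counts per held-out formula (illustrative only).
multi = {"C3H4O3": 30, "C4H6O3": 10}
single = {"C3H4O3": 3, "C4H6O3": 1}
print(verdict(improvement_factor(multi, single)))
```

Summing before dividing keeps a formula with a near-zero baseline count from dominating the result; averaging per-formula ratios instead would be a different, noisier statistic.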

Figures

Figures reproduced from arXiv: 2605.28287 by Arghya Bhowmik, Bjarke Hastrup, Francois Cornet, Tejs Vegge.

**Figure 1.** The AtomComposer multi-composition training and evaluation workflow. AtomComposer constructs isomer generation tasks by extracting chemical formulas from a reference dataset and introduces new terminal rewards based on validity and total energy. We evaluate the RL agents’ isomer discovery capabilities at just a single checkpoint, as well as cumulatively across the entire discovery campaign. Crucially, no …

**Figure 2.** Figure 2 illustrates our training and evaluation scheme. Through linear combinations of the 3 fundamental reward components (A, V, F) introduced in Box 2.1(a+b), we define 5 distinct reward functions A, AV, F, FV, and AFV, each corresponding to a separate agent that is trained independently three times using different random seeds (the linear coefficients are shown in …

**Figure 3.** Learning Curves (in-sample). (a) Validity and (b) unrelaxed Relative Atomic Energy (RAE) of continuously collected training rollouts, plotted against total number of single-atom placements (environment steps). The RAE metric quantifies the excess energy relative to the average energies of QM7 molecules with the same chemical formula (see page 26 for detailed metric definitions). Shading represents ±1 stand…

**Figure 4.** Q1 visualizations. Histograms of formation energy per atom (center column) together with top 3 best scoring molecules (lowest energy) after structural relaxation for Agent A on the left and Agent AV on the right. Agent A samples molecules of significantly lower energies. In …

**Figure 5.** Q2: Out-of-sample agent comparison. We report discovery metrics (top row) and geometry metrics (bottom row) in the multi-bag evaluation setting outlined in Fig. 2b (see page 26 for detailed metric definitions). Each point reflects a weighted average across 20 test bags. Error bars denote standard deviation across three random seeds. Results show that agent A consistently outperforms on 3D metrics, while AV…

**Figure 6.** Q3. Cumulative discovery campaign. (a) Number of novel SMILES discovered during training. (b) Number of QM7 SMILES rediscovered. (c)-(d) Total expansion and rediscovery relative to the size of QM7. Although the agents are able to discover many novel molecules and expand on the QM7 dataset by several multiples, their rediscovery ratios are remarkably consistently capped around 40%, thus indicating a subclas…

**Figure 7.** Functional group analysis. Normalized frequency difference of the 50 most common functional groups in the QM7 dataset between rediscovered molecules and the full QM7 reference set. To avoid rare functional groups dominating the extremes of the distribution, we report the normalized frequency difference $\tilde{\Delta}^i_f = (f^i_{\mathrm{RL}} - f^i_{\mathrm{QM7}}) / \sqrt{f^i_{\mathrm{QM7}}}$, where $f^i_D$ denotes the fraction of molecules in dataset $D$ that co…

**Figure 8.** Rediscovery energy distributions. The figure shows the formation energy distribution of all QM7 training molecules (grey), together with the energy distribution of rediscovered molecules for the two agents A and AV, with mean values shown vertically. Despite rediscovering less than 50%, the RL rediscovered energies are actually better (more negative) than the QM7 average …

**Figure 9.** Q4. Property-directed finetuning. (a) Validity and (b) dipole moment magnitude of training rollouts during finetuning, plotted against environment steps. A reward coefficient schedule gradually introduces the dipole moment reward alongside the existing atomization energy term. The red cross marks the evaluation checkpoint selected prior to significant validity collapse. (c) Relaxed dipole moment distributi…

**Figure 10.** Q4. Complete finetuning evaluation results. Each row corresponds to one of the 10 evaluated carbonate and ether formulas, with the two formulas highlighted in the main text (H4C3O3 and H6C4O3) shown in the first and third rows. For each formula, three columns are shown: (left) unrelaxed dipole moment distributions of all sampled molecules, (center) relaxed dipole moment distributions of all valid molecules, a…
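
The set-based quantities named in the captions of Figures 6 and 7 can be sketched as follows. This is a hedged illustration, not the paper's code: it assumes molecules are compared as canonicalized SMILES strings, reads the stray `q` in the extracted Figure 7 caption as a square root, and uses hypothetical function names and toy inputs.

```python
import math


def discovery_ratios(discovered, reference):
    """Return (rediscovery ratio, expansion ratio), both relative to the reference set size."""
    discovered, reference = set(discovered), set(reference)
    if not reference:
        raise ValueError("reference set is empty")
    rediscovered = discovered & reference   # molecules also present in the reference
    novel = discovered - reference          # molecules absent from the reference
    return len(rediscovered) / len(reference), len(novel) / len(reference)


def normalized_frequency_difference(f_rl, f_ref):
    """Compute (f_RL - f_ref) / sqrt(f_ref) for every functional group with f_ref > 0."""
    return {
        group: (f_rl.get(group, 0.0) - f) / math.sqrt(f)
        for group, f in f_ref.items()
        if f > 0
    }


# Toy inputs for illustration only.
reference = {"CCO", "COC", "CC=O", "CCC"}
discovered = {"CCO", "COC", "C=CO"}
print(discovery_ratios(discovered, reference))

print(normalized_frequency_difference({"alcohol": 0.5}, {"alcohol": 0.25}))
```

Dividing by the square root of the reference frequency shrinks the score of very common groups relative to a plain difference, while still weighting rare groups less than a relative difference would.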

Original abstract

Discovering novel stable molecules without training data remains a grand scientific challenge. Current molecular generative models are trained on large, pre-curated datasets, which introduce biases and limit exploration of novel chemistry. In contrast, we propose a new paradigm: autonomous, generalized agents capable of mapping vast, unknown chemical spaces without any pretraining. For the first time, we present AtomComposer, a self-guided agent that autonomously constructs valid 3D isomers under stoichiometric constraints and is trained exclusively online using reinforcement learning. Unlike existing approaches that generally overfit to a specific chemical formula, we establish a multi-composition training scheme that enables a broad generalization across diverse chemistry, guided by energy- and validity-based rewards. Our agent can discover up to an order of magnitude more valid isomers on unseen test formulas than existing single-composition reinforcement-learning baselines trained with per-step energy rewards. These results fulfill the promise of online reinforcement learning as a powerful paradigm for scalable, from-scratch exploration of chemical configuration space.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

AtomComposer claims order-of-magnitude gains on unseen formulas via multi-composition RL from scratch, but the generalization evidence needs the full experimental splits and results to be convincing.


The main takeaway is that AtomComposer uses reinforcement learning to build 3D molecular isomers from scratch with no pretraining data, relying on a multi-composition training scheme to generalize to new formulas, and it claims much higher success rates than single-composition baselines.

The approach is new in applying online RL with energy and validity rewards across varied compositions rather than fixing one formula per agent. This setup aims to let the agent learn general rules for valid structures instead of overfitting to specific atom counts. The paper does a good job explaining why data-driven models have biases and how this from-scratch method could avoid them.

The soft spots are around the generalization evidence. The headline performance claim on unseen formulas holds only if the training distribution covers diverse chemistry and the test formulas are not near-duplicates of training ones. The abstract does not spell out how compositions are sampled or exactly what counts as unseen, so the gains could come from better training dynamics rather than broad chemical understanding. Detailed OOD splits and ablations showing that multi-composition training is the key ingredient would strengthen the case. A minor issue: the abstract says little about how 3D construction is encoded in the RL actions.

This paper is aimed at researchers in machine learning for chemistry who want to move beyond dataset-dependent generative models. Someone working on exploration of chemical space would get value from seeing how the RL agent is set up and whether the results replicate.

It deserves serious refereeing because the problem is important and the method is distinct, though the evidence needs to be examined closely.

Recommendation: Send to peer review.

Referee Report

2 major / 2 minor

Summary. The paper introduces AtomComposer, a reinforcement learning agent trained from scratch to autonomously construct valid 3D molecular isomers under stoichiometric constraints. It proposes a multi-composition training scheme using energy- and validity-based rewards that enables generalization across diverse chemistry, claiming the agent discovers up to an order of magnitude more valid isomers on unseen test formulas than single-composition RL baselines trained with per-step energy rewards.

Significance. If the generalization results hold under rigorous OOD evaluation, the work would demonstrate a viable path for dataset-free exploration of chemical configuration space via online RL, addressing biases in pre-curated training data and potentially enabling broader chemical discovery.

major comments (2)

[Abstract, §3] Abstract and §3 (Methods): The headline claim of order-of-magnitude gains on 'unseen test formulas' is load-bearing for the central generalization thesis, yet the manuscript provides no explicit description of the multi-composition sampling procedure, the element sets or stoichiometry ranges used in training episodes versus test, or the precise operationalization of 'unseen' (e.g., novel element combinations, larger atom counts, or merely held-out stoichiometries). Without these details, it is impossible to distinguish genuine OOD generalization from reduced overfitting within statistically similar chemical spaces.
[§4] §4 (Experiments): The comparison to 'existing single-composition reinforcement-learning baselines' requires a clear statement of whether those baselines were also evaluated under the same multi-composition regime or retrained per formula; if the latter, the performance gap may be attributable to training protocol differences rather than the multi-composition scheme itself.

minor comments (2)

[§3] Notation for the validity and energy reward functions should be introduced with explicit equations rather than prose descriptions to allow reproducibility.
[Figures] Figure captions should include the exact number of independent runs and error bars used to generate the reported performance statistics.
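
The first minor comment asks for explicit reward equations. Going only by the Figure 2 caption, each agent's reward is a linear combination of three components A, V, and F, so one plausible form is r = c_A·A + c_V·V + c_F·F. The sketch below encodes that form; the unit coefficients are placeholders (the caption's actual coefficients are not reproduced on this page), and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Linear coefficients for the three reward components named in Figure 2."""
    a: float = 0.0
    v: float = 0.0
    f: float = 0.0


def terminal_reward(weights, a_term, v_term, f_term):
    """Linear combination c_A * A + c_V * V + c_F * F, given at the end of an episode."""
    return weights.a * a_term + weights.v * v_term + weights.f * f_term


# The five agents from the Figure 2 caption, with placeholder unit weights.
AGENTS = {
    "A": RewardWeights(a=1.0),
    "AV": RewardWeights(a=1.0, v=1.0),
    "F": RewardWeights(f=1.0),
    "FV": RewardWeights(f=1.0, v=1.0),
    "AFV": RewardWeights(a=1.0, f=1.0, v=1.0),
}

# Example component values for one finished molecule (illustrative numbers).
for name, weights in AGENTS.items():
    print(name, terminal_reward(weights, a_term=2.0, v_term=1.0, f_term=5.0))
```

Writing the reward this way makes each agent a point in a three-dimensional coefficient space, which is the level of detail a reader needs to reproduce the comparison.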

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity of our manuscript regarding the multi-composition training and baseline comparisons. We address each major comment below.


Referee: [Abstract, §3] Abstract and §3 (Methods): The headline claim of order-of-magnitude gains on 'unseen test formulas' is load-bearing for the central generalization thesis, yet the manuscript provides no explicit description of the multi-composition sampling procedure, the element sets or stoichiometry ranges used in training episodes versus test, or the precise operationalization of 'unseen' (e.g., novel element combinations, larger atom counts, or merely held-out stoichiometries). Without these details, it is impossible to distinguish genuine OOD generalization from reduced overfitting within statistically similar chemical spaces.

Authors: We agree that these details are essential and were insufficiently described in the original submission. We have revised §3 to explicitly detail the multi-composition sampling procedure (including how compositions are sampled per episode), the element sets (C, H, O, N and extensions) and stoichiometry ranges used in training versus test, and the definition of 'unseen' as novel element combinations and larger atom counts outside the training distribution. This revision clarifies the OOD evaluation. revision: yes
Referee: [§4] §4 (Experiments): The comparison to 'existing single-composition reinforcement-learning baselines' requires a clear statement of whether those baselines were also evaluated under the same multi-composition regime or retrained per formula; if the latter, the performance gap may be attributable to training protocol differences rather than the multi-composition scheme itself.

Authors: We agree this distinction must be stated clearly. The baselines were retrained per formula, consistent with their single-composition design. Our method's advantage stems from training one agent across multiple compositions. We have revised §4 to explicitly describe the protocols for our agent and the baselines, and added discussion noting that the gap reflects the multi-composition scheme rather than protocol alone. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical RL performance claim is self-contained


The paper reports an empirical result: an RL agent trained online with external energy/validity rewards discovers more valid isomers on held-out formulas than single-composition baselines. No derivation chain, equations, or self-citations are presented that reduce the performance claim to fitted parameters or prior author work by construction. The multi-composition scheme is an experimental training protocol whose generalization is tested directly against baselines; it does not contain self-definitional, fitted-input, or uniqueness-imported steps. The result is therefore not circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review is abstract-only; no implementation details available to enumerate free parameters, axioms, or invented entities. The approach implicitly relies on standard RL assumptions and molecular validity definitions.

axioms (1)

Domain assumption: Reinforcement learning with energy and validity rewards can guide construction of stable 3D molecular isomers.
Central to the online training scheme described in the abstract.



