Reinforcement Learning with LLM-Guided Action Spaces for Synthesizable Lead Optimization

Carl Yang; Kaiyuan Hou; Monika Raj; Tao Li; Tuan Vinh; Zhichun Guo

arXiv: 2604.07669 · v2 · submitted 2026-04-09 · cs.LG · cs.AI · cs.CE

Pith reviewed 2026-05-10 18:22 UTC · model grok-4.3

classification: cs.LG · cs.AI · cs.CE

keywords: optimization, reaction, lead, molreact, tasks, templates, across, action

The pith

MolReAct uses an LLM agent to define only chemically valid reaction steps as the action space for reinforcement learning in molecular lead optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets up lead optimization as a sequence of molecular changes that must each correspond to a real, template-backed chemical reaction rather than arbitrary edits. An LLM equipped with chemistry analysis tools examines the current molecule, identifies reactive sites and functional groups, and outputs a small list of feasible next transformations drawn from matched reaction templates. A separate policy model trained with Group Relative Policy Optimization then chooses among those constrained options to maximize long-term property rewards across multiple steps. The result is molecules that score higher on standard optimization benchmarks than prior methods while each carrying an explicit synthetic route.

Core claim

MolReAct formulates lead optimization as a Markov Decision Process whose action space is generated on the fly by a tool-augmented LLM agent that invokes chemical analysis tools to locate reactive sites and then proposes a compact set of chemically grounded transformations from validated reaction templates; a policy trained via Group Relative Policy Optimization selects actions to maximize cumulative oracle reward, and a SMILES caching layer speeds up repeated evaluations.
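
The control flow of this formulation can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `Action`, `rollout`, `toy_propose`, `greedy_policy`, and `toy_oracle` are hypothetical, the proposer is a lookup table standing in for the tool-augmented LLM agent, the policy is greedy rather than GRPO-trained, and the sketch assumes the oracle is scored after every step.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Action:
    template: str  # label of the reaction template (hypothetical)
    product: str   # SMILES of the molecule the template would produce


Proposer = Callable[[str], List[Action]]
Policy = Callable[[str, List[Action]], Action]
Oracle = Callable[[str], float]


def rollout(start: str, propose: Proposer, policy: Policy,
            oracle: Oracle, max_steps: int = 3):
    """Run one episode; return the synthetic route and the cumulative reward."""
    state, route, total = start, [], 0.0
    for _ in range(max_steps):
        actions = propose(state)         # constrained action space for this state
        if not actions:                  # no applicable template: episode ends
            break
        action = policy(state, actions)  # the policy chooses only among proposals
        state = action.product
        total += oracle(state)           # assumption: reward is scored at every step
        route.append((action.template, action.product))
    return route, total


# Toy stand-ins (assumptions, for illustration only).
TOY_ACTIONS = {
    "CCO": [Action("esterification", "CCOC(C)=O"), Action("oxidation", "CC=O")],
    "CCOC(C)=O": [Action("hydrolysis", "CC(=O)O")],
}


def toy_propose(smiles: str) -> List[Action]:
    return TOY_ACTIONS.get(smiles, [])


def toy_oracle(smiles: str) -> float:
    return len(smiles) / 10.0  # placeholder score, not a real property oracle


def greedy_policy(state: str, actions: List[Action]) -> Action:
    return max(actions, key=lambda a: toy_oracle(a.product))


route, total = rollout("CCO", toy_propose, greedy_policy, toy_oracle)
```

Because every entry in `route` records the template that produced it, the final molecule comes with its synthetic pathway by construction, which is the property the core claim emphasizes.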

What carries the argument

The tool-augmented LLM agent carries the argument. It acts as the dynamic reaction environment: it matches the current molecule against reaction templates and emits only a small set of valid transformations, which then serve as the constrained action space for the reinforcement learning policy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the reaction templates and tool calls remain reliable on novel molecular scaffolds, the same trained policy could be reused across additional property objectives without retraining.
The explicit template grounding opens the possibility of feeding the proposed synthetic steps directly into automated synthesis planners or experimental validation loops.
Because the action space shrinks dramatically at each step, longer optimization trajectories become computationally tractable compared with fully generative approaches.
The caching of SMILES evaluations suggests that performance gains could compound when the same intermediates appear across multiple independent optimization runs.
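
A SMILES-keyed cache of the kind mentioned in the last point could look like the sketch below. The source does not say what exactly is cached or how keys are normalized, so this is an assumption-laden illustration: `SmilesCache` and `slow_oracle` are hypothetical names, the key is the whitespace-stripped string, and a real system would presumably canonicalize the SMILES first so that different spellings of one molecule share an entry.

```python
from typing import Callable, Dict, List


class SmilesCache:
    """Memoize an expensive per-molecule function, keyed by SMILES string."""

    def __init__(self, fn: Callable[[str], float]):
        self._fn = fn
        self._store: Dict[str, float] = {}
        self.hits = 0
        self.misses = 0

    def __call__(self, smiles: str) -> float:
        key = smiles.strip()  # assumption: no chemical canonicalization here
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._fn(key)
        return self._store[key]


calls: List[str] = []


def slow_oracle(smiles: str) -> float:
    calls.append(smiles)       # record how often the real function runs
    return float(len(smiles))  # placeholder score


cached = SmilesCache(slow_oracle)
scores = [cached(s) for s in ["CCO", "CC=O", "CCO", " CCO "]]
```

Sharing one `SmilesCache` instance across independent runs is what would let repeated intermediates pay off more than once.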

Load-bearing premise

The LLM agent must correctly identify all relevant reactive sites and functional groups and then propose a complete, valid collection of transformations from the templates without missing productive reactions or suggesting invalid ones.

What would settle it

Running the system on a new set of molecules where the LLM either proposes a chemically invalid transformation or omits a known productive reaction route, producing final molecules whose property scores fall below those obtained by an unconstrained generative baseline.

Figures

Figures reproduced from arXiv: 2604.07669 by Carl Yang, Kaiyuan Hou, Monika Raj, Tao Li, Tuan Vinh, Zhichun Guo.

**Figure 2.** Ablation on tool-guided proposal and policy optimization across target activity tasks. [Body text captured with the caption:] … is standardized within the group by subtracting the group mean and dividing by the group standard deviation to obtain a group-relative advantage. This advantage is then assigned to every step in the corresponding trajectory, enabling trajectory-level credit assignment. The policy is then updated by maximizing the standar…

**Figure 3.** Building block analysis. To evaluate whether the building blocks proposed by MolReAct can be readily obtained from commercial suppliers, we perform a post-hoc availability analysis on the four protein-target activity tasks. Using the Enamine building block catalog (∼2.1M compounds) as a reference [51], we apply an exact-match filter during evaluation that retains a proposed reaction when all of its non-i…

**Figure 4.** Representative synthetic pathways discovered by MolReAct on four protein-target activity …

**Figure 5.** Top-10, Top-30, and Top-50 scores vs. oracle calls on four protein-target activity tasks.

**Figure 6.** Distribution of valid reactions proposed per query during training on the sEH task. The …
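
The text captured with the Figure 2 caption describes a group-relative advantage: a per-trajectory quantity is standardized within its group and the result is assigned to every step of that trajectory. The sketch below illustrates that computation under stated assumptions: the subject of the captured sentence is cut off, so a scalar trajectory-level return is assumed; the population standard deviation and the `eps` stabilizer are choices made here, not details from the paper; the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(returns: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize each trajectory's return against its own group."""
    mu = mean(returns)
    sigma = pstdev(returns)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in returns]


def broadcast_to_steps(advantages: List[float],
                       steps_per_trajectory: List[int]) -> List[List[float]]:
    """Give every step of a trajectory that trajectory's advantage."""
    return [[a] * n for a, n in zip(advantages, steps_per_trajectory)]


# Three trajectories sampled for the same starting molecule (made-up returns).
advs = group_relative_advantages([1.0, 2.0, 3.0])
per_step = broadcast_to_steps(advs, [2, 1, 3])
```

A trajectory that scores above its group mean receives a positive advantage on all of its steps, and one that scores below receives a negative one; when all returns in a group are equal, every advantage is zero.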

Original abstract

Lead optimization in drug discovery requires improving therapeutic properties while ensuring that molecular modifications correspond to feasible synthetic routes. Existing approaches either prioritize property scores without enforcing synthesizability, or rely on expensive enumeration over large reaction networks, while direct application of Large Language Models (LLMs) to molecular generation frequently produces chemically invalid structures. We introduce MolReAct, a framework that formulates lead optimization as a Markov Decision Process over a synthesis-constrained action space defined by validated reaction templates. A tool-augmented LLM agent serves as a dynamic reaction environment, invoking specialized chemical analysis tools to identify reactive sites and functional groups and proposing a compact set of chemically grounded transformations from matched templates. A dedicated policy model trained via Group Relative Policy Optimization (GRPO) selects among these constrained actions to maximize long-term oracle reward across multi-step trajectories, with a SMILES-based caching mechanism reducing end-to-end optimization time by approximately 43%. Across 13 property optimization tasks from the Therapeutic Data Commons and one structure-based docking task, MolReAct achieves an average Top-10 score of 0.571, the highest among all baselines, ranking first or second on 13 of 14 tasks and attaining the best sample efficiency on 9 of 14 tasks. By grounding every optimization step in validated reaction templates, MolReAct produces molecules that are not only property-improved but each accompanied by an explicit template-grounded synthetic pathway.
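
For readers unfamiliar with the metric, a Top-10 score is commonly computed as the mean oracle score of the ten best molecules found in a run. The sketch below assumes that convention (the one used in the molecular optimization benchmark cited as [46]); the abstract itself does not define the metric, `top_k_mean` is a hypothetical helper, and the example scores are made up rather than taken from the paper.

```python
from typing import Iterable


def top_k_mean(scores: Iterable[float], k: int = 10) -> float:
    """Mean of the k highest scores (or of all scores if fewer than k exist)."""
    best = sorted(scores, reverse=True)[:k]
    if not best:
        raise ValueError("no scores to aggregate")
    return sum(best) / len(best)


# Illustrative values only.
example = top_k_mean([0.1, 0.9, 0.5, 0.7], k=2)  # mean of 0.9 and 0.7
```

Under this reading, the reported 0.571 would be the average of such per-task Top-10 values over the 14 tasks.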

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MolReAct uses an LLM to dynamically propose reaction templates inside an RL MDP for molecular optimization and beats baselines on most tasks, but the reported edge may stem from an unverified, possibly incomplete action space rather than stronger planning.

The main point on this one: the authors have a new way to keep molecular RL grounded in real chemistry by having an LLM propose only valid reaction steps. It performs well on the tests, though the advantage may be less robust than it looks.

What they do well is set up the problem as an MDP whose actions come from LLM-matched templates, produced after tool calls for site identification. Training the policy with GRPO lets it plan over multiple steps for better long-term properties, and caching by SMILES string cuts end-to-end optimization time by about 43%. Their results put it first on average Top-10 across the 14 tasks, with the best sample efficiency on 9 of them.

The soft spot is the lack of any check on how much of the valid reaction space the LLM actually covers. The stress test is right: if the proposals miss some valid reactions, the policy optimizes over fewer options than a full search would, which could make the numbers look better without proving better optimization. The abstract gives no recall or consistency numbers, and since the abstract is all we have, that is a real gap for the synthesizability claim.

This is for researchers in computational drug design or RL applications to chemistry. Someone building similar systems could borrow the LLM-tool-plus-policy idea. The method and results are concrete enough to go to a serious referee, who could push for the missing validation experiments. I'd say send it for peer review, but flag action-space completeness as something to verify.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MolReAct, a framework that formulates lead optimization as a Markov Decision Process (MDP) over a synthesis-constrained action space. A tool-augmented LLM agent uses chemical analysis tools to identify reactive sites and functional groups, then proposes transformations from matched reaction templates. A policy trained with Group Relative Policy Optimization (GRPO) selects actions to maximize long-term oracle reward, with SMILES caching for efficiency. On 13 Therapeutics Data Commons (TDC) property optimization tasks plus one docking task, it reports the highest average Top-10 score (0.571), ranks first or second on 13 of 14 tasks, and attains the best sample efficiency on 9 of 14, while guaranteeing that each output molecule has an explicit template-grounded synthetic pathway.

Significance. If the LLM agent reliably produces complete and valid action spaces, the approach could meaningfully advance practical synthesizable molecular optimization by combining LLM chemical reasoning with RL long-horizon planning, offering better sample efficiency than exhaustive enumeration while avoiding the invalid structures common in unconstrained LLM generation.

major comments (2)

[Abstract] Abstract: The central empirical claims (average Top-10 score of 0.571, first/second ranking on 13/14 tasks, best sample efficiency on 9/14 tasks) rest on the action space being defined entirely by the tool-augmented LLM's template proposals, yet no quantitative coverage metric (recall of all template-applicable reactions, false-negative rate on reactive sites, or inter-run consistency) is supplied; this is load-bearing because an incomplete action space would make performance gains potentially attributable to reduced branching factor rather than superior planning via GRPO, undermining both the synthesizability guarantee and the efficiency interpretation.
[Abstract] Abstract and Results: The ranking and efficiency superiority claims require explicit details on baseline implementations, statistical testing procedures, controls for data leakage, and how reaction template coverage was verified; without these, the reported outperformance cannot be fully verified as robust.

minor comments (2)

The 43% time reduction from the SMILES-based caching mechanism should be accompanied by per-task timing tables and direct comparisons to baseline runtimes for clarity.
All acronyms (GRPO, TDC, MDP) should be expanded on first use in the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and will incorporate clarifications and additional analyses in a revised version to strengthen the empirical support for our claims.

Referee: [Abstract] Abstract: The central empirical claims (average Top-10 score of 0.571, first/second ranking on 13/14 tasks, best sample efficiency on 9/14 tasks) rest on the action space being defined entirely by the tool-augmented LLM's template proposals, yet no quantitative coverage metric (recall of all template-applicable reactions, false-negative rate on reactive sites, or inter-run consistency) is supplied; this is load-bearing because an incomplete action space would make performance gains potentially attributable to reduced branching factor rather than superior planning via GRPO, undermining both the synthesizability guarantee and the efficiency interpretation.

Authors: We appreciate the referee pointing out the need for quantitative coverage metrics. The synthesizability guarantee applies to each output molecule, as every action is drawn from a validated reaction template proposed by the LLM agent, providing an explicit template-grounded pathway. We agree, however, that metrics on coverage would help rule out reduced branching factor as the sole driver of gains. In revision we will add a dedicated analysis: on a random subset of 100 starting molecules per task, we will exhaustively enumerate all template-applicable reactions using RDKit and compare against the LLM agent's proposals to compute recall and false-negative rates on reactive sites. We will also report inter-run consistency by executing the agent five times on the same inputs and measuring overlap in proposed actions. These results will be presented alongside the main experiments to support that performance differences reflect GRPO planning rather than action-space size alone.

Revision: yes

Referee: [Abstract] Abstract and Results: The ranking and efficiency superiority claims require explicit details on baseline implementations, statistical testing procedures, controls for data leakage, and how reaction template coverage was verified; without these, the reported outperformance cannot be fully verified as robust.

Authors: We agree that greater transparency on these implementation and verification details is required. In the revised manuscript we will expand the Methods and Experimental Setup sections with: (i) full specifications of each baseline (including code repositories used, any modifications to original implementations, and hyperparameter choices); (ii) statistical procedures (multiple independent runs with reported means, standard deviations, and paired Wilcoxon signed-rank tests with p-values for ranking comparisons); (iii) explicit statement that the 13 TDC tasks use publicly released benchmark splits with no overlap to any pre-training data for the policy network or the LLM; and (iv) our template-coverage verification protocol, which combined automated matching against the USPTO-derived template library with manual review of 200 randomly sampled LLM-proposed reactions by two co-authors with chemistry backgrounds. These additions will enable independent verification of the reported Top-10 scores, rankings, and sample-efficiency results.

Revision: yes
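
The coverage analysis promised in the first response (recall against an exhaustive enumeration, false-negative rate, and inter-run consistency) reduces to a few set operations. The sketch below is one possible reading, not the authors' protocol: reactions are represented as opaque string identifiers, the function names are hypothetical, and mean pairwise Jaccard similarity is an assumed choice for "overlap in proposed actions", which the rebuttal does not define.

```python
from itertools import combinations
from typing import List, Set


def recall(proposed: Set[str], reference: Set[str]) -> float:
    """Fraction of reference (exhaustively enumerated) reactions the agent proposed."""
    if not reference:
        return 1.0  # nothing to find, so nothing was missed
    return len(proposed & reference) / len(reference)


def false_negative_rate(proposed: Set[str], reference: Set[str]) -> float:
    """Fraction of reference reactions the agent failed to propose."""
    return 1.0 - recall(proposed, reference)


def mean_pairwise_jaccard(runs: List[Set[str]]) -> float:
    """Average Jaccard similarity over all pairs of repeated runs."""
    def jaccard(a: Set[str], b: Set[str]) -> float:
        union = a | b
        return len(a & b) / len(union) if union else 1.0

    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0  # a single run is trivially consistent with itself
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Made-up identifiers for illustration.
reference = {"r1", "r2", "r3", "r4"}
proposed = {"r1", "r2", "rX"}  # rX is not in the reference enumeration
coverage = recall(proposed, reference)
consistency = mean_pairwise_jaccard([{"r1", "r2"}, {"r1", "r2"}, {"r1"}])
```

Recall alone does not penalize proposals outside the reference set (such as `rX` above), so a validity or precision check would need to be reported alongside it.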

Circularity Check

0 steps flagged

No circularity; derivation is self-contained against external benchmarks

The paper defines an MDP whose action space is constructed by a tool-augmented LLM agent matching reaction templates, then trains a policy via GRPO to maximize oracle rewards on Therapeutics Data Commons tasks and a docking task. All reported metrics (Top-10 scores, sample efficiency) are computed on held-out external oracles and datasets; no equation or result is obtained by fitting a parameter to a subset and relabeling it as a prediction, no self-citation supplies a load-bearing uniqueness theorem, and no ansatz is smuggled in. The central performance claims therefore rest on independent empirical evaluation rather than any definitional or self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim depends on the coverage and validity of external reaction templates plus reliable LLM tool behavior for action proposal; these are domain assumptions not derived within the paper.

axioms (2)

domain assumption: Molecules can be faithfully represented and modified via SMILES strings and a fixed library of validated reaction templates.
Invoked to define the synthesis-constrained action space in the MDP formulation.
ad hoc to paper: The tool-augmented LLM can accurately detect reactive sites and functional groups to propose only valid transformations.
Required for the dynamic reaction environment to generate the compact action set at each step.

invented entities (1)

MolReAct framework (no independent evidence)
purpose: Integrates LLM-guided action proposal with GRPO policy optimization for synthesizable molecular trajectories.
New composite method introduced by the paper; no independent evidence provided beyond the reported benchmarks.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation to the paper passage: unclear
Paper passage: "formulates lead optimization as a Markov Decision Process over a synthesis-constrained action space defined by validated reaction templates... GRPO selects among these constrained actions to maximize long-term oracle reward"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation to the paper passage: unclear
Paper passage: "tool-augmented LLM agent... proposes a compact set of chemically grounded transformations"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 1 internal anchor

[1] Kirk E. Hevener, Russell Pesavento, JinHong Ren, Hyun Lee, Kiira Ratia, and Michael E. Johnson. Chapter twelve - hit-to-lead: Hit validation and assessment. In Modern Approaches in Drug Discovery, volume 610, pages 265–309. Academic Press, 2018

[2] Diane Joseph-McCarthy, J. Christian Baber, Eric Feyfant, David C. Thompson, and Christine Humblet. Lead optimization via high-throughput molecular docking. Current Opinion in Drug Discovery & Development, 2007

[3] György M. Keserü and Gergely M. Makara. The influence of lead discovery strategies on the properties of drug candidates. Nature Reviews Drug Discovery, 2009

[4] Odin Zhang, Haitao Lin, Hui Zhang, Huifeng Zhao, Yufei Huang, Chang-Yu Hsieh, Peichen Pan, and Tingjun Hou. Deep lead optimization: Leveraging generative ai for structural modification. Journal of the American Chemical Society, 146(46):31357–31370, 2024

[5] Sven M. Papidocha, Andreas Burger, Varinia Bernales, and Alán Aspuru-Guzik. The elephant in the lab: synthesizability in generative small-molecule design. Current Opinion in Chemical Engineering, 51:101217, 2026. ISSN 2211-3398

[6] Raj Ghugare, Santiago Miret, Adriana Hugessen, Mariano Phielipp, and Glen Berseth. Searching for high-value molecules using reinforcement learning and transformers. In Proceedings of the International Conference on Learning Representations, 2024

[7] Yuanxin Zhuang, Dazhong Shen, and Ying Sun. MoleditRL: Structure-preserving molecular editing via discrete diffusion and reinforcement learning. In The Fourteenth International Conference on Learning Representations, 2026

[8] Xiuyuan Hu, Guoqing Liu, Yang Zhao, and Hao Zhang. De novo drug design using reinforcement learning with multiple gpt agents. In Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023

[9] Jinyeong Park, Jaegyoon Ahn, Jonghwan Choi, and Jibum Kim. Mol-air: Molecular reinforcement learning with adaptive intrinsic rewards for goal-directed molecular generation. Journal of Chemical Information and Modeling, 65(5):2283–2296, 2025

[10] Ruheng Wang, Hang Zhang, Trieu Nguyen, Shasha Feng, Hao-Wei Pang, Xiang Yu, Li Xiao, and Peter Zhiping Zhang. Pepthink-r1: LLM for interpretable cyclic peptide optimization with cot SFT and reinforcement learning. In NeurIPS 2025 AI for Science Workshop, 2025

[11] Jan H. Jensen. A graph-based genetic algorithm and generative model/monte carlo tree search for the exploration of chemical space. Chemical Science, 10(12):3567–3572, 2019

[12] Haorui Wang, Marta Skreta, Cher-Tian Ser, Wenhao Gao, Lingkai Kong, Felix Strieth-Kalthoff, Chenru Duan, Yuchen Zhuang, Yue Yu, Yanqiao Zhu, Yuanqi Du, Alán Aspuru-Guzik, Kirill Neklyudov, and Chao Zhang. Efficient evolutionary search over chemical space with large language models. In Proceedings of the International Conference on Learning Representations, 2025

[13] Vishal Dey, Xiao Hu, and Xia Ning. GeLLM³O: Generalizing large language models for multi-property molecule optimization. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2025

[14] Geyan Ye, Xibao Cai, Houtim Lai, Xing Wang, Junhong Huang, Longyue Wang, Wei Liu, and Xiangxiang Zeng. Drugassist: a large language model for molecule optimization. Briefings in Bioinformatics, 26(1):bbae693, 01 2025

[15] Jinho Chang and Jong Chul Ye. Ldmol: A text-to-molecule diffusion model with structurally informative latent space surpasses ar models. International Conference on Machine Learning, 2025

[16] Seul Lee, Karsten Kreis, Srimukh Prasad Veccham, Meng Liu, Danny Reidenbach, Saee Gopal Paliwal, Weili Nie, and Arash Vahdat. Exploring synthesizable chemical space with iterative pathway refinements. In The Fourteenth International Conference on Learning Representations, 2026

[17] Kyle Swanson, Gary Liu, Denise B. Catacutan, Autumn Arnold, James Zou, and Jonathan M. Stokes. Generative ai for designing and validating easily synthesizable and structurally novel antibiotics. Nature Machine Intelligence, 6:338–353, 2024

[18] Shogo Nakamura, Nobuaki Yasuo, and Masakazu Sekijima. Molecular optimization using a conditional transformer for reaction-aware compound exploration with reinforcement learning. Communications Chemistry, 8(40), 2025

[19] Carl Edwards, Chi Han, Gawon Lee, Thao Nguyen, Sara Szymkuć, Chetan Kumar Prasad, Bowen Jin, Jiawei Han, Ying Diao, Ge Liu, Hao Peng, Bartosz Andrzej Grzybowski, Martin D. Burke, and Heng Ji. mCLM: A modular chemical language model that generates functional and makeable molecules. In The Fourteenth International Conference on Learning Representations, 2026

[20] Aryan Pedawi, Paweł Gniewek, Chaoyi Chang, Brandon M. Anderson, and Henry van den Bedem. An efficient graph generative model for navigating ultra-large combinatorial synthesis libraries. In Proceedings of the 36th International Conference on Neural Information Processing Systems. Curran Associates Inc., 2022

[21] Ksenia Korovina, Sailun Xu, Kirthevasan Kandasamy, Willie Neiswanger, Barnabas Poczos, Jeff Schneider, and Eric Xing. Chembo: Bayesian optimization of small organic molecules with synthesizable recommendations. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pages 3393–3403. PMLR, 2020

[22] Yiheng Zhu, Jialu Wu, Chaowen Hu, Jiahuan Yan, Chang-Yu Hsieh, Tingjun Hou, and Jian Wu. Sample-efficient multi-objective molecular optimization with gflownets. In Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023

[23] Michał Koziarski, Andrei Rekesh, Dmytro Shevchuk, Almer van der Sloot, Piotr Gaiński, Yoshua Bengio, Cheng-Hao Liu, Mike Tyers, and Robert A. Batey. Rgfn: synthesizable molecular generation using gflownets. In Proceedings of the 38th International Conference on Neural Information Processing Systems. Curran Associates Inc., 2024

[24] Miruna Cretu, Charles Harris, Ilia Igashov, Arne Schneuing, Marwin Segler, Bruno Correia, Julien Roy, Emmanuel Bengio, and Pietro Lio. Synflownet: Design of diverse and novel molecules with synthesis constraints. In The Thirteenth International Conference on Learning Representations, 2025

[25] Seonghwan Seo, Minsu Kim, Tony Shen, Martin Ester, Jinkyoo Park, Sungsoo Ahn, and Woo Youn Kim. Generative flows on synthetic pathway for drug design. In The Thirteenth International Conference on Learning Representations, 2025

[26] Mengying Sun, Jing Xing, Han Meng, Huijun Wang, Bin Chen, and Jiayu Zhou. Molsearch: Search-based multi-objective molecular generation and property optimization. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, 2022

[27] Wenhao Gao, Shitong Luo, and Connor W. Coley. Generative artificial intelligence for navigating synthesizable chemical space. Proceedings of the National Academy of Sciences, 122(41): e2415665122, 2025

[28] Michael Sun, Alston Lo, Minghao Guo, Jie Chen, Connor W. Coley, and Wojciech Matusik. Procedural synthesis of synthesizable molecules. In The Thirteenth International Conference on Learning Representations, 2025

[29] Shitong Luo, Wenhao Gao, Zuofan Wu, Jian Peng, Connor W. Coley, and Jianzhu Ma. Projecting molecules into synthesizable chemical spaces. In Proceedings of the 41st International Conference on Machine Learning. JMLR.org, 2024

[30] Kunyang Sun, Dorian Bagni, Joseph M. Cavanagh, Yingze Wang, Jacob M. Sawyer, Bo Zhou, Andrew Gritsevskiy, Oufan Zhang, and Teresa Head-Gordon. Synllama: Generating synthesizable molecules and their analogs with large language models. ACS Central Science, 11(11): 2108–2120, 2025

[31] Taicheng Guo, Kehan Guo, Bozhao Nan, Zhenwen Liang, Zhichun Guo, Nitesh V. Chawla, Olaf Wiest, and Xiangliang Zhang. What can large language models do in chemistry? a comprehensive benchmark on eight tasks. In Proceedings of the 37th International Conference on Neural Information Processing Systems. Curran Associates Inc., 2023

[32] Kevin Maik Jablonka, Philippe Schwaller, Andres Ortega-Guerrero, and Berend Smit. Leveraging large language models for predictive chemistry. Nature Machine Intelligence, 2024

[33] Jinyoung Park, Minseong Bae, Dohwan Ko, and Hyunwoo J. Kim. LLamo: Large language model-based molecular graph assistant. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

[34] Kehan Guo, Bozhao Nan, Yujun Zhou, Taicheng Guo, Zhichun Guo, Mihir Surve, Zhenwen Liang, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang. Can LLMs solve molecule puzzles? a multimodal benchmark for molecular structure elucidation. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024

[35] Yin Fang, Xiaozhuan Liang, Ningyu Zhang, Kangwei Liu, Rui Huang, Zhuo Chen, Xiaohui Fan, and Huajun Chen. Mol-instructions: A large-scale biomolecular instruction dataset for large language models. In The Twelfth International Conference on Learning Representations, 2024

[36] Wen Tao, Jing Tang, Alvin Chan, Bryan Hooi, Baolong Bi, Nanyun Peng, Yuansheng Liu, and Yiwei Wang. How to make large language models generate 100% valid molecules? In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, November 2025

[37] A. M. Bran, S. Cox, O. Schilter, et al. Augmenting large language models with chemistry tools. Nature Machine Intelligence, 6:525–535, 2024

[38] Hyomin Kim, Yunhui Jang, and Sungsoo Ahn. MT-mol: Multi agent system with tool-based reasoning for molecular optimization. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computational Linguistics: EMNLP 2025. Association for Computational Linguistics, November 2025

[39] Yue Huang, Zhengzhe Jiang, Xiaonan Luo, Kehan Guo, Haomin Zhuang, Yujun Zhou, Zhengqing Yuan, Xiaoqi Sun, Jules Schleinitz, Yanbo Wang, Shuhao Zhang, Mihir Surve, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang. Chemorch: Empowering LLMs with chemical intelligence via groundbreaking synthetic instructions. In The Thirty-ninth Annual Conference on Neural ...
[40]

Coley, Cao Xiao, Jimeng Sun, and Marinka Zitnik

Kexin Huang, Tianfan Fu, Wenhao Gao, Yue Zhao, Yusuf H Roohani, Jure Leskovec, Connor W. Coley, Cao Xiao, Jimeng Sun, and Marinka Zitnik. Therapeutics data commons: Machine learning datasets and tasks for drug discovery and development. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021

[41] Emmanuel Bengio, Moksh Jain, Maksym Korablyov, Doina Precup, and Yoshua Bengio. Flow network based generative models for non-iterative diverse candidate generation. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, pages 27381–27394. Curran Associates, Inc., 2021.

[42] Mingyang Wang, Shuai Li, Jike Wang, Odin Zhang, Hongyan Du, Dejun Jiang, Zhenxing Wu, Yafeng Deng, Yu Kang, Peichen Pan, et al. ClickGen: Directed exploration of synthesizable chemical space via modular reactions and reinforcement learning. Nature Communications, 15(1):10127, 2024.

[43] Haorui Wang, Jeff Guo, Lingkai Kong, Rampi Ramprasad, Philippe Schwaller, Yuanqi Du, and Chao Zhang. LLM-augmented chemical synthesis and design decision programs. In Forty-second International Conference on Machine Learning, 2025.

[44] Wei Liu, Jiangtao Feng, Hongli Yu, Yuxuan Song, Yuqiang Li, Shufei Zhang, Lei Bai, Wei-Ying Ma, and Hao Zhou. Retro-r1: LLM-based agentic retrosynthesis. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

[45] Shunyu Yao, Jeffrey Zhao, Dian Yu, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In NeurIPS 2022 Foundation Models for Decision Making Workshop, 2022.

[46] Wenhao Gao, Tianfan Fu, Jimeng Sun, and Connor W. Coley. Sample efficiency matters: a benchmark for practical molecular optimization. In Proceedings of the 36th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2022. Curran Associates Inc.

[47] John J. Irwin, Teague Sterling, Michael M. Mysinger, Erin S. Bolstad, and Ryan G. Coleman. ZINC: A free tool to discover chemistry for biology. Journal of Chemical Information and Modeling, 2012.

[48] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. The Llama 3 herd of models, 2024.

[49] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, New York, NY, USA, 2023. Association for Computing Machinery.

[50] An Yang, Anfeng Li, Baosong Yang, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[51] Enamine. Building blocks catalog, 2023. URL https://enamine.net/building-blocks/building-blocks-catalog

[52] Oleg Trott and Arthur J. Olson. AutoDock Vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. Journal of Computational Chemistry, 31(2):455–461, 2010.

[53] Greg Landrum et al. RDKit: Open-source cheminformatics software, 2016. URL http://www.rdkit.org/. https://github.com/rdkit/rdkit

[54] Harrison Chase. LangChain, 2022. URL https://github.com/langchain-ai/langchain.

A Implementation Details

A.1 GRPO Training Hyperparameters

The policy model is Qwen3-4B-Instruct, trained with GRPO using the MemoryEfficientAdamW optimizer on a single NVIDIA RTX 6000 Ada GPU (48 GB). Table 3 summarizes the hyperparameters shared across all 14 benchmark ta...
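As a rough illustration of the group-relative advantage that gives GRPO its name, the sketch below normalizes rewards within one group of sampled actions. It is a generic sketch under stated assumptions, not the paper's training code: the function name `group_relative_advantages`, the `eps` value, the use of the population standard deviation, and the example reward values are all hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Return (r - group mean) / (group std + eps) for each reward.

    Assumption: population standard deviation over the group; some
    implementations use the sample standard deviation instead.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical oracle rewards for four actions sampled from the same state.
advantages = group_relative_advantages([0.2, 0.5, 0.5, 0.8])
```

Actions that score above the group mean receive a positive advantage and those below it a negative one, so no separate value network is needed to form the baseline.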

