
arxiv: 2604.07669 · v2 · submitted 2026-04-09 · 💻 cs.LG · cs.AI · cs.CE

Reinforcement Learning with LLM-Guided Action Spaces for Synthesizable Lead Optimization

Pith reviewed 2026-05-10 18:22 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CE
keywords optimization · reaction · lead · molreact · tasks · templates · across · action

The pith

MolReAct uses an LLM agent to define only chemically valid reaction steps as the action space for reinforcement learning in molecular lead optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets up lead optimization as a sequence of molecular changes that must each correspond to a real, template-backed chemical reaction rather than arbitrary edits. An LLM equipped with chemistry analysis tools examines the current molecule, identifies reactive sites and functional groups, and outputs a small list of feasible next transformations drawn from matched reaction templates. A separate policy model trained with Group Relative Policy Optimization then chooses among those constrained options to maximize long-term property rewards across multiple steps. The result is molecules that score higher on standard optimization benchmarks than prior methods while each carrying an explicit synthetic route.

Core claim

MolReAct formulates lead optimization as a Markov Decision Process whose action space is generated on the fly by a tool-augmented LLM agent. The agent invokes chemical analysis tools to locate reactive sites, then proposes a compact set of chemically grounded transformations from validated reaction templates. A policy trained via Group Relative Policy Optimization selects among these actions to maximize cumulative oracle reward, and a SMILES caching layer speeds up repeated evaluations.

What carries the argument

The tool-augmented LLM agent that acts as the dynamic reaction environment by matching the current molecule against reaction templates and emitting only a small set of valid transformations to serve as the constrained action space for the reinforcement learning policy.
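The propose-then-select loop described above can be sketched in a few lines. This is a toy illustration only, not the paper's implementation: the "templates" are plain string rewrites standing in for real reaction templates, and `TOY_TEMPLATES`, `propose_actions`, `greedy_policy`, `toy_oracle`, and `rollout` are all hypothetical names. In the actual system, the proposal step is performed by an LLM with chemistry tools and the selection step by a GRPO-trained policy.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins for reaction templates: (pattern, replacement) string rewrites.
# These are NOT chemically meaningful; they only mimic "a template either matches or not".
TOY_TEMPLATES: Dict[str, Tuple[str, str]] = {
    "amide_coupling": ("C(=O)O", "C(=O)NC"),
    "esterification": ("C(=O)O", "C(=O)OC"),
    "n_methylation": ("N", "N(C)"),
}

Action = Tuple[str, str]  # (template name, product string)


def propose_actions(molecule: str) -> List[Action]:
    """Stand-in for the environment: return only actions backed by a matching template."""
    actions = []
    for name, (pattern, replacement) in TOY_TEMPLATES.items():
        if pattern in molecule:
            actions.append((name, molecule.replace(pattern, replacement, 1)))
    return actions


def toy_oracle(molecule: str) -> float:
    """Hypothetical property oracle: here, just the number of 'N' characters."""
    return float(molecule.count("N"))


def greedy_policy(molecule: str, actions: List[Action]) -> Action:
    """Stand-in for the trained policy: pick the action with the best immediate score."""
    return max(actions, key=lambda action: toy_oracle(action[1]))


def rollout(
    start: str,
    policy: Callable[[str, List[Action]], Action],
    oracle: Callable[[str], float],
    max_steps: int = 3,
) -> Tuple[str, List[Action], float]:
    """Run one trajectory; every step is recorded, so the route is explicit."""
    molecule, route = start, []
    for _ in range(max_steps):
        actions = propose_actions(molecule)
        if not actions:  # no template matches, so the trajectory ends
            break
        chosen = policy(molecule, actions)
        route.append(chosen)
        molecule = chosen[1]
    return molecule, route, oracle(molecule)
```

Because the policy can only pick from `propose_actions`, every step in the returned route is tied to a named template, which is the property the review attributes to MolReAct's constrained action space.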

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the reaction templates and tool calls remain reliable on novel molecular scaffolds, the same trained policy could be reused across additional property objectives without retraining.
  • The explicit template grounding opens the possibility of feeding the proposed synthetic steps directly into automated synthesis planners or experimental validation loops.
  • Because the action space shrinks dramatically at each step, longer optimization trajectories become computationally tractable compared with fully generative approaches.
  • The caching of SMILES evaluations suggests that performance gains could compound when the same intermediates appear across multiple independent optimization runs.

Load-bearing premise

The LLM agent must correctly identify all relevant reactive sites and functional groups and then propose a complete, valid collection of transformations from the templates without missing productive reactions or suggesting invalid ones.

What would settle it

Running the system on a new set of molecules where the LLM either proposes a chemically invalid transformation or omits a known productive reaction route, producing final molecules whose property scores fall below those obtained by an unconstrained generative baseline.

Figures

Figures reproduced from arXiv: 2604.07669 by Carl Yang, Kaiyuan Hou, Monika Raj, Tao Li, Tuan Vinh, Zhichun Guo.

Figure 1: Overview of MolReAct. The reaction environment performs template matching and tool …
Figure 2: Ablation on tool-guided proposal and policy optimization across target activity tasks.
    … is standardized within the group by subtracting the group mean and dividing by the group standard deviation to obtain a group-relative advantage. This advantage is then assigned to every step in the corresponding trajectory, enabling trajectory-level credit assignment. The policy is then updated by maximizing the standar…
Figure 3: Building block analysis. To evaluate whether the building blocks proposed by MolReAct can be readily obtained from commercial suppliers, we perform a post-hoc availability analysis on the four protein-target activity tasks. Using the Enamine building block catalog (∼2.1M compounds) as a reference [51], we apply an exact-match filter during evaluation that retains a proposed reaction when all of its non-i…
Figure 4: Representative synthetic pathways discovered by MolReAct on four protein-target activity …
Figure 5: Top-10, Top-30, and Top-50 scores vs. oracle calls on four protein-target activity tasks.
Figure 6: Distribution of valid reactions proposed per query during training on the sEH task. The …
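The text attached to Figure 2 describes a group-relative advantage: each trajectory's reward is standardized against its group, and the result is copied to every step of that trajectory. A minimal sketch of that computation follows. The function names, the use of the population standard deviation, and the small `eps` guard against a zero-variance group are assumptions made for illustration; the excerpt does not specify them.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize trajectory rewards within one group: (r - mean) / (std + eps).

    Assumption: population standard deviation and an eps term; the source only
    says "dividing by the group standard deviation".
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def per_step_advantages(rewards: List[float], steps_per_trajectory: List[int]) -> List[List[float]]:
    """Assign each trajectory's advantage to every step of that trajectory."""
    advantages = group_relative_advantages(rewards)
    return [[adv] * n_steps for adv, n_steps in zip(advantages, steps_per_trajectory)]
```

With this scheme, a trajectory is credited as a whole: all of its steps share one advantage value, which is positive only if the trajectory beat the group average.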
Original abstract

Lead optimization in drug discovery requires improving therapeutic properties while ensuring that molecular modifications correspond to feasible synthetic routes. Existing approaches either prioritize property scores without enforcing synthesizability, or rely on expensive enumeration over large reaction networks, while direct application of Large Language Models (LLMs) to molecular generation frequently produces chemically invalid structures. We introduce MolReAct, a framework that formulates lead optimization as a Markov Decision Process over a synthesis-constrained action space defined by validated reaction templates. A tool-augmented LLM agent serves as a dynamic reaction environment, invoking specialized chemical analysis tools to identify reactive sites and functional groups and proposing a compact set of chemically grounded transformations from matched templates. A dedicated policy model trained via Group Relative Policy Optimization (GRPO) selects among these constrained actions to maximize long-term oracle reward across multi-step trajectories, with a SMILES-based caching mechanism reducing end-to-end optimization time by approximately 43%. Across 13 property optimization tasks from the Therapeutic Data Commons and one structure-based docking task, MolReAct achieves an average Top-10 score of 0.571, the highest among all baselines, ranking first or second on 13 of 14 tasks and attaining the best sample efficiency on 9 of 14 tasks. By grounding every optimization step in validated reaction templates, MolReAct produces molecules that are not only property-improved but each accompanied by an explicit template-grounded synthetic pathway.
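The abstract credits a SMILES-based cache with cutting end-to-end time by roughly 43% but, as quoted here, does not say exactly which calls are cached. The sketch below assumes the simplest reading: an expensive function keyed by SMILES string is memoized so repeated molecules are not re-evaluated. `CachedOracle` is a hypothetical name, and the sketch assumes strings are already canonicalized upstream (two different SMILES for the same molecule would otherwise miss the cache).

```python
from typing import Callable, Dict


class CachedOracle:
    """Memoize an expensive SMILES -> score function and count hits and misses."""

    def __init__(self, oracle: Callable[[str], float]) -> None:
        self._oracle = oracle
        self._cache: Dict[str, float] = {}
        self.hits = 0
        self.misses = 0

    def __call__(self, smiles: str) -> float:
        if smiles in self._cache:
            self.hits += 1
        else:
            # First time this string is seen: pay for the real evaluation once.
            self.misses += 1
            self._cache[smiles] = self._oracle(smiles)
        return self._cache[smiles]
```

Tracking `hits` and `misses` makes it easy to report how much of a run was served from the cache, which is the kind of per-task breakdown the referee's first minor comment asks for.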

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MolReAct, a framework that formulates lead optimization as an MDP over a synthesis-constrained action space. A tool-augmented LLM agent uses chemical analysis tools to identify reactive sites and functional groups, then proposes transformations from matched reaction templates. A policy trained with Group Relative Policy Optimization (GRPO) selects actions to maximize long-term oracle reward, with SMILES caching for efficiency. On 13 Therapeutic Data Commons property optimization tasks plus one docking task, it reports the highest average Top-10 score of 0.571, ranking first or second on 13 of 14 tasks and best sample efficiency on 9 of 14, while guaranteeing each output molecule has an explicit template-grounded synthetic pathway.

Significance. If the LLM agent reliably produces complete and valid action spaces, the approach could meaningfully advance practical synthesizable molecular optimization by combining LLM chemical reasoning with RL long-horizon planning, offering better sample efficiency than exhaustive enumeration while avoiding the invalid structures common in unconstrained LLM generation.

major comments (2)
  1. [Abstract] Abstract: The central empirical claims (average Top-10 score of 0.571, first/second ranking on 13/14 tasks, best sample efficiency on 9/14 tasks) rest on the action space being defined entirely by the tool-augmented LLM's template proposals, yet no quantitative coverage metric (recall of all template-applicable reactions, false-negative rate on reactive sites, or inter-run consistency) is supplied; this is load-bearing because an incomplete action space would make performance gains potentially attributable to reduced branching factor rather than superior planning via GRPO, undermining both the synthesizability guarantee and the efficiency interpretation.
  2. [Abstract] Abstract and Results: The ranking and efficiency superiority claims require explicit details on baseline implementations, statistical testing procedures, controls for data leakage, and how reaction template coverage was verified; without these, the reported outperformance cannot be fully verified as robust.
minor comments (2)
  1. The 43% time reduction from the SMILES-based caching mechanism should be accompanied by per-task timing tables and direct comparisons to baseline runtimes for clarity.
  2. All acronyms (GRPO, TDC, MDP) should be expanded on first use in the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and will incorporate clarifications and additional analyses in a revised version to strengthen the empirical support for our claims.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central empirical claims (average Top-10 score of 0.571, first/second ranking on 13/14 tasks, best sample efficiency on 9/14 tasks) rest on the action space being defined entirely by the tool-augmented LLM's template proposals, yet no quantitative coverage metric (recall of all template-applicable reactions, false-negative rate on reactive sites, or inter-run consistency) is supplied; this is load-bearing because an incomplete action space would make performance gains potentially attributable to reduced branching factor rather than superior planning via GRPO, undermining both the synthesizability guarantee and the efficiency interpretation.

    Authors: We appreciate the referee pointing out the need for quantitative coverage metrics. The synthesizability guarantee applies to each output molecule, as every action is drawn from a validated reaction template proposed by the LLM agent, providing an explicit template-grounded pathway. We agree, however, that metrics on coverage would help rule out reduced branching factor as the sole driver of gains. In revision we will add a dedicated analysis: on a random subset of 100 starting molecules per task, we will exhaustively enumerate all template-applicable reactions using RDKit and compare against the LLM agent's proposals to compute recall and false-negative rates on reactive sites. We will also report inter-run consistency by executing the agent five times on the same inputs and measuring overlap in proposed actions. These results will be presented alongside the main experiments to support that performance differences reflect GRPO planning rather than action-space size alone. revision: yes

  2. Referee: [Abstract] Abstract and Results: The ranking and efficiency superiority claims require explicit details on baseline implementations, statistical testing procedures, controls for data leakage, and how reaction template coverage was verified; without these, the reported outperformance cannot be fully verified as robust.

    Authors: We agree that greater transparency on these implementation and verification details is required. In the revised manuscript we will expand the Methods and Experimental Setup sections with: (i) full specifications of each baseline (including code repositories used, any modifications to original implementations, and hyperparameter choices); (ii) statistical procedures (multiple independent runs with reported means, standard deviations, and paired Wilcoxon signed-rank tests with p-values for ranking comparisons); (iii) explicit statement that the 13 TDC tasks use publicly released benchmark splits with no overlap to any pre-training data for the policy network or the LLM; and (iv) our template-coverage verification protocol, which combined automated matching against the USPTO-derived template library with manual review of 200 randomly sampled LLM-proposed reactions by two co-authors with chemistry backgrounds. These additions will enable independent verification of the reported Top-10 scores, rankings, and sample-efficiency results. revision: yes
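The coverage analysis promised in the first response (recall against exhaustive enumeration, false-negative rate, and inter-run overlap) reduces to simple set arithmetic once each action is given a stable identifier. The sketch below is one way to compute it. The function names are hypothetical, and the choice of mean pairwise Jaccard similarity as the "overlap" measure is an assumption, since the rebuttal does not name a specific statistic.

```python
from itertools import combinations
from typing import List, Set, Tuple


def recall_and_fnr(proposed: Set[str], enumerated: Set[str]) -> Tuple[float, float]:
    """Recall of agent proposals against an exhaustive reference set, plus 1 - recall."""
    if not enumerated:
        raise ValueError("reference set of template-applicable actions is empty")
    recall = len(proposed & enumerated) / len(enumerated)
    return recall, 1.0 - recall


def mean_pairwise_jaccard(runs: List[Set[str]]) -> float:
    """Inter-run consistency: average Jaccard similarity over all pairs of runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to measure consistency")

    def jaccard(a: Set[str], b: Set[str]) -> float:
        union = a | b
        return len(a & b) / len(union) if union else 1.0

    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

For example, if exhaustive enumeration finds four applicable actions and the agent proposes two of them, recall is 0.5; five repeated runs on the same molecule would be passed to `mean_pairwise_jaccard` as a list of five sets.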

Circularity Check

0 steps flagged

No circularity; derivation is self-contained against external benchmarks

Full rationale

The paper defines an MDP whose action space is constructed by a tool-augmented LLM agent that matches reaction templates, then trains a policy via GRPO to maximize oracle rewards on Therapeutic Data Commons tasks and a docking task. All reported metrics (Top-10 scores, sample efficiency) are computed on held-out external oracles and datasets; no equation or result is obtained by fitting a parameter to a subset and relabeling it as a prediction, no self-citation supplies a load-bearing uniqueness theorem, and no ansatz is smuggled in. The central performance claims therefore rest on independent empirical evaluation rather than any definitional or self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim depends on the coverage and validity of external reaction templates plus reliable LLM tool behavior for action proposal; these are domain assumptions not derived within the paper.

axioms (2)
  • [domain assumption] Molecules can be faithfully represented and modified via SMILES strings and a fixed library of validated reaction templates.
    Invoked to define the synthesis-constrained action space in the MDP formulation.
  • [ad hoc to paper] The tool-augmented LLM can accurately detect reactive sites and functional groups to propose only valid transformations.
    Required for the dynamic reaction environment to generate the compact action set at each step.
invented entities (1)
  • MolReAct framework [no independent evidence]
    purpose: Integrates LLM-guided action proposal with GRPO policy optimization for synthesizable molecular trajectories.
    New composite method introduced by the paper; no independent evidence provided beyond the reported benchmarks.



Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 1 internal anchor

  1. Kirk E. Hevener, Russell Pesavento, JinHong Ren, Hyun Lee, Kiira Ratia, and Michael E. Johnson. Chapter twelve - hit-to-lead: Hit validation and assessment. In Modern Approaches in Drug Discovery, volume 610, pages 265–309. Academic Press, 2018.

  2. Diane Joseph-McCarthy, J. Christian Baber, Eric Feyfant, David C. Thompson, and Christine Humblet. Lead optimization via high-throughput molecular docking. Current Opinion in Drug Discovery & Development, 2007.

  3. György M. Keserü and Gergely M. Makara. The influence of lead discovery strategies on the properties of drug candidates. Nature Reviews Drug Discovery, 2009.

  4. Odin Zhang, Haitao Lin, Hui Zhang, Huifeng Zhao, Yufei Huang, Chang-Yu Hsieh, Peichen Pan, and Tingjun Hou. Deep lead optimization: Leveraging generative ai for structural modification. Journal of the American Chemical Society, 146(46):31357–31370, 2024.

  5. Sven M. Papidocha, Andreas Burger, Varinia Bernales, and Alán Aspuru-Guzik. The elephant in the lab: synthesizability in generative small-molecule design. Current Opinion in Chemical Engineering, 51:101217, 2026. ISSN 2211-3398.

  6. Raj Ghugare, Santiago Miret, Adriana Hugessen, Mariano Phielipp, and Glen Berseth. Searching for high-value molecules using reinforcement learning and transformers. In Proceedings of the International Conference on Learning Representations, 2024.

  7. Yuanxin Zhuang, Dazhong Shen, and Ying Sun. MoleditRL: Structure-preserving molecular editing via discrete diffusion and reinforcement learning. In The Fourteenth International Conference on Learning Representations, 2026.

  8. Xiuyuan Hu, Guoqing Liu, Yang Zhao, and Hao Zhang. De novo drug design using reinforcement learning with multiple gpt agents. In Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023.

  9. Jinyeong Park, Jaegyoon Ahn, Jonghwan Choi, and Jibum Kim. Mol-air: Molecular reinforcement learning with adaptive intrinsic rewards for goal-directed molecular generation. Journal of Chemical Information and Modeling, 65(5):2283–2296, 2025.

  10. Ruheng Wang, Hang Zhang, Trieu Nguyen, Shasha Feng, Hao-Wei Pang, Xiang Yu, Li Xiao, and Peter Zhiping Zhang. Pepthink-r1: LLM for interpretable cyclic peptide optimization with cot SFT and reinforcement learning. In NeurIPS 2025 AI for Science Workshop, 2025.

  11. Jan H. Jensen. A graph-based genetic algorithm and generative model/monte carlo tree search for the exploration of chemical space. Chemical Science, 10(12):3567–3572, 2019.

  12. Haorui Wang, Marta Skreta, Cher-Tian Ser, Wenhao Gao, Lingkai Kong, Felix Strieth-Kalthoff, Chenru Duan, Yuchen Zhuang, Yue Yu, Yanqiao Zhu, Yuanqi Du, Alán Aspuru-Guzik, Kirill Neklyudov, and Chao Zhang. Efficient evolutionary search over chemical space with large language models. In Proceedings of the International Conference on Learning Representations, 2025.

  13. Vishal Dey, Xiao Hu, and Xia Ning. GeLLM³O: Generalizing large language models for multi-property molecule optimization. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2025.

  14. Geyan Ye, Xibao Cai, Houtim Lai, Xing Wang, Junhong Huang, Longyue Wang, Wei Liu, and Xiangxiang Zeng. Drugassist: a large language model for molecule optimization. Briefings in Bioinformatics, 26(1):bbae693, 01 2025.

  15. Jinho Chang and Jong Chul Ye. Ldmol: A text-to-molecule diffusion model with structurally informative latent space surpasses ar models. International Conference on Machine Learning, 2025.

  16. Seul Lee, Karsten Kreis, Srimukh Prasad Veccham, Meng Liu, Danny Reidenbach, Saee Gopal Paliwal, Weili Nie, and Arash Vahdat. Exploring synthesizable chemical space with iterative pathway refinements. In The Fourteenth International Conference on Learning Representations, 2026.

  17. Kyle Swanson, Gary Liu, Denise B. Catacutan, Autumn Arnold, James Zou, and Jonathan M. Stokes. Generative ai for designing and validating easily synthesizable and structurally novel antibiotics. Nature Machine Intelligence, 6:338–353, 2024.

  18. Shogo Nakamura, Nobuaki Yasuo, and Masakazu Sekijima. Molecular optimization using a conditional transformer for reaction-aware compound exploration with reinforcement learning. Communications Chemistry, 8(40), 2025.

  19. Carl Edwards, Chi Han, Gawon Lee, Thao Nguyen, Sara Szymkuć, Chetan Kumar Prasad, Bowen Jin, Jiawei Han, Ying Diao, Ge Liu, Hao Peng, Bartosz Andrzej Grzybowski, Martin D. Burke, and Heng Ji. mCLM: A modular chemical language model that generates functional and makeable molecules. In The Fourteenth International Conference on Learning Representations, 2026.

  20. Aryan Pedawi, Paweł Gniewek, Chaoyi Chang, Brandon M. Anderson, and Henry van den Bedem. An efficient graph generative model for navigating ultra-large combinatorial synthesis libraries. In Proceedings of the 36th International Conference on Neural Information Processing Systems. Curran Associates Inc., 2022.

  21. Ksenia Korovina, Sailun Xu, Kirthevasan Kandasamy, Willie Neiswanger, Barnabas Poczos, Jeff Schneider, and Eric Xing. Chembo: Bayesian optimization of small organic molecules with synthesizable recommendations. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pages 3393–3403. PMLR, 2020.

  22. Yiheng Zhu, Jialu Wu, Chaowen Hu, Jiahuan Yan, Chang-Yu Hsieh, Tingjun Hou, and Jian Wu. Sample-efficient multi-objective molecular optimization with gflownets. In Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023.

  23. Michał Koziarski, Andrei Rekesh, Dmytro Shevchuk, Almer van der Sloot, Piotr Gaiński, Yoshua Bengio, Cheng-Hao Liu, Mike Tyers, and Robert A. Batey. Rgfn: synthesizable molecular generation using gflownets. In Proceedings of the 38th International Conference on Neural Information Processing Systems. Curran Associates Inc., 2024.

  24. Miruna Cretu, Charles Harris, Ilia Igashov, Arne Schneuing, Marwin Segler, Bruno Correia, Julien Roy, Emmanuel Bengio, and Pietro Lio. Synflownet: Design of diverse and novel molecules with synthesis constraints. In The Thirteenth International Conference on Learning Representations, 2025.

  25. Seonghwan Seo, Minsu Kim, Tony Shen, Martin Ester, Jinkyoo Park, Sungsoo Ahn, and Woo Youn Kim. Generative flows on synthetic pathway for drug design. In The Thirteenth International Conference on Learning Representations, 2025.

  26. Mengying Sun, Jing Xing, Han Meng, Huijun Wang, Bin Chen, and Jiayu Zhou. Molsearch: Search-based multi-objective molecular generation and property optimization. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, 2022.

  27. Wenhao Gao, Shitong Luo, and Connor W. Coley. Generative artificial intelligence for navigating synthesizable chemical space. Proceedings of the National Academy of Sciences, 122(41):e2415665122, 2025.

  28. Michael Sun, Alston Lo, Minghao Guo, Jie Chen, Connor W. Coley, and Wojciech Matusik. Procedural synthesis of synthesizable molecules. In The Thirteenth International Conference on Learning Representations, 2025.

  29. Shitong Luo, Wenhao Gao, Zuofan Wu, Jian Peng, Connor W. Coley, and Jianzhu Ma. Projecting molecules into synthesizable chemical spaces. In Proceedings of the 41st International Conference on Machine Learning. JMLR.org, 2024.

  30. Kunyang Sun, Dorian Bagni, Joseph M. Cavanagh, Yingze Wang, Jacob M. Sawyer, Bo Zhou, Andrew Gritsevskiy, Oufan Zhang, and Teresa Head-Gordon. Synllama: Generating synthesizable molecules and their analogs with large language models. ACS Central Science, 11(11):2108–2120, 2025.

  31. Taicheng Guo, Kehan Guo, Bozhao Nan, Zhenwen Liang, Zhichun Guo, Nitesh V. Chawla, Olaf Wiest, and Xiangliang Zhang. What can large language models do in chemistry? a comprehensive benchmark on eight tasks. In Proceedings of the 37th International Conference on Neural Information Processing Systems. Curran Associates Inc., 2023.

  32. Kevin Maik Jablonka, Philippe Schwaller, Andres Ortega-Guerrero, and Berend Smit. Leveraging large language models for predictive chemistry. Nature Machine Intelligence, 2024.

  33. Jinyoung Park, Minseong Bae, Dohwan Ko, and Hyunwoo J. Kim. LLamo: Large language model-based molecular graph assistant. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

  34. Kehan Guo, Bozhao Nan, Yujun Zhou, Taicheng Guo, Zhichun Guo, Mihir Surve, Zhenwen Liang, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang. Can LLMs solve molecule puzzles? a multimodal benchmark for molecular structure elucidation. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024.

  35. Yin Fang, Xiaozhuan Liang, Ningyu Zhang, Kangwei Liu, Rui Huang, Zhuo Chen, Xiaohui Fan, and Huajun Chen. Mol-instructions: A large-scale biomolecular instruction dataset for large language models. In The Twelfth International Conference on Learning Representations, 2024.

  36. Wen Tao, Jing Tang, Alvin Chan, Bryan Hooi, Baolong Bi, Nanyun Peng, Yuansheng Liu, and Yiwei Wang. How to make large language models generate 100% valid molecules? In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, November 2025.

  37. A. M. Bran, S. Cox, O. Schilter, et al. Augmenting large language models with chemistry tools. Nature Machine Intelligence, 6:525–535, 2024.

  38. Hyomin Kim, Yunhui Jang, and Sungsoo Ahn. MT-mol: Multi agent system with tool-based reasoning for molecular optimization. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computational Linguistics: EMNLP 2025. Association for Computational Linguistics, November 2025.

  39. Yue Huang, Zhengzhe Jiang, Xiaonan Luo, Kehan Guo, Haomin Zhuang, Yujun Zhou, Zhengqing Yuan, Xiaoqi Sun, Jules Schleinitz, Yanbo Wang, Shuhao Zhang, Mihir Surve, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang. Chemorch: Empowering LLMs with chemical intelligence via groundbreaking synthetic instructions. In The Thirty-ninth Annual Conference on Neural ...

  40. Kexin Huang, Tianfan Fu, Wenhao Gao, Yue Zhao, Yusuf H Roohani, Jure Leskovec, Connor W. Coley, Cao Xiao, Jimeng Sun, and Marinka Zitnik. Therapeutics data commons: Machine learning datasets and tasks for drug discovery and development. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021.

  41. Emmanuel Bengio, Moksh Jain, Maksym Korablyov, Doina Precup, and Yoshua Bengio. Flow network based generative models for non-iterative diverse candidate generation. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, pages 27381–27394. Curran Associates, Inc., 2021.

  42. Mingyang Wang, Shuai Li, Jike Wang, Odin Zhang, Hongyan Du, Dejun Jiang, Zhenxing Wu, Yafeng Deng, Yu Kang, Peichen Pan, et al. Clickgen: Directed exploration of synthesizable chemical space via modular reactions and reinforcement learning. Nature communications, 15(1):10127, 2024.

  43. Haorui Wang, Jeff Guo, Lingkai Kong, Rampi Ramprasad, Philippe Schwaller, Yuanqi Du, and Chao Zhang. LLM-augmented chemical synthesis and design decision programs. In Forty-second International Conference on Machine Learning, 2025.

  44. Wei Liu, Jiangtao Feng, Hongli Yu, Yuxuan Song, Yuqiang Li, Shufei Zhang, Lei Bai, Wei-Ying Ma, and Hao Zhou. Retro-r1: LLM-based agentic retrosynthesis. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

  45. Shunyu Yao, Jeffrey Zhao, Dian Yu, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In NeurIPS 2022 Foundation Models for Decision Making Workshop, 2022.

  46. Wenhao Gao, Tianfan Fu, Jimeng Sun, and Connor W. Coley. Sample efficiency matters: a benchmark for practical molecular optimization. In Proceedings of the 36th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2022. Curran Associates Inc.

  47. John J. Irwin, Teague Sterling, Michael M. Mysinger, Erin S. Bolstad, and Ryan G. Coleman. ZINC: A free tool to discover chemistry for biology. Journal of Chemical Information and Modeling, 2012.

  48. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. The llama 3 herd of models, 2024.

  49. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, New York, NY, USA, 2023. Association for Computing Machinery.

  50. An Yang, Anfeng Li, Baosong Yang, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

  51. Enamine. Building blocks catalog, 2023. URL https://enamine.net/building-blocks/building-blocks-catalog

  52. Oleg Trott and Arthur J. Olson. AutoDock Vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. Journal of Computational Chemistry, 31(2):455–461, 2010.

  53. Greg Landrum et al. Rdkit: Open-source cheminformatics software, 2016. URL http://www.rdkit.org/. https://github.com/rdkit/rdkit

  54. Harrison Chase. Langchain, 2022. URL https://github.com/langchain-ai/langchain.

A Implementation Details

A.1 GRPO Training Hyperparameters

The policy model is Qwen3-4B-Instruct, trained with GRPO using the MemoryEfficientAdamW optimizer on a single NVIDIA RTX 6000 Ada GPU (48 GB). Table 3 summarizes the hyperparameters shared across all 14 benchmark ta...