Reinforcement Learning with LLM-Guided Action Spaces for Synthesizable Lead Optimization
Pith reviewed 2026-05-10 18:22 UTC · model grok-4.3
The pith
MolReAct uses an LLM agent to define only chemically valid reaction steps as the action space for reinforcement learning in molecular lead optimization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MolReAct formulates lead optimization as a Markov Decision Process whose action space is generated on the fly by a tool-augmented LLM agent that invokes chemical analysis tools to locate reactive sites and then proposes a compact set of chemically grounded transformations from validated reaction templates; a policy trained via Group Relative Policy Optimization selects actions to maximize cumulative oracle reward, and a SMILES caching layer speeds up repeated evaluations.
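The "group relative" part of GRPO can be illustrated with a short sketch. It assumes the commonly used formulation in which the rewards of several actions sampled for the same state are standardized by the group's mean and standard deviation. The function name and the `eps` constant are illustrative, and the paper's full objective (clipping, any KL term) is not reproduced here.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled actions (hypothetical helper)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    # eps avoids division by zero when every reward in the group is identical
    return [(r - mu) / (sigma + eps) for r in rewards]

# Placeholder oracle rewards for four candidate transformations of one molecule.
advantages = group_relative_advantages([0.2, 0.4, 0.4, 0.6])
```

Actions that score above the group mean receive a positive advantage and those below receive a negative one, so no separate value network is needed to form a baseline.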
What carries the argument
The tool-augmented LLM agent that acts as the dynamic reaction environment by matching the current molecule against reaction templates and emitting only a small set of valid transformations to serve as the constrained action space for the reinforcement learning policy.
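A toy sketch of a state-dependent action space may make this concrete. Everything in it is hypothetical: molecules are reduced to sets of functional-group labels, `TEMPLATES` is an invented three-entry table, and a plain membership test stands in for the LLM agent and its chemical analysis tools. It shows only the control flow, not real chemistry.

```python
# Toy reaction templates: name -> (group consumed, group produced). Invented for illustration.
TEMPLATES = {
    "amide_coupling": ("carboxylic_acid", "amide"),
    "suzuki_coupling": ("aryl_halide", "biaryl"),
    "ester_hydrolysis": ("ester", "carboxylic_acid"),
}

def propose_actions(groups):
    """Stand-in for the agent: return only templates whose required group is present."""
    return sorted(name for name, (needed, _) in TEMPLATES.items() if needed in groups)

def apply_action(groups, action):
    """Apply a template by swapping the consumed group for the produced one."""
    needed, produced = TEMPLATES[action]
    return (groups - {needed}) | {produced}

state = frozenset({"ester", "aryl_halide"})
actions = propose_actions(state)              # the action space depends on the current state
next_state = apply_action(state, actions[0])  # a policy would choose; here we take the first
```

After the hydrolysis step the amide coupling becomes available, which is the sense in which the action space is generated on the fly rather than fixed in advance.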
Where Pith is reading between the lines
- If the reaction templates and tool calls remain reliable on novel molecular scaffolds, the same trained policy could be reused across additional property objectives without retraining.
- The explicit template grounding opens the possibility of feeding the proposed synthetic steps directly into automated synthesis planners or experimental validation loops.
- Because the action space shrinks dramatically at each step, longer optimization trajectories become computationally tractable compared with fully generative approaches.
- The caching of SMILES evaluations suggests that performance gains could compound when the same intermediates appear across multiple independent optimization runs.
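The caching idea in the last point can be sketched as a memoizing wrapper around an oracle. This is a minimal illustration, not the paper's mechanism: the class name is invented, the placeholder oracle simply scores string length, and keys are raw SMILES strings. A real system would probably canonicalize SMILES first so that different spellings of the same molecule share one entry; whether MolReAct does so is not stated here.

```python
class CachedOracle:
    """Wrap an expensive scoring function with a dictionary keyed on SMILES strings."""

    def __init__(self, score_fn):
        self.score_fn = score_fn
        self.cache = {}
        self.calls = 0  # number of real (uncached) evaluations

    def __call__(self, smiles):
        if smiles not in self.cache:
            self.calls += 1
            self.cache[smiles] = self.score_fn(smiles)
        return self.cache[smiles]

# Placeholder oracle: string length scaled into [0, 1]. A real oracle would be a property predictor.
oracle = CachedOracle(lambda s: min(len(s) / 20.0, 1.0))
scores = [oracle(s) for s in ["CCO", "c1ccccc1", "CCO"]]
```

Sharing one `CachedOracle` instance across independent runs is what would let repeated intermediates be scored only once.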
Load-bearing premise
The LLM agent must correctly identify all relevant reactive sites and functional groups and then propose a complete, valid collection of transformations from the templates without missing productive reactions or suggesting invalid ones.
What would settle it
Running the system on a new set of molecules where the LLM either proposes a chemically invalid transformation or omits a known productive reaction route, producing final molecules whose property scores fall below those obtained by an unconstrained generative baseline.
Original abstract
Lead optimization in drug discovery requires improving therapeutic properties while ensuring that molecular modifications correspond to feasible synthetic routes. Existing approaches either prioritize property scores without enforcing synthesizability, or rely on expensive enumeration over large reaction networks, while direct application of Large Language Models (LLMs) to molecular generation frequently produces chemically invalid structures. We introduce MolReAct, a framework that formulates lead optimization as a Markov Decision Process over a synthesis-constrained action space defined by validated reaction templates. A tool-augmented LLM agent serves as a dynamic reaction environment, invoking specialized chemical analysis tools to identify reactive sites and functional groups and proposing a compact set of chemically grounded transformations from matched templates. A dedicated policy model trained via Group Relative Policy Optimization (GRPO) selects among these constrained actions to maximize long-term oracle reward across multi-step trajectories, with a SMILES-based caching mechanism reducing end-to-end optimization time by approximately 43%. Across 13 property optimization tasks from the Therapeutic Data Commons and one structure-based docking task, MolReAct achieves an average Top-10 score of 0.571, the highest among all baselines, ranking first or second on 13 of 14 tasks and attaining the best sample efficiency on 9 of 14 tasks. By grounding every optimization step in validated reaction templates, MolReAct produces molecules that are not only property-improved but each accompanied by an explicit template-grounded synthetic pathway.
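For readers unfamiliar with the headline metric, the sketch below assumes that a Top-10 score is the mean of the ten highest oracle scores among the unique molecules evaluated in a run. That reading is common in molecular optimization benchmarks, but the abstract does not define the metric, so the paper may use a variant (for example an area-under-curve version). The function name and the scores are placeholders.

```python
import heapq

def top_k_mean(scores_by_smiles, k=10):
    """Mean of the k highest scores among unique molecules (fewer if under k exist)."""
    best = heapq.nlargest(k, scores_by_smiles.values())
    return sum(best) / len(best) if best else 0.0

# Placeholder scores 0.00, 0.05, ..., 0.95 for twenty distinct molecules.
evaluated = {f"mol_{i}": i / 20 for i in range(20)}
top10 = top_k_mean(evaluated)
```

Keying the dictionary on the molecule identifier ensures that a molecule generated several times contributes only once.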
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MolReAct, a framework that formulates lead optimization as an MDP over a synthesis-constrained action space. A tool-augmented LLM agent uses chemical analysis tools to identify reactive sites and functional groups, then proposes transformations from matched reaction templates. A policy trained with Group Relative Policy Optimization (GRPO) selects actions to maximize long-term oracle reward, with SMILES caching for efficiency. On 13 Therapeutics Data Commons property optimization tasks plus one docking task, it reports the highest average Top-10 score of 0.571, ranking first or second on 13 of 14 tasks and achieving the best sample efficiency on 9 of 14, while guaranteeing that each output molecule has an explicit template-grounded synthetic pathway.
Significance. If the LLM agent reliably produces complete and valid action spaces, the approach could meaningfully advance practical synthesizable molecular optimization by combining LLM chemical reasoning with RL long-horizon planning, offering better sample efficiency than exhaustive enumeration while avoiding the invalid structures common in unconstrained LLM generation.
major comments (2)
- [Abstract] Abstract: The central empirical claims (average Top-10 score of 0.571, first/second ranking on 13/14 tasks, best sample efficiency on 9/14 tasks) rest on the action space being defined entirely by the tool-augmented LLM's template proposals, yet no quantitative coverage metric (recall of all template-applicable reactions, false-negative rate on reactive sites, or inter-run consistency) is supplied; this is load-bearing because an incomplete action space would make performance gains potentially attributable to reduced branching factor rather than superior planning via GRPO, undermining both the synthesizability guarantee and the efficiency interpretation.
- [Abstract] Abstract and Results: The ranking and efficiency superiority claims require explicit details on baseline implementations, statistical testing procedures, controls for data leakage, and how reaction template coverage was verified; without these, the reported outperformance cannot be fully verified as robust.
minor comments (2)
- The 43% time reduction from the SMILES-based caching mechanism should be accompanied by per-task timing tables and direct comparisons to baseline runtimes for clarity.
- All acronyms (GRPO, TDC, MDP) should be expanded on first use in the main text.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and will incorporate clarifications and additional analyses in a revised version to strengthen the empirical support for our claims.
Referee: [Abstract] Abstract: The central empirical claims (average Top-10 score of 0.571, first/second ranking on 13/14 tasks, best sample efficiency on 9/14 tasks) rest on the action space being defined entirely by the tool-augmented LLM's template proposals, yet no quantitative coverage metric (recall of all template-applicable reactions, false-negative rate on reactive sites, or inter-run consistency) is supplied; this is load-bearing because an incomplete action space would make performance gains potentially attributable to reduced branching factor rather than superior planning via GRPO, undermining both the synthesizability guarantee and the efficiency interpretation.
Authors: We appreciate the referee pointing out the need for quantitative coverage metrics. The synthesizability guarantee applies to each output molecule, as every action is drawn from a validated reaction template proposed by the LLM agent, providing an explicit template-grounded pathway. We agree, however, that metrics on coverage would help rule out reduced branching factor as the sole driver of gains. In revision we will add a dedicated analysis: on a random subset of 100 starting molecules per task, we will exhaustively enumerate all template-applicable reactions using RDKit and compare against the LLM agent's proposals to compute recall and false-negative rates on reactive sites. We will also report inter-run consistency by executing the agent five times on the same inputs and measuring overlap in proposed actions. These results will be presented alongside the main experiments to support that performance differences reflect GRPO planning rather than action-space size alone. revision: yes
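The coverage analysis promised above can be sketched with plain set arithmetic. The sketch assumes that each applicable reaction is represented by a hashable identifier and that "overlap" between repeated runs is measured with the Jaccard index; both are assumptions made for illustration, and the identifiers shown are placeholders rather than real templates.

```python
def coverage_metrics(enumerated, proposed):
    """Recall and false-negative rate of proposals against an exhaustive enumeration."""
    enumerated, proposed = set(enumerated), set(proposed)
    hits = enumerated & proposed
    recall = len(hits) / len(enumerated) if enumerated else 1.0
    return {
        "recall": recall,
        "false_negative_rate": 1.0 - recall,
        "not_in_enumeration": len(proposed - enumerated),  # proposals the enumeration lacks
    }

def mean_pairwise_jaccard(runs):
    """Average Jaccard overlap between action sets from repeated agent runs."""
    sets = [set(r) for r in runs]
    pairs = [(a, b) for i, a in enumerate(sets) for b in sets[i + 1:]]
    if not pairs:
        return 1.0
    return sum(len(a & b) / len(a | b) if a | b else 1.0 for a, b in pairs) / len(pairs)

metrics = coverage_metrics({"t1", "t2", "t3", "t4"}, {"t1", "t2", "t5"})
consistency = mean_pairwise_jaccard([{"t1", "t2"}, {"t1", "t2"}, {"t1", "t3"}])
```

A low recall combined with strong benchmark scores would point toward the reduced-branching-factor explanation raised by the referee, while a high recall would weaken it.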
Referee: [Abstract] Abstract and Results: The ranking and efficiency superiority claims require explicit details on baseline implementations, statistical testing procedures, controls for data leakage, and how reaction template coverage was verified; without these, the reported outperformance cannot be fully verified as robust.
Authors: We agree that greater transparency on these implementation and verification details is required. In the revised manuscript we will expand the Methods and Experimental Setup sections with: (i) full specifications of each baseline (including code repositories used, any modifications to original implementations, and hyperparameter choices); (ii) statistical procedures (multiple independent runs with reported means, standard deviations, and paired Wilcoxon signed-rank tests with p-values for ranking comparisons); (iii) an explicit statement that the 13 TDC tasks use publicly released benchmark splits with no overlap with any pre-training data for the policy network or the LLM; and (iv) our template-coverage verification protocol, which combined automated matching against the USPTO-derived template library with manual review of 200 randomly sampled LLM-proposed reactions by two co-authors with chemistry backgrounds. These additions will enable independent verification of the reported Top-10 scores, rankings, and sample-efficiency results. revision: yes
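The paired Wilcoxon signed-rank test mentioned in item (ii) can be sketched in pure Python for small samples. The version below computes an exact two-sided p-value by enumerating every sign pattern, which is feasible for around 14 task-level pairs; in practice a library routine such as `scipy.stats.wilcoxon` would normally be used instead. The scores are placeholders, not results from the paper.

```python
from itertools import product

def wilcoxon_signed_rank(x, y):
    """Exact two-sided paired Wilcoxon signed-rank test; zero differences are dropped."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    stat = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign pattern is equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        s = sum(r for r, b in zip(ranks, signs) if b)
        if min(s, total - s) <= stat:
            extreme += 1
    return stat, extreme / 2 ** n

# Placeholder per-task scores for a method and one baseline.
method = [0.61, 0.55, 0.72, 0.48, 0.66, 0.59]
baseline = [0.58, 0.50, 0.65, 0.49, 0.60, 0.51]
stat, p_value = wilcoxon_signed_rank(method, baseline)
```

With only six pairs, the smallest achievable two-sided p-value is 2/64, which illustrates why the number of tasks limits how strongly a ranking claim can be supported.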
Circularity Check
No circularity; derivation is self-contained against external benchmarks
The paper defines an MDP whose action space is constructed by a tool-augmented LLM agent that matches reaction templates, then trains a policy via GRPO to maximize oracle rewards on Therapeutics Data Commons tasks and a docking task. All reported metrics (Top-10 scores, sample efficiency) are computed on held-out external oracles and datasets; no equation or result is obtained by fitting a parameter to a subset and relabeling it as a prediction, no self-citation supplies a load-bearing uniqueness theorem, and no ansatz is smuggled in. The central performance claims therefore rest on independent empirical evaluation rather than any definitional or self-referential reduction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Molecules can be faithfully represented and modified via SMILES strings and a fixed library of validated reaction templates.
- ad hoc to paper: The tool-augmented LLM can accurately detect reactive sites and functional groups to propose only valid transformations.
invented entities (1)
- MolReAct framework: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "formulates lead optimization as a Markov Decision Process over a synthesis-constrained action space defined by validated reaction templates... GRPO selects among these constrained actions to maximize long-term oracle reward"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "tool-augmented LLM agent... proposes a compact set of chemically grounded transformations"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
A Implementation Details
A.1 GRPO Training Hyperparameters
The policy model is Qwen3-4B-Instruct, trained with GRPO using the MemoryEfficientAdamW optimizer on a single NVIDIA RTX 6000 Ada GPU (48 GB). Table 3 summarizes the hyperparameters shared across all 14 benchmark ta...