My Chemical Harness: Evolutionary Molecular Design over Synthetic Pathways with Large Language Model Agents
Pith reviewed 2026-06-27 14:23 UTC · model grok-4.3
The pith
LLM agents controlling high-level preferences in an evolutionary search over executable synthetic routes outperform both direct LLM generation and deterministic controllers on a soluble epoxide hydrolase proxy task.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By populating an evolutionary algorithm with executable synthetic pathways and restricting the LLM to high-level strategic preferences, the My Chemical Harness framework achieves state-of-the-art results on the sEH proxy task across the sEH score, synthetic accessibility score, and AiZynthFinder success rate, while deterministic chemistry tools guarantee route validity.
What carries the argument
LLM as high-level strategy controller that selects preferences over route length, move type, reaction families, motifs, and exploration pressure; deterministic code executes all route construction, validation, scoring, selection, and memory updates.
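The division of labor described above can be illustrated with a minimal sketch. Everything in it is a toy assumption rather than the paper's implementation: routes are tuples of block labels, `stub_controller` stands in for the LLM call, and `toy_score` stands in for the molecular oracle. The point is only that the controller returns preference values while deterministic code builds, validates, deduplicates, and selects routes.

```python
import random
from dataclasses import dataclass

# Hypothetical toy stand-ins: real routes would be reaction-template
# sequences over purchasable building blocks, not tuples of labels.
BLOCKS = ("A", "B", "C", "D")
MOVES = ("extend", "swap")


@dataclass(frozen=True)
class Preferences:
    """High-level knobs the controller may set; it never writes routes."""
    max_route_length: int
    move_type: str
    exploration: float  # probability of ignoring the preferred move type


def stub_controller(generation: int) -> Preferences:
    """Placeholder for the LLM call; it returns preference values only."""
    return Preferences(max_route_length=4,
                       move_type=MOVES[generation % 2],
                       exploration=0.1)


def is_valid(route: tuple, prefs: Preferences) -> bool:
    """Deterministic validation: known blocks only, bounded length."""
    return (0 < len(route) <= prefs.max_route_length
            and all(block in BLOCKS for block in route))


def mutate(route: tuple, prefs: Preferences, rng: random.Random) -> tuple:
    """Deterministic code, not the controller, constructs the child route."""
    explore = rng.random() < prefs.exploration
    move = rng.choice(MOVES) if explore else prefs.move_type
    if move == "extend":
        return route + (rng.choice(BLOCKS),)
    i = rng.randrange(len(route))
    return route[:i] + (rng.choice(BLOCKS),) + route[i + 1:]


def toy_score(route: tuple) -> float:
    """Toy oracle (assumption): rewards distinct blocks in the route."""
    return len(set(route)) / len(BLOCKS)


def evolve(generations: int = 20, pop_size: int = 6, seed: int = 0) -> list:
    rng = random.Random(seed)
    population = [(rng.choice(BLOCKS),) for _ in range(pop_size)]
    for generation in range(generations):
        prefs = stub_controller(generation)
        children = [mutate(route, prefs, rng) for route in population]
        children = [c for c in children if is_valid(c, prefs)]
        pool = set(population) | set(children)  # deduplicate routes
        # Sort by score, then by route, so ties break reproducibly.
        ranked = sorted(pool, key=lambda r: (-toy_score(r), r))
        population = ranked[:pop_size]
    return population
```

Because invalid children are filtered before selection, nothing the controller returns can place an unvalidated route in the population, which is the property the summary attributes to the framework.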
If this is right
- Molecules discovered by the method come with verified synthetic routes by construction.
- The same controller can be swapped across different molecular oracles without retraining.
- Performance gains arise from the evolutionary loop over routes rather than from any chemical knowledge inside the LLM.
- No dedicated generative model or fine-tuning step is required for the observed improvements.
Where Pith is reading between the lines
- The approach could be tested on additional oracle tasks such as docking scores or ADMET properties to check whether the same preference vocabulary remains effective.
- If the preference set proves insufficient for more complex targets, the framework would require either richer controller outputs or additional deterministic heuristics.
- Because routes are stored and deduplicated, the memory component may scale to larger design campaigns than molecule-only evolutionary methods.
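A minimal sketch of what route-level deduplication could look like, assuming (hypothetically) that a route is an ordered list of `(template, reactants)` steps and that reactant order within a step is irrelevant. The `RouteMemory` class and the template and reactant names are illustrative, not taken from the paper.

```python
from dataclasses import dataclass, field


def route_key(route):
    """Canonical key: step order matters, reactant order within a step does not."""
    return tuple((template, tuple(sorted(reactants)))
                 for template, reactants in route)


@dataclass
class RouteMemory:
    """Keeps one entry per canonical route, with its best observed score."""
    best: dict = field(default_factory=dict)  # key -> (score, route)

    def add(self, route, score: float) -> bool:
        """Store the route; return True only if it was not seen before."""
        key = route_key(route)
        is_new = key not in self.best
        if is_new or score > self.best[key][0]:
            self.best[key] = (score, route)
        return is_new

    def top(self, k: int):
        """Return the k highest-scoring stored routes."""
        ranked = sorted(self.best.values(),
                        key=lambda item: item[0], reverse=True)
        return [route for _, route in ranked[:k]]


memory = RouteMemory()
memory.add([("amide_coupling", ["acid_1", "amine_7"])], 0.4)
# Same step with reactants listed in the other order: treated as a duplicate.
memory.add([("amide_coupling", ["amine_7", "acid_1"])], 0.4)
```

Memory grows with the number of distinct routes rather than the number of evaluations, which is the scaling property the inference above relies on.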
Load-bearing premise
That high-level preferences chosen by the LLM are sufficient to produce useful search guidance even though the model never proposes or validates any actual chemical steps.
What would settle it
A controlled run on the same sEH task in which the LLM controller is replaced by random or fixed preferences and the performance metrics fall to or below those of the deterministic baseline.
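One way to set up such a control is to give every controller the same call signature, so that the search loop, seeds, and budget stay fixed and only the source of preferences changes. The sketch below is hypothetical throughout: the option values in `VOCAB` are invented, and `toy_search` is a placeholder for the real evolutionary loop, not a model of it.

```python
import random

# Hypothetical preference vocabulary; the paper's option sets are not listed.
VOCAB = {
    "move_type": ["extend", "swap", "truncate"],
    "exploration": ["low", "medium", "high"],
}


def fixed_controller(generation: int, rng: random.Random) -> dict:
    """Ablation arm: always the first option of every preference."""
    return {name: options[0] for name, options in VOCAB.items()}


def random_controller(generation: int, rng: random.Random) -> dict:
    """Ablation arm: uniform random choice for every preference."""
    return {name: rng.choice(options) for name, options in VOCAB.items()}


def toy_search(controller, rng: random.Random, generations: int = 10) -> float:
    """Placeholder for the shared loop; same generation count for every arm."""
    total = 0.0
    for generation in range(generations):
        prefs = controller(generation, rng)
        bonus = 0.1 if prefs["move_type"] == "extend" else 0.0
        total += rng.random() + bonus  # stand-in for a generation's best score
    return total / generations


def run_condition(controller, search, n_runs: int = 5, base_seed: int = 0):
    """Run one arm over the same seeds that every other arm will use."""
    return [search(controller, random.Random(base_seed + i))
            for i in range(n_runs)]


fixed_scores = run_condition(fixed_controller, toy_search)
random_scores = run_condition(random_controller, toy_search)
```

An LLM-backed controller would be a third function with the same `(generation, rng)` signature, so adding it requires no change to `run_condition` or to the search loop.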
Original abstract
Designing molecules with target properties is most useful when candidate structures are accompanied by feasible synthetic routes. We introduce My Chemical Harness, a route-native evolutionary framework for goal-directed molecular design in which the search population consists of executable synthetic pathways rather than isolated molecular graphs. Each route is built from purchasable building blocks and reaction templates, executed by deterministic chemistry tools, and scored through task-specific molecular oracles. Large language models (LLMs) are used only as strategy controllers that select high-level preferences over route length, move type, reaction families, motifs, and exploration pressure, while local code performs route construction, validation, deduplication, scoring, selection, and memory updates. This separation lets the LLM guide exploration without allowing it to introduce hallucinated products or unsupported reaction steps. On a soluble epoxide hydrolase proxy task, our LLM agent improves over single-pass LLM and deterministic controllers, reaching state-of-the-art performance across the sEH score, synthetic accessibility score, and AiZynthFinder success rate metrics. These results suggest that constrained LLM agents can play a significant role in molecular discovery without requiring training, fine-tuning, or dedicated generative models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces My Chemical Harness, a route-native evolutionary framework for goal-directed molecular design. The search population consists of executable synthetic pathways built from purchasable building blocks and reaction templates. LLMs are used solely as strategy controllers that select high-level preferences over route length, move type, reaction families, motifs, and exploration pressure; all route construction, validation, deduplication, scoring, selection, and memory updates are performed by deterministic chemistry tools. On a soluble epoxide hydrolase (sEH) proxy task, the LLM agent is claimed to improve over single-pass LLM and deterministic controllers, reaching state-of-the-art performance on the sEH score, synthetic accessibility score, and AiZynthFinder success rate metrics.
Significance. If the performance claims hold under rigorous evaluation, the work offers a practical template for integrating LLMs into molecular discovery while strictly separating high-level strategy from chemical execution. This constrained role for the LLM avoids hallucinated reactions and allows reuse of existing deterministic oracles and route planners. The clean, reproducible separation of concerns and the use of off-the-shelf LLMs without fine-tuning are strengths that could be adopted more broadly if the empirical gains prove robust.
major comments (3)
- [Results] Results section: the central claim that the LLM agent reaches SOTA across sEH score, SA score, and AiZynthFinder success rate is presented without any reported numerical values, error bars, number of independent runs, or statistical tests. This absence makes it impossible to assess whether the reported improvements are statistically meaningful or reproducible.
- [Methods] Methods section: no ablation is described that replaces the LLM-derived high-level preferences with fixed heuristics or random preferences while keeping the same evolutionary machinery, population size, and evaluation budget. Without this control, it is unclear whether the performance gain is attributable to chemically informative guidance or simply to the evolutionary search itself.
- [Experimental Setup] Experimental setup: the comparison to deterministic controllers does not state whether the total number of oracle evaluations, generations, or wall-clock time was matched across conditions. Matched budgets are required to attribute any advantage specifically to the LLM preference signals rather than differences in search effort.
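Budget matching of this kind can be enforced mechanically rather than by convention. The sketch below is an assumption about how one might do it, not a description of the paper's code: `BudgetedOracle` is a hypothetical wrapper that counts scoring calls and refuses to exceed a fixed limit. Whether cached repeats should count against the budget is a design choice; here they do not.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a condition tries to spend more oracle calls than allowed."""


class BudgetedOracle:
    """Wrap a scoring function and refuse new evaluations past a fixed budget."""

    def __init__(self, score_fn, max_calls: int):
        self.score_fn = score_fn
        self.max_calls = max_calls
        self.calls = 0
        self.cache = {}

    def __call__(self, molecule: str) -> float:
        # Repeated molecules are served from the cache at no budget cost.
        if molecule in self.cache:
            return self.cache[molecule]
        if self.calls >= self.max_calls:
            raise BudgetExceeded(f"budget of {self.max_calls} calls exhausted")
        self.calls += 1
        self.cache[molecule] = self.score_fn(molecule)
        return self.cache[molecule]


# Toy scoring function (assumption): string length instead of a real oracle.
oracle = BudgetedOracle(score_fn=lambda smiles: float(len(smiles)), max_calls=2)
oracle("CCO")
oracle("CCO")  # cache hit, budget unchanged
oracle("CCN")
```

Giving each compared condition its own wrapper with the same `max_calls` makes the evaluation count an explicit, checkable quantity that can be reported alongside the metrics.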
minor comments (2)
- [Abstract] Abstract: the phrase 'state-of-the-art performance' is used without citing the specific prior methods or numerical thresholds that define the current SOTA on the sEH proxy task.
- [Methods] The description of the preference vocabulary (route length, move type, reaction families, motifs, exploration pressure) would benefit from an explicit enumeration or table of the discrete options available to the LLM at each decision point.
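Such an enumeration could also double as a validation layer between the LLM and the search code. In the sketch below the five field names follow the paper's description, but every allowed value and default is invented for illustration, and `parse_preferences` is a hypothetical helper.

```python
import json

# Field names follow the paper's description; the allowed values are
# hypothetical placeholders, not the paper's actual option sets.
VOCABULARY = {
    "route_length": {"short", "medium", "long"},
    "move_type": {"extend", "swap", "truncate"},
    "reaction_family": {"amide_coupling", "suzuki", "reductive_amination"},
    "motif": {"urea", "amide", "none"},
    "exploration_pressure": {"low", "medium", "high"},
}
DEFAULTS = {
    "route_length": "medium",
    "move_type": "extend",
    "reaction_family": "amide_coupling",
    "motif": "none",
    "exploration_pressure": "medium",
}


def parse_preferences(reply: str) -> dict:
    """Parse a JSON reply; missing or out-of-vocabulary fields use defaults."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    prefs = {}
    for name, allowed in VOCABULARY.items():
        value = data.get(name)
        is_known = isinstance(value, str) and value in allowed
        prefs[name] = value if is_known else DEFAULTS[name]
    return prefs


# Unknown values and extra keys (such as a proposed structure) are dropped.
print(parse_preferences('{"move_type": "invent_new_reaction", "smiles": "CCO"}'))
```

Because only enumerated values survive parsing, a reply that tries to name a molecule or an unsupported reaction degrades to default preferences instead of reaching the chemistry tools.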
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which identify key gaps in reporting and experimental controls. We address each major comment below and will incorporate revisions to strengthen the manuscript.
- Referee: [Results] Results section: the central claim that the LLM agent reaches SOTA across sEH score, SA score, and AiZynthFinder success rate is presented without any reported numerical values, error bars, number of independent runs, or statistical tests. This absence makes it impossible to assess whether the reported improvements are statistically meaningful or reproducible.
  Authors: We agree that the absence of numerical values, error bars, run counts, and statistical tests prevents proper assessment of the claims. In the revised manuscript, we will add a dedicated results table reporting mean performance metrics with standard deviations across multiple independent runs (minimum of five), along with appropriate statistical comparisons (e.g., t-tests or Wilcoxon tests) against baselines. This will include the sEH score, SA score, and AiZynthFinder success rate.
  Revision: yes
- Referee: [Methods] Methods section: no ablation is described that replaces the LLM-derived high-level preferences with fixed heuristics or random preferences while keeping the same evolutionary machinery, population size, and evaluation budget. Without this control, it is unclear whether the performance gain is attributable to chemically informative guidance or simply to the evolutionary search itself.
  Authors: This comment correctly identifies a missing control. We will add an ablation study to the Methods and Results sections in the revision. The study will compare the LLM preference controller against (i) random preference selection and (ii) fixed heuristic rules, while holding the evolutionary machinery, population size, and total evaluation budget constant. Performance differences will be quantified and reported.
  Revision: yes
- Referee: [Experimental Setup] Experimental setup: the comparison to deterministic controllers does not state whether the total number of oracle evaluations, generations, or wall-clock time was matched across conditions. Matched budgets are required to attribute any advantage specifically to the LLM preference signals rather than differences in search effort.
  Authors: We acknowledge that budget matching must be explicitly documented. The original experiments were designed with matched generation counts and oracle evaluation limits, but this was not stated clearly. In the revision, we will add explicit statements confirming that all compared conditions use identical total oracle evaluations and generations; wall-clock times will also be reported. If any prior runs deviated, we will rerun under strictly matched conditions.
  Revision: yes
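The reporting promised in these responses (mean and standard deviation over at least five runs, plus a paired comparison against each baseline) can be sketched with the standard library alone. Note the substitution: instead of the t-test or Wilcoxon test the authors mention, this uses an exact paired sign-flip permutation test, and it assumes runs are paired by seed. The numbers are made up purely to exercise the functions and are not results from the paper.

```python
from itertools import product
from statistics import mean, stdev


def summarize(scores):
    """Mean and sample standard deviation across independent runs."""
    return mean(scores), stdev(scores)


def sign_flip_p_value(a, b):
    """Exact two-sided paired sign-flip permutation test on the summed difference."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Under the null, each paired difference is equally likely to be +d or -d.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Made-up placeholder scores, one per seed; NOT results from the paper.
toy_agent = [0.80, 0.82, 0.79, 0.85, 0.81]
toy_baseline = [0.70, 0.75, 0.72, 0.74, 0.73]

agent_mean, agent_sd = summarize(toy_agent)
p_value = sign_flip_p_value(toy_agent, toy_baseline)
```

With five paired runs there are only 2^5 = 32 sign patterns, so the smallest two-sided p-value this test can return is 2/32 = 0.0625, and the same floor applies to the Wilcoxon signed-rank test. Five runs therefore cannot reach the conventional 0.05 threshold with either test, which argues for more runs than the stated minimum.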
Circularity Check
No circularity: empirical performance claims rest on external benchmarks
The manuscript reports an empirical comparison of an LLM agent versus single-pass LLM and deterministic controllers on a soluble epoxide hydrolase proxy task, measuring sEH score, synthetic accessibility, and AiZynthFinder success rate. No equations, parameter fits, or derivations are presented whose outputs are forced by construction from the inputs. The separation of high-level LLM preferences from deterministic route construction, validation, and scoring is described procedurally rather than as a mathematical reduction; results are obtained by running the system on external oracles and reporting observed metrics. The work is therefore self-contained against independent benchmarks with no load-bearing step that collapses to self-definition, fitted prediction, or self-citation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Deterministic chemistry tools can reliably execute and validate proposed reaction sequences without error.