MolClaw: An Autonomous Agent with Hierarchical Skills for Drug Molecule Evaluation, Screening, and Optimization

Bowen Zhou; Haoran Sun; Haoyang Su; Lei Bai; Lilong Wang; Lisheng Zhang; Qikui Yang; Qingsong Li; Wei Tang; Wenjie Lou

arxiv: 2604.21937 · v2 · pith:6I7ZAYPAnew · submitted 2026-04-02 · 💻 cs.AI · cs.MA

Lisheng Zhang, Lilong Wang, Xiangyu Sun, Wei Tang, Haoyang Su, Yuehui Qian, Qikui Yang, Qingsong Li, Zhenyu Tang, Haoran Sun, Yingnan Han, Yankai Jiang, Wenjie Lou, Bowen Zhou, Xiaosong Wang, Lei Bai, Zhengwei Xie

Pith reviewed 2026-05-21 10:42 UTC · model grok-4.3

classification 💻 cs.AI cs.MA

keywords autonomous agents · drug discovery · hierarchical skills · molecular screening · optimization · workflow orchestration · benchmarks · AI agents

The pith

MolClaw uses a three-tier skill hierarchy to orchestrate dozens of tools for drug molecule evaluation, screening, and optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current AI agents struggle to sustain performance when drug discovery requires chaining many specialized tools across multi-step workflows. MolClaw addresses this limitation with a three-tier architecture that organizes 70 skills drawn from over 30 domain resources. Tool-level skills perform basic operations, workflow-level skills build validated pipelines that include checks and reflection, and a discipline-level skill supplies field-wide scientific principles for planning. The authors also release MolBench, a benchmark of screening, optimization, and end-to-end tasks that demand between 8 and 50 or more sequential tool calls. Ablation results show the largest gains on the structured, high-complexity subset of tasks, while simpler tasks solvable by direct scripting show little difference.

Core claim

MolClaw unifies over 30 specialized domain resources through a three-tier hierarchical skill architecture comprising 70 skills in total. Tool-level skills standardize atomic operations, workflow-level skills compose them into validated pipelines that include quality checks and reflection, and a discipline-level skill supplies scientific principles that govern planning and verification across scenarios. The paper introduces MolBench, a benchmark spanning molecular screening, optimization, and end-to-end discovery challenges that require 8 to 50 or more sequential tool calls. On this benchmark MolClaw records state-of-the-art performance across all metrics. Ablation studies indicate that the performance gains concentrate on tasks that demand structured workflows and vanish on those solvable with ad hoc scripting.

What carries the argument

Three-tier hierarchical skill architecture that separates atomic tool operations, composable workflow pipelines with reflection, and discipline-level scientific principles for consistent planning and verification.
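The separation of tiers can be pictured as a small registry in which higher tiers may only compose lower ones. The following is a minimal sketch under stated assumptions, not the paper's implementation: `Tier`, `Skill`, `SkillRegistry`, `make_workflow`, and every skill name below are hypothetical, and the toy skills only manipulate a string so that the example runs without any chemistry library.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List

State = Dict[str, object]


class Tier(Enum):
    TOOL = 1        # atomic operations
    WORKFLOW = 2    # pipelines composed of tool-level skills, with checks
    DISCIPLINE = 3  # field-wide principles used for planning and verification


@dataclass
class Skill:
    name: str
    tier: Tier
    run: Callable[[State], State]
    uses: List[str] = field(default_factory=list)  # names of composed skills


class SkillRegistry:
    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}

    def register(self, skill: Skill) -> None:
        # Reject a skill that composes something from its own tier or above.
        for dep in skill.uses:
            if self._skills[dep].tier.value >= skill.tier.value:
                raise ValueError(f"{skill.name} cannot compose {dep}")
        self._skills[skill.name] = skill

    def get(self, name: str) -> Skill:
        return self._skills[name]

    def by_tier(self, tier: Tier) -> List[str]:
        return [s.name for s in self._skills.values() if s.tier is tier]


def make_workflow(registry: SkillRegistry, steps: List[str],
                  check: Callable[[State], bool]) -> Callable[[State], State]:
    """Chain tool-level skills and run a quality check after every step."""
    def run(state: State) -> State:
        for step in steps:
            state = registry.get(step).run(state)
            if not check(state):
                raise RuntimeError(f"quality check failed after {step}")
        return state
    return run


# Hypothetical toy skills, for illustration only.
registry = SkillRegistry()
registry.register(Skill("strip_smiles", Tier.TOOL,
                        lambda s: {**s, "smiles": str(s["smiles"]).strip()}))
registry.register(Skill("count_chars", Tier.TOOL,
                        lambda s: {**s, "length": len(str(s["smiles"]))}))
registry.register(Skill(
    "clean_and_measure", Tier.WORKFLOW,
    make_workflow(registry, ["strip_smiles", "count_chars"],
                  check=lambda s: bool(s["smiles"])),
    uses=["strip_smiles", "count_chars"],
))
# Placeholder discipline-level skill: attaches a principle for the planner to read.
registry.register(Skill("planning_principles", Tier.DISCIPLINE,
                        lambda s: {**s, "principles": ["verify each intermediate result"]}))

result = registry.get("clean_and_measure").run({"smiles": "  CCO \n"})
```

The point of the design is that the quality check lives in the workflow tier rather than in each tool, so a failed step is caught before the next call consumes its output.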

If this is right

Workflow orchestration competence, rather than access to individual tools, becomes the primary bottleneck for AI-driven drug discovery.
Agents that incorporate workflow-level skills with quality checks maintain higher success rates across long sequences of 20 to 50 tool calls.
Discipline-level skills improve consistency of planning and verification without requiring task-specific retraining.
Performance differences between hierarchical and flat agents concentrate on complex structured problems and vanish on tasks solvable by direct scripting.
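A back-of-envelope calculation suggests why long chains are the hard case. The sketch below assumes every tool call succeeds independently with the same probability `p`, a simplification that the paper does not state; under that assumption an `n`-call chain with no recovery succeeds with probability `p ** n`. The value 0.95 is an illustrative placeholder, not a measured rate.

```python
def chain_success(p: float, n: int) -> float:
    """Probability that n independent calls all succeed, each with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return p ** n


def required_per_call(target: float, n: int) -> float:
    """Per-call success rate needed for an n-call chain to reach `target`."""
    if n <= 0:
        raise ValueError("n must be positive")
    return target ** (1.0 / n)


# Illustrative per-call rate of 0.95 across the chain lengths MolBench spans.
for n in (8, 20, 50):
    print(n, round(chain_success(0.95, n), 3))
```

With these placeholder numbers the chain-level success is roughly 66% at 8 calls, 36% at 20 calls, and 8% at 50 calls, which is consistent with the idea that checks and recovery inside the workflow matter more as sequences grow.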

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same tiered skill structure could be tested in adjacent domains such as materials screening or synthetic biology that also rely on chained experimental tools.
Benchmarks that add longer sequences or integration with actual laboratory execution would provide stronger tests of whether the hierarchy scales beyond simulation.
The discipline-level layer suggests a route to transfer scientific constraints across related molecular tasks without rebuilding lower-level skills.

Load-bearing premise

The MolBench tasks and evaluation metrics are representative of real drug-discovery difficulty, and the reported performance advantage is caused by the three-tier hierarchical skill design rather than other unstated implementation choices.

What would settle it

An ablation on MolBench that disables workflow-level and discipline-level skills and measures whether success rates on tasks requiring 20 or more tool calls fall to the level of simple ad-hoc scripting baselines.
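The comparison described above amounts to grouping runs by ablation arm and by task length, then comparing success rates within the long-task bucket. A minimal sketch follows, assuming a hypothetical record type `Run` and hypothetical arm labels (`"full"`, `"tool_only"`); the records are synthetic placeholders for illustration only and are not MolBench results.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Run(NamedTuple):
    arm: str         # hypothetical arm label, e.g. "full" or "tool_only"
    tool_calls: int  # sequential tool calls the task required
    success: bool


def success_by_bucket(runs: Iterable[Run],
                      threshold: int = 20) -> Dict[Tuple[str, str], float]:
    """Success rate per (arm, bucket); 'long' means tool_calls >= threshold."""
    counts = defaultdict(lambda: [0, 0])  # (arm, bucket) -> [successes, total]
    for r in runs:
        bucket = "long" if r.tool_calls >= threshold else "short"
        entry = counts[(r.arm, bucket)]
        entry[0] += int(r.success)
        entry[1] += 1
    return {key: s / n for key, (s, n) in counts.items()}


# Synthetic placeholder records, not measurements.
runs = [
    Run("full", 25, True), Run("full", 30, True), Run("full", 8, True),
    Run("tool_only", 25, False), Run("tool_only", 30, True), Run("tool_only", 8, True),
]
rates = success_by_bucket(runs)
gap_long = rates[("full", "long")] - rates[("tool_only", "long")]
gap_short = rates[("full", "short")] - rates[("tool_only", "short")]
```

The claim under test predicts a large `gap_long` and a `gap_short` near zero; an ablation in which both gaps are similar would count against it.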

Figures

Figures reproduced from arXiv: 2604.21937 by Bowen Zhou, Haoran Sun, Haoyang Su, Lei Bai, Lilong Wang, Lisheng Zhang, Qikui Yang, Qingsong Li, Wei Tang, Wenjie Lou, Xiangyu Sun, Xiaosong Wang, Yankai Jiang, Yingnan Han, Yuehui Qian, Zhengwei Xie, Zhenyu Tang.

**Figure 2.** Agent execution traces for the three MolBench-E2E tasks. (A) E2E-Q1: coarse-grained conformational sampling. Five tool-level failures (red) were resolved via skill-governed recovery actions (orange), yielding 20 verified all-atom structures. (B) E2E-Q2: QED-driven iterative optimization. One tool fallback (F1), two constraint-driven rejections (F2–F3), and five strategy adaptations (D1–D5) were autonomous…

**Figure 3.** MolClaw achieves state-of-the-art performance across all MolBench evaluation dimensions. (A) Binding affinity comparison accuracy. MolClaw-CC achieves 81.1%. (B) Docking screening hit count. MolClaw-CC attains 0.80. (C) Molecule editing accuracy. MolClaw-CC reaches 100.0%. (D) Optimization success rate. (E) Property filtering accuracy. (F) Property filtering F1 score. (G) Agent systems grouped comparison a…

**Figure 4.** Statistical validation confirms the significance and reliability of MolClaw’s performance advantages. (A) Normalized performance heatmap across seven metrics for 13 methods. MolClaw variants highlighted by red borders. (B–E) Wilson score 95% CI forest plots for binding affinity accuracy (B), molecule editing accuracy (C), optimization success rate (D), and property filtering accuracy (E). (F–I) Category-l…

**Figure 5.** Ablation studies and in-depth statistical analyses reveal the mechanistic basis of MolClaw’s superiority. (A–C) Ablation on Claude Code and OpenClaw platforms: accuracy metrics (A), docking hit count (B), optimization delta (C). Largest skill-driven gain: binding affinity +29.7 pp (P = 0.013, h = 0.64). (D) Rank trajectory across four tasks for top six methods. (E) Average rank (Friedman χ² = 35.35, P = 2…

**Figure 6.** Coarse-grained conformational sampling of the EGFR kinase domain by OpenAWSEM and GoCa. (A) Superposition of 10 PULCHRA-reconstructed all-atom conformations from the OpenAWSEM ensemble, aligned to the 1M17 crystal structure. (B) Corresponding superposition for the GoCa ensemble. (C) Cα-RMSD to native structure: GoCa 4.54 ± 0.93 Å versus AWSEM 7.78 ± 1.53 Å (P = 7.69 × 10⁻⁴). (D) Radius of gyration: GoCa…

**Figure 7.** QED-driven iterative optimization of a triazolo-benzodiazepine scaffold by the AI agent. (A) Multi-dimensional property trajectory across five optimization rounds: QED score (target ≥ 0.70), MW, ALogP, Tanimoto similarity (constraint ≥ 0.40), TPSA and rotatable bonds. (B) QED desirability decomposition by component and round (R0–R5). (C) QED–Tanimoto trade-off for all 182 molecules; red stars, selected bes…

**Figure 8.** Comprehensive evaluation of AI-agent-driven iterative lead optimization of Erlotinib targeting the EGFR kinase domain. (A) Optimization trajectory showing best QuickVina docking score per round; blue dashed line: Erlotinib baseline (−6.9 kcal/mol); red dashed line: −8.9 kcal/mol target. (B) Docking score distributions across Rounds 1–6 (box-and-strip plot, n = 54). (C) Tanimoto similarity heatmap between r…

**Figure 9.** Schrödinger-style 2D protein–ligand interaction diagrams and 3D pose overlay. (A) Erlotinib baseline (−6.9 kcal/mol): two H-bonds (Thr766, Asp831) and eight hydrophobic contacts. (B) R1 best (−7.4): methoxy shortening + meta-Br; new Met769 H-bond (2.99 Å). (C) R2 best (−8.0): Br→F substitution; Met769 maintained. (D) R3 best (−8.3): F + OH + CH3 on aniline. (E) R4 best (−8.9, target met): 2,6-diF-4-OH anil…

**Figure 10.** Statistical validation, source attribution, and interaction conservation analysis. (A) Docking scores of all 54 molecules by source: REINVENT4 (blue, n = 20) and agent-designed (red, n = 34). (B) Per-round mean scores ± s.e.m. tested against baseline (Wilcoxon); R1 n.s., R2–R3 ∗∗, R4–R6 ∗∗∗. (C) Early (R1–R3) vs. late (R4–R6) violin plot (p = 1.24 × 10⁻⁴). (D) Agent vs. REINVENT violin plot (p = 0.104, n…
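The captions for Figures 4 and 5 refer to Wilson score 95% confidence intervals and to an effect size h. The sketch below shows the standard formulas for both, assuming that h denotes Cohen's h for two proportions (the captions do not say so explicitly). The function names are illustrative, and the counts and proportions passed in are placeholders rather than values taken from the paper's tables.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


low, high = wilson_interval(5, 10)  # placeholder counts
h = cohens_h(0.811, 0.514)          # 81.1% versus 81.1% minus 29.7 pp
```

Under the Cohen's h assumption, comparing 81.1% with 51.4% gives h of about 0.64, which agrees with the value quoted in the Figure 5 caption, although the caption does not confirm that 81.1% is the with-skills arm of that particular comparison.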

Original abstract

Computational drug discovery, particularly the complex workflows of drug molecule screening and optimization, requires orchestrating dozens of specialized tools in multi-step workflows, yet current AI agents struggle to maintain robust performance and consistently underperform in these high-complexity scenarios. Here we present MolClaw, an autonomous agent that leads drug molecule evaluation, screening, and optimization. It unifies over 30 specialized domain resources through a three-tier hierarchical skill architecture (70 skills in total) that facilitates agent long-term interaction at runtime: tool-level skills standardize atomic operations, workflow-level skills compose them into validated pipelines with quality check and reflection, and a discipline-level skill supplies scientific principles governing planning and verification across all scenarios in the field. Additionally, we introduce MolBench, a benchmark comprising molecular screening, optimization, and end-to-end discovery challenges spanning 8 to 50+ sequential tool calls. MolClaw achieves state-of-the-art performance across all metrics, and ablation studies confirm that gains concentrate on tasks that demand structured workflows while vanishing on those solvable with ad hoc scripting, establishing workflow orchestration competence as the primary capability bottleneck for AI-driven drug discovery.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MolClaw adds a concrete three-tier skill hierarchy and MolBench benchmark for long-horizon molecular workflows, but the ablations do not clearly separate the tier structure from total skill count or prompt details.

MolClaw builds an agent around a three-tier skill setup for drug molecule tasks and pairs it with MolBench, a benchmark that runs screening, optimization, and end-to-end discovery sequences of 8 to 50+ tool calls. The hierarchy splits skills into tool-level atomic operations, workflow-level pipelines that include quality checks and reflection, and a discipline-level layer that supplies scientific principles for planning and verification. This is a direct attempt to keep performance stable when an agent must chain many specialized molecular resources over extended interactions.

The benchmark itself is a useful addition because it moves past single-step queries and forces agents to handle realistic workflow length and structure. That combination of system and test set is the clearest new element here. The paper does a reasonable job describing how the layers are meant to work together and why they target the coordination problems that show up in current agents on these tasks.

The stress-test note is on point: the abstract claims that ablations show gains only on structured workflows and disappear on ad-hoc ones, yet it gives no numbers, no error bars, and no description of whether the control conditions kept the full set of 70 skills, the same tool interfaces, or equivalent reflection prompts. Without those controls it is difficult to attribute the reported advantage to the tiered organization rather than simply more total resources or implementation choices. The abstract also withholds the actual performance scores and dataset details, so the SOTA claim cannot be checked from what is provided.

This paper is aimed at researchers who build agents for chemistry or early drug discovery and want a worked example of hierarchy applied to domain tools. Readers who need benchmarks that test multi-step orchestration would get some value from MolBench even if they adapt the agent design.

The work has enough concrete pieces to deserve a serious referee, though the results and ablation sections will need tighter controls and full reporting before the central claims can be evaluated properly. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript presents MolClaw, an autonomous agent with a three-tier hierarchical skill architecture (tool-level, workflow-level, and discipline-level skills, totaling 70) designed to orchestrate complex multi-step workflows for drug molecule evaluation, screening, and optimization. It unifies over 30 specialized domain resources and introduces MolBench, a benchmark consisting of molecular screening, optimization, and end-to-end discovery tasks that require 8 to 50+ sequential tool calls. The central claims are that MolClaw achieves state-of-the-art performance across all metrics on MolBench and that ablation studies demonstrate the performance gains concentrate on tasks demanding structured workflows, thereby establishing workflow orchestration competence as the primary bottleneck for AI-driven drug discovery.

Significance. If the performance claims and ablation results hold under rigorous controls, the work could meaningfully advance the design of hierarchical agents for long-horizon scientific workflows in chemistry. The MolBench benchmark may also provide a reusable testbed for evaluating agent robustness on tasks with high sequential complexity. The explicit framing of workflow orchestration as a distinct capability bottleneck is a useful conceptual contribution.

major comments (2)

[Abstract] Abstract: the manuscript asserts SOTA results across all metrics and attributes them to the three-tier architecture via ablations, yet the abstract (and by extension the reported results) supplies no numerical scores, error bars, dataset sizes, or statistical tests. This absence makes it impossible to evaluate the magnitude or reliability of the claimed performance advantage.
[Ablation studies] Ablation studies: the claim that gains concentrate on structured-workflow tasks while vanishing on ad-hoc ones is load-bearing for the central architectural argument. However, it is not shown whether the non-hierarchical control agents retain the full set of 70 skills, employ identical tool interfaces, or receive equivalent reflection/quality-check prompts. Without these controls, the measured advantage cannot be confidently attributed to the tiered organization rather than differences in total skill count or implementation details.

minor comments (2)

[Introduction] Clarify the precise mapping between the stated 'over 30 specialized domain resources' and the final count of 70 skills.
[MolBench] MolBench section: provide explicit definitions of the evaluation metrics and a clearer justification that the chosen tasks and sequential lengths are representative of real drug-discovery difficulty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve clarity and experimental transparency.

Referee: [Abstract] Abstract: the manuscript asserts SOTA results across all metrics and attributes them to the three-tier architecture via ablations, yet the abstract (and by extension the reported results) supplies no numerical scores, error bars, dataset sizes, or statistical tests. This absence makes it impossible to evaluate the magnitude or reliability of the claimed performance advantage.

Authors: We agree that the abstract would benefit from explicit numerical results to allow readers to assess the scale of the reported gains. In the revised manuscript we have added representative performance figures (success rate and average tool calls on MolBench), noted the dataset sizes, and referenced the error bars and statistical comparisons that appear in the results section.

Revision: yes

Referee: [Ablation studies] Ablation studies: the claim that gains concentrate on structured-workflow tasks while vanishing on ad-hoc ones is load-bearing for the central architectural argument. However, it is not shown whether the non-hierarchical control agents retain the full set of 70 skills, employ identical tool interfaces, or receive equivalent reflection/quality-check prompts. Without these controls, the measured advantage cannot be confidently attributed to the tiered organization rather than differences in total skill count or implementation details.

Authors: We acknowledge that the ablation description did not explicitly document these controls. The non-hierarchical baselines were in fact given the identical set of 70 skills, the same tool interfaces, and equivalent reflection and quality-check prompts. To eliminate any ambiguity we have inserted a new paragraph in the ablation studies section that states these equivalences and lists the precise prompt templates used for the control agents.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on independent benchmark and ablations

The paper introduces MolClaw's three-tier skill hierarchy and the separate MolBench benchmark, then reports empirical SOTA results plus ablation outcomes showing differential gains on structured vs. ad-hoc tasks. No equations, fitted parameters, or self-referential definitions appear in the provided text that would reduce the performance claims to the architecture by construction. The derivation chain is a standard empirical pipeline (new method + new test set + controlled comparisons) rather than a closed loop where outputs are forced by input definitions or self-citations. The ablation critique concerns experimental controls, not circular reduction of results to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the ledger is therefore empty.

Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages · 5 internal anchors

[1]

Gromacs: High performance molecular simulations through multi-level paral- lelism from laptops to supercomputers.SoftwareX, 1:19–25, 2015

Mark James Abraham, Teemu Murtola, Roland Schulz, Szilárd Páll, Jeremy C Smith, Berk Hess, and Erik Lindahl. Gromacs: High performance molecular simulations through multi-level paral- lelism from laptops to supercomputers.SoftwareX, 1:19–25, 2015

work page 2015
[2]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Ale- man, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023. 27

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Alphafold db: Open repository of protein structure predictions

AlphaFold Database Consortium. Alphafold db: Open repository of protein structure predictions. https://alphafold.ebi.ac.uk/, 2024

work page 2024
[4]

The claude model family.https://www.anthropic.com/claude, 2024

Anthropic. The claude model family.https://www.anthropic.com/claude, 2024

work page 2024
[5]

Claude code: a command-line tool for agentic coding.https://docs.anthropic.c om/en/docs/claude-code, 2025

Anthropic. Claude code: a command-line tool for agentic coding.https://docs.anthropic.c om/en/docs/claude-code, 2025

work page 2025
[6]

Liddia: Language-based intelligent drug discovery agent

Reza Averly, Frazier N Baker, Ian A Watson, and Xia Ning. Liddia: Language-based intelligent drug discovery agent. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12015–12039, 2025

work page 2025
[7]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

Artificial intelligence in drug discovery: what is realistic, what are illusions? part 1: Ways to make an impact, and why we are not there yet

Andreas Bender and Isidro Cortés-Ciriano. Artificial intelligence in drug discovery: what is realistic, what are illusions? part 1: Ways to make an impact, and why we are not there yet. Drug discovery today, 26(2):511–524, 2021

work page 2021
[9]

Quantifying the chemical beauty of drugs.Nature Chemistry, 4:90–98, 2012

G Richard Bickerton, Gaia V Paolini, Jérémy Besnard, Sorel Muresan, and Andrew L Hopkins. Quantifying the chemical beauty of drugs.Nature Chemistry, 4:90–98, 2012

work page 2012
[10]

Autonomous chemical research with large language models.Nature, 624(7992):570–578, 2023

Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models.Nature, 624(7992):570–578, 2023

work page 2023
[11]

Prolif: a library to encode molecular interactions as fingerprints.Journal of cheminformatics, 13(1):72, 2021

Cédric Bouysset and Sébastien Fiorucci. Prolif: a library to encode molecular interactions as fingerprints.Journal of cheminformatics, 13(1):72, 2021

work page 2021
[12]

Evobind: in silico directed evolution of peptide binders with alphafold.bioRxiv, pages 2022–07, 2022

Patrick Bryant and Arne Elofsson. Evobind: in silico directed evolution of peptide binders with alphafold.bioRxiv, pages 2022–07, 2022

work page 2022
[13]

Generic protein–ligand interaction scoring by inte- grating physical prior knowledge and data augmentation modelling.Nature Machine Intelligence, 6(6):688–700, 2024

Duanhua Cao, Geng Chen, Jiaxin Jiang, Jie Yu, Runze Zhang, Mingan Chen, Wei Zhang, Lifan Chen, Feisheng Zhong, Yingying Zhang, et al. Generic protein–ligand interaction scoring by inte- grating physical prior knowledge and data augmentation modelling.Nature Machine Intelligence, 6(6):688–700, 2024

work page 2024
[14]

Mozi: Governed autonomy for drug discovery llm agents.arXiv preprint arXiv:2603.03655, 2026

He Cao, Siyu Liu, Fan Zhang, Zijing Liu, Hao Li, Bin Feng, Shengyuan Bai, Leqing Chen, Kai Xie, and Yu Li. Mozi: Governed autonomy for drug discovery llm agents.arXiv preprint arXiv:2603.03655, 2026

work page arXiv 2026
[15]

Chembl: The global bioactivity database for drug discovery.https: //www.ebi.ac.uk/chembl/, 2024

ChEMBL Consortium. Chembl: The global bioactivity database for drug discovery.https: //www.ebi.ac.uk/chembl/, 2024

work page 2024
[16]

(23) Varadi, M.; Velankar, S

Gabriele Corso, Hannes Stärk, Bowen Jing, Regina Barzilay, and Tommi Jaakkola. Diffdock: Diffusion steps, twists, and turns for molecular docking.arXiv preprint arXiv:2210.01776, 2022

work page arXiv 2022
[17]

Robust deep learning–based protein sequence design using proteinmpnn.Science, 378(6615):49–56, 2022

Justas Dauparas, Ivan Anishchenko, Nathaniel Bennett, Hua Bai, Robert J Ragotte, Lukas F Milles, Basile IM Wicky, Alexis Courbet, Rob J de Haas, Neville Bethel, et al. Robust deep learning–based protein sequence design using proteinmpnn.Science, 378(6615):49–56, 2022

work page 2022
[18]

Pymol: An open-source molecular graphics tool.CCP4 Newsl

Warren L DeLano et al. Pymol: An open-source molecular graphics tool.CCP4 Newsl. protein crystallogr, 40(1):82–92, 2002

work page 2002
[19]

Leading ai-driven drug discovery platforms: 2025 landscape and global outlook

Mahendiran Dharmasivam, Busra Kaya, Adedoyin Akinware, Mahan Gholam Azad, and Des R Richardson. Leading ai-driven drug discovery platforms: 2025 landscape and global outlook. Pharmacological Reviews, page 100102, 2025

work page 2025
[20]

Vina-gpu 2.0: further accelerating autodock vina and its derivatives with graphics processing units.Journal of chemical information and modeling, 63(7):1982–1998, 2023

Ji Ding, Shidi Tang, Zheming Mei, Lingyue Wang, Qinqin Huang, Haifeng Hu, Ming Ling, and Jiansheng Wu. Vina-gpu 2.0: further accelerating autodock vina and its derivatives with graphics processing units.Journal of chemical information and modeling, 63(7):1982–1998, 2023. 28

work page 1982
[21]

Openmm 7: Rapid development of high performance algorithms for molecular dynamics.PLoS computational biology, 13(7):e1005659, 2017

Peter Eastman, Jason Swails, John D Chodera, Robert T McGibbon, Yutong Zhao, Kyle A Beauchamp, Lee-Ping Wang, Andrew C Simmonett, Matthew P Harrigan, Chaya D Stern, et al. Openmm 7: Rapid development of high performance algorithms for molecular dynamics.PLoS computational biology, 13(7):e1005659, 2017

work page 2017
[22]

Autodock vina 1.2

Jerome Eberhardt, Diogo Santos-Martins, Andreas F Tillack, and Stefano Forli. Autodock vina 1.2. 0: new docking methods, expanded force field, and python bindings.Journal of chemical information and modeling, 61(8):3891–3898, 2021

work page 2021
[23]

Internagent-1.5: A unified agentic framework for long-horizon autonomous scientific discovery.arXiv preprint arXiv:2602.08990, 2026

Shiyang Feng, Runmin Ma, Xiangchao Yan, Yue Fan, Yusong Hu, Songtao Huang, Shuaiyu Zhang, Zongsheng Cao, Tianshuo Peng, Jiakang Yuan, et al. Internagent-1.5: A unified agentic framework for long-horizon autonomous scientific discovery.arXiv preprint arXiv:2602.08990, 2026

work page arXiv 2026
[24]

Glide: a new approach for rapid, accurate docking and scoring

Richard A Friesner, Jay L Banks, Robert B Murphy, Thomas A Halgren, Jasna J Klicic, Daniel T Mainz, Matthew P Repasky, Eric H Knoll, Mee Shelley, Jason K Perry, et al. Glide: a new approach for rapid, accurate docking and scoring. 1. method and assessment of docking accuracy. Journal of medicinal chemistry, 47(7):1739–1749, 2004

work page 2004
[25]

Empowering biomedical discovery with ai agents.Cell, 187(22):6125–6151, 2024

Shanghua Gao, Ada Fang, Yepeng Huang, Valentina Giunchiglia, Ayush Noori, Jonathan Richard Schwarz, Yasha Ektefaie, Jovana Kondic, and Marinka Zitnik. Empowering biomedical discovery with ai agents.Cell, 187(22):6125–6151, 2024

work page 2024
[26]

Txagent: an ai agent for therapeutic reasoning across a universe of tools.arXiv preprint arXiv:2503.10970, 2025

Shanghua Gao, Richard Zhu, Zhenglun Kong, Ayush Noori, Xiaorui Su, Curtis Ginder, Theodoros Tsiligkaridis, and Marinka Zitnik. Txagent: an ai agent for therapeutic reasoning across a universe of tools.arXiv preprint arXiv:2503.10970, 2025

work page arXiv 2025
[27]

Democratising real-world drug discovery through agentic ai.Drug Discovery Today, page 104605, 2026

Jiazhen He, Helen Lai, Lakshidaa Saigiridharan, Gian Marco Ghiandoni, Kinga Jenei, Umur Gokalp, Ajsa Nukovic, Ola Engkvist, Jon Paul Janet, and Samuel Genheden. Democratising real-world drug discovery through agentic ai.Drug Discovery Today, page 104605, 2026

work page 2026
[28]

Biomni: A general-purpose biomedical ai agent.biorxiv, 2025

Kexin Huang, Serena Zhang, Hanchen Wang, Yuanhao Qu, Yingzhou Lu, Yusuf Roohani, Ryan Li, Lin Qiu, Gavin Li, Junze Zhang, et al. Biomni: A general-purpose biomedical ai agent.biorxiv, 2025

work page 2025
[29]

Illuminating protein space with a programmable generative model.Nature, 623(7989):1070–1078, 2023

John B Ingraham, Max Baranov, Zak Costello, Karl W Barber, Wujie Wang, Ahmed Ismail, VincentFrappier, DanaMLord, ChristopherNg-Thow-Hing, ErikRVanVlack, etal. Illuminating protein space with a programmable generative model.Nature, 623(7989):1070–1078, 2023

work page 2023
[30]

Scp: Accelerating discovery with a global web of autonomous scientific agents.arXiv preprint arXiv:2512.24189, 2025

Yankai Jiang, Wenjie Lou, Lilong Wang, Zhenyu Tang, Shiyang Feng, Jiaxuan Lu, Haoran Sun, Yaning Pan, Shuang Gu, Haoyang Su, et al. Scp: Accelerating discovery with a global web of autonomous scientific agents.arXiv preprint arXiv:2512.24189, 2025

work page arXiv 2025
[31]

Deepsite: protein-binding site predictor using 3d-convolutional neural networks.Bioinformatics, 33(19):3036–3042, 2017

José Jiménez, Stefan Doerr, Gerard Martínez-Rosell, Alexander S Rose, and Gianni De Fabritiis. Deepsite: protein-binding site predictor using 3d-convolutional neural networks.Bioinformatics, 33(19):3036–3042, 2017

work page 2017
[32]

Highly accurate protein structure prediction with alphafold.nature, 596(7873):583–589, 2021

JohnJumper, RichardEvans, AlexanderPritzel, TimGreen, MichaelFigurnov, OlafRonneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold.nature, 596(7873):583–589, 2021

work page 2021
[33]

P2rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure.Journal of cheminformatics, 10(1):39, 2018

Radoslav Krivák and David Hoksza. P2rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure.Journal of cheminformatics, 10(1):39, 2018

work page 2018
[34]

Fpocket: an open source platform for ligand pocket detection.BMC bioinformatics, 10(1):168, 2009

Vincent Le Guilloux, Peter Schmidtke, and Pierre Tuffery. Fpocket: an open source platform for ligand pocket detection.BMC bioinformatics, 10(1):168, 2009

work page 2009
[35]

Scalable emulation of protein equilibrium ensembles with generative deep learning.Science, 389(6761):eadv9817, 2025

Sarah Lewis, Tim Hempel, José Jiménez-Luna, Michael Gastegger, Yu Xie, Andrew YK Foong, Victor García Satorras, Osama Abdin, Bastiaan S Veeling, Iryna Zaporozhets, et al. Scalable emulation of protein equilibrium ensembles with generative deep learning.Science, 389(6761):eadv9817, 2025. 29

work page 2025
[36]

Beyond chemical qa: Evaluating llm’s chemical reasoning with modular chemical operations.arXiv preprint arXiv:2505.21318, 2025

Hao Li, He Cao, Bin Feng, Yanjun Shao, Xiangru Tang, Zhiyuan Yan, Li Yuan, Yonghong Tian, and Yu Li. Beyond chemical qa: Evaluating llm’s chemical reasoning with modular chemical operations.arXiv preprint arXiv:2505.21318, 2025

work page arXiv 2025
[37]

Evolutionary-scale prediction of atomic-level protein structure with a language model.Science, 379(6637):1123–1130, 2023

Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model.Science, 379(6637):1123–1130, 2023

work page 2023
[38]

Drugagent: Automating ai-aided drug discovery programming through llm multi-agent collaboration

SizheLiu, YizhouLu, SiyuChen, XiyangHu, JieyuZhao, YingzhouLu, andYueZhao. Drugagent: Automating ai-aided drug discovery programming through llm multi-agent collaboration.arXiv preprint arXiv:2411.15692, 2024

work page arXiv 2024
[39]

Hannes H Loeffler, Jiazhen He, Alessandro Tibo, Jon Paul Janet, Alexey Voronov, Lewis H Mervin, and Ola Engkvist. Reinvent 4: Modern ai–driven generative molecule design. Journal of Cheminformatics, 16(1):20, 2024
[40]

Wei Lu, Carlos Bueno, Nicholas P Schafer, Joshua Moller, Shikai Jin, Xun Chen, Mingchen Chen, Xinyu Gu, Aram Davtyan, Juan J de Pablo, et al. Openawsem with open3spn2: A fast, flexible, and accessible framework for large-scale coarse-grained biomolecular simulations. PLoS computational biology, 17(2):e1008308, 2021
[41]

Andres M. Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller. Augmenting large language models with chemistry tools. Nature machine intelligence, 6(5):525–535, 2024
[42]

Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, et al. Augmented language models: a survey. arXiv preprint arXiv:2302.07842, 2023
[43]

OpenClaw Contributors. Openclaw: an open-source framework for building tool-augmented llm agents. https://github.com/openclaw, 2025
[44]

Qihua Pan, Dong Xu, Jenna Xinyi Yao, Lijia Ma, Zexuan Zhu, and Junkai Ji. Frogent: An end-to-end full-process drug design agent. arXiv preprint arXiv:2508.10760, 2025
[45]

Saro Passaro, Gabriele Corso, Jeremy Wohlwend, Mateo Reveiz, Stephan Thaler, Vignesh Ram Somnath, Noah Getz, Tally Portnoi, Julien Roy, Hannes Stark, et al. Boltz-2: Towards accurate and efficient binding affinity prediction. BioRxiv, 2025
[46]

PubChem Consortium. Pubchem: Open chemistry database. https://pubchem.ncbi.nlm.nih.gov/, 2025
[47]

RCSB PDB Consortium. Rcsb pdb: Research collaboratory for structural bioinformatics protein data bank. https://www.rcsb.org/, 2024
[48]

RDKit Consortium. Rdkit: Open-source cheminformatics software. https://www.rdkit.org/, 2024
[49]

Piotr Rotkiewicz and Jeffrey Skolnick. Fast procedure for reconstruction of full-atom protein models from reduced representations. Journal of computational chemistry, 29(9):1460–1465, 2008
[50]

Anastasiia V Sadybekov and Vsevolod Katritch. Computational approaches streamlining drug discovery. Nature, 616(7958):673–685, 2023
[51]

Sebastian Salentin, Sven Schreiber, V Joachim Haupt, Melissa F Adasme, and Michael Schroeder. Plip: fully automated protein–ligand interaction profiler. Nucleic acids research, 43(W1):W443–W447, 2015
[52]

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems, 36:68539–68551, 2023
[53]

Peter Schmidtke and Xavier Barril. Understanding and predicting druggability. a high-throughput method for detection of drug binding sites. Journal of medicinal chemistry, 53(15):5858–5867, 2010
[54]

Duxin Sun, Wei Gao, Hongxiang Hu, and Simon Zhou. Why 90% of clinical drug development fails and how to improve it? Acta Pharmaceutica Sinica B, 12(7):3049–3062, 2022
[55]

Kyle Swanson, Parker Walther, Jeremy Leitz, Souhrid Mukherjee, Joseph C Wu, Rabindra V Shivnaraine, and James Zou. Admet-ai: a machine learning admet platform for evaluation of large-scale chemical libraries. Bioinformatics, 40(7):btae416, 2024
[56]

Chai Discovery team, Jacques Boitreaud, Jack Dent, Matthew McPartlon, Joshua Meier, Vinicius Reis, Alex Rogozhonikov, and Kevin Wu. Chai-1: Decoding the molecular interactions of life. BioRxiv, pages 2024–10, 2024
[57]

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2024
[58]

The UniProt Consortium. Uniprot: The universal protein knowledgebase. https://www.uniprot.org/, 2025
[59]

Tingzhong Tian, Shuya Li, Ziting Zhang, Lin Chen, Ziheng Zou, Dan Zhao, and Jianyang Zeng. Benchmarking compound activity prediction for real-world drug discovery applications. Communications Chemistry, 7(1):127, 2024
[60]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023
[61]

Mario S Valdés-Tresanco, Mario E Valdés-Tresanco, Pedro A Valiente, and Ernesto Moreno. gmx_mmpbsa: a new tool to perform end-state free energy calculations with gromacs. Journal of chemical theory and computation, 17(10):6281–6291, 2021
[62]

Jessica Vamathevan, Dominic Clark, Paul Czodrowski, Ian Dunham, Edgardo Ferran, George Lee, Bin Li, Anant Madabhushi, Parantu Shah, Michaela Spitzer, et al. Applications of machine learning in drug discovery and development. Nature reviews Drug discovery, 18(6):463–477, 2019
[63]

Ivana Vichentijevikj, Kostadin Mishev, and Monika Simjanoska Misheva. Prompt-to-pill: Multi-agent drug discovery and clinical simulation pipeline. Bioinformatics Advances, 6(1):vbaf323, 2026
[64]

Luis J Walter, Patrick K Quoika, and Martin Zacharias. Structure-based protein assembly simulations including various binding sites and conformations. Journal of Chemical Information and Modeling, 64(8):3465–3476, 2024
[65]

Hanchen Wang, Tianfan Fu, Yuanqi Du, Wenhao Gao, Kexin Huang, Ziming Liu, Payal Chandak, Shengchao Liu, Peter Van Katwyk, Andreea Deac, et al. Scientific discovery in the age of artificial intelligence. Nature, 620(7972):47–60, 2023
[66]

Olivier J Wouters, Martin McKee, and Jeroen Luyten. Estimated research and development investment needed to bring a new medicine to market, 2009-2018. Jama, 323(9):844–853, 2020
[67]

Yumeng Yan, Huanyu Tao, Jiahua He, and Sheng-You Huang. The hdock server for integrated protein–protein docking. Nature protocols, 15(5):1829–1852, 2020
[68]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations, 2022
[69]

Xujun Zhang, Odin Zhang, Chao Shen, Wanglin Qu, Shicheng Chen, Hanqun Cao, Yu Kang, Zhe Wang, Ercheng Wang, Jintu Zhang, et al. Efficient and accurate large library ligand docking with karmadock. Nature Computational Science, 3(9):789–804, 2023
[70]

Ziqiao Zhang, Bangyi Zhao, Ailin Xie, Yatao Bian, and Shuigeng Zhou. Activity cliff prediction: Dataset and benchmark. arXiv preprint arXiv:2302.07541, 2023
[71]

Jie Zhu, Jingxiang Wang, Xin Wang, Mingjing Gao, Bingbing Guo, Miaomiao Gao, Jiarui Liu, Yanqiu Yu, Liang Wang, Weikaixin Kong, et al. Prediction of drug efficacy from transcriptional profiles with deep learning. Nature biotechnology, 39(11):1444–1452, 2021

Data Availability. Both the MolBench dataset (CSV format) and associated evaluation code can be acc...
[72]

Predicting the optimization ceiling from scaffold topology. A central finding of this study is that the triazolo-benzodiazepine scaffold imposes a hard upper bound on the achievable QED score. Here we derive this bound analytically. QED is defined as a weighted geometric mean of eight component desirability functions d_i (ref. [9]): QED = exp(∑_{i=1}^{8} w_i ln d_i ...
[73]

Tanimoto budget exhaustion as a convergence diagnostic. In our main Results we noted that the qualification rate (the fraction of generated molecules satisfying the Tanimoto ≥ 0.40 constraint) declined from 100% (R1–R2) to 57.6% (R5). We propose that this declining rate constitutes a generalizable convergence diagnostic that we term “Tanimoto budget exhaustion...
[74]

Phase transitions versus gradual improvement in property optimization. Not all QED components improved gradually. The structural alerts (ALERTS) desirability exhibited a discontinuous phase transition: 0.241 at R0 to 0.842 at R1, with no further change in R2–R5. This occurred because the starting molecule’s butyl ester triggered two Brenk structural alerts...
[75]

The interpretability–efficiency trade-off in generative design. The evaluation question instructed the agent to “propose 3–5 modified molecules with chemical rationale.” Instead, the agent employed REINVENT4 batch generation to produce 23–54 candidates per round and selected the best by QED ranking. This substitution raises a fundamental question about A...
[76]

Systematic blind spot detection in multi-objective optimization. The agent’s failure to detect the AMES mutagenicity deterioration (+180%, from 0.165 to 0.462) despite tracking over 13 ADMET endpoints illustrates a general vulnerability of attention-based monitoring. The agent explicitly tracked CYP3A4, hERG and DILI at each round, all of which improved...
[77]

Cost-effectiveness and practical stopping rules. The diminishing returns pattern (Fig. 7H) has direct implications for computational resource allocation. Assuming roughly equal tool-call costs per round, and measuring against the R0-to-R4 improvement (+0.4216) since R4 is the recommended molecule, R1 delivers 83.0%, R1–R2 deliver 88.1%, R1–R3 deliver 94.0%, and R1–R4 ...
[78]

Multi-round iterative optimization as an emergent agent capability. The E2E-Q3 task required the AI agent to execute an iterative closed-loop optimization cycle (Strategize, Generate, Dock, Evaluate) across up to 15 rounds, with autonomous decision-making at each round boundary. This represents a fundamentally different challenge from single-step computat...
[79]

Long-range planning, self-repair, and emergent medicinal chemistry knowledge. The agent autonomously authored four pipeline versions (v1–v4, totaling 163 KB of Python), progressively diagnosing and recovering from crashes: v1 failed due to NumPy/RDKit incompatibility, v2 succeeded through R1 but crashed on an f-string bug, v3 resumed from R2 with pre-pro...
[80]

Agent versus REINVENT: complementary collaboration rather than competition. The 3:3 tie in round winners and the non-significant pooled comparison (p = 0.104) mask a deeper complementarity. REINVENT excelled at creative molecular recombination: it serendipitously discovered methoxy shortening in R1 (not hypothesized by the agent), generated the F+OH+CH3 m...

[34] [34]

Fpocket: an open source platform for ligand pocket detection.BMC bioinformatics, 10(1):168, 2009

Vincent Le Guilloux, Peter Schmidtke, and Pierre Tuffery. Fpocket: an open source platform for ligand pocket detection.BMC bioinformatics, 10(1):168, 2009

work page 2009

[35] [35]

Scalable emulation of protein equilibrium ensembles with generative deep learning.Science, 389(6761):eadv9817, 2025

Sarah Lewis, Tim Hempel, José Jiménez-Luna, Michael Gastegger, Yu Xie, Andrew YK Foong, Victor García Satorras, Osama Abdin, Bastiaan S Veeling, Iryna Zaporozhets, et al. Scalable emulation of protein equilibrium ensembles with generative deep learning.Science, 389(6761):eadv9817, 2025. 29

work page 2025

[36] [36]

Beyond chemical qa: Evaluating llm’s chemical reasoning with modular chemical operations.arXiv preprint arXiv:2505.21318, 2025

Hao Li, He Cao, Bin Feng, Yanjun Shao, Xiangru Tang, Zhiyuan Yan, Li Yuan, Yonghong Tian, and Yu Li. Beyond chemical qa: Evaluating llm’s chemical reasoning with modular chemical operations.arXiv preprint arXiv:2505.21318, 2025

work page arXiv 2025

[37] [37]

Evolutionary-scale prediction of atomic-level protein structure with a language model.Science, 379(6637):1123–1130, 2023

Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model.Science, 379(6637):1123–1130, 2023

work page 2023

[38] [38]

Drugagent: Automating ai-aided drug discovery programming through llm multi-agent collaboration

SizheLiu, YizhouLu, SiyuChen, XiyangHu, JieyuZhao, YingzhouLu, andYueZhao. Drugagent: Automating ai-aided drug discovery programming through llm multi-agent collaboration.arXiv preprint arXiv:2411.15692, 2024

work page arXiv 2024

[39] [39]

Reinvent 4: Modern ai–driven generative molecule design.Journal of Cheminformatics, 16(1):20, 2024

Hannes H Loeffler, Jiazhen He, Alessandro Tibo, Jon Paul Janet, Alexey Voronov, Lewis H Mervin, and Ola Engkvist. Reinvent 4: Modern ai–driven generative molecule design.Journal of Cheminformatics, 16(1):20, 2024

work page 2024

[40] [40]

Openawsem with open3spn2: A fast, flexible, and accessible framework for large-scale coarse-grained biomolecular simulations.PLoS computational biology, 17(2):e1008308, 2021

Wei Lu, Carlos Bueno, Nicholas P Schafer, Joshua Moller, Shikai Jin, Xun Chen, Mingchen Chen, Xinyu Gu, Aram Davtyan, Juan J de Pablo, et al. Openawsem with open3spn2: A fast, flexible, and accessible framework for large-scale coarse-grained biomolecular simulations.PLoS computational biology, 17(2):e1008308, 2021

work page 2021

[41] [41]

Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller

Andres M. Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller. Augmenting large language models with chemistry tools.Nature machine intelligence, 6(5):525–535, 2024

work page 2024

[42] [42]

Augmented Language Models: a Survey

Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, et al. Augmented language models: a survey.arXiv preprint arXiv:2302.07842, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[43] [43]

Openclaw: an open-source framework for building tool-augmented llm agents.https://github.com/openclaw, 2025

OpenClaw Contributors. Openclaw: an open-source framework for building tool-augmented llm agents.https://github.com/openclaw, 2025

work page 2025

[44] [44]

Frogent: An end-to-end full-process drug design agent.arXiv preprint arXiv:2508.10760, 2025

Qihua Pan, Dong Xu, Jenna Xinyi Yao, Lijia Ma, Zexuan Zhu, and Junkai Ji. Frogent: An end-to-end full-process drug design agent.arXiv preprint arXiv:2508.10760, 2025

work page arXiv 2025

[45] [45]

Boltz-2: Towards accurate and efficient binding affinity prediction.BioRxiv, 2025

Saro Passaro, Gabriele Corso, Jeremy Wohlwend, Mateo Reveiz, Stephan Thaler, Vignesh Ram Somnath, Noah Getz, Tally Portnoi, Julien Roy, Hannes Stark, et al. Boltz-2: Towards accurate and efficient binding affinity prediction.BioRxiv, 2025

work page 2025

[46] [46]

Pubchem: Open chemistry database.https://pubchem.ncbi.nlm.nih .gov/, 2025

PubChem Consortium. Pubchem: Open chemistry database.https://pubchem.ncbi.nlm.nih .gov/, 2025

work page 2025

[47] [47]

Rcsb pdb: Research collaboratory for structural bioinformatics protein data bank.https://www.rcsb.org/, 2024

RCSB PDB Consortium. Rcsb pdb: Research collaboratory for structural bioinformatics protein data bank.https://www.rcsb.org/, 2024

work page 2024

[48] [48]

Rdkit: Open-source cheminformatics software.https://www.rdkit.org/, 2024

RDKit Consortium. Rdkit: Open-source cheminformatics software.https://www.rdkit.org/, 2024

work page 2024

[49] [49]

Fast procedure for reconstruction of full-atom protein models from reduced representations.Journal of computational chemistry, 29(9):1460–1465, 2008

Piotr Rotkiewicz and Jeffrey Skolnick. Fast procedure for reconstruction of full-atom protein models from reduced representations.Journal of computational chemistry, 29(9):1460–1465, 2008

work page 2008

[50] [50]

Computational approaches streamlining drug discovery.Nature, 616(7958):673–685, 2023

Anastasiia V Sadybekov and Vsevolod Katritch. Computational approaches streamlining drug discovery.Nature, 616(7958):673–685, 2023

work page 2023

[51] [51]

Plip: fully automated protein–ligand interaction profiler.Nucleic acids research, 43(W1):W443– W447, 2015

Sebastian Salentin, Sven Schreiber, V Joachim Haupt, Melissa F Adasme, and Michael Schroeder. Plip: fully automated protein–ligand interaction profiler.Nucleic acids research, 43(W1):W443– W447, 2015

work page 2015

[52] [52]

Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36:68539– 68551, 2023

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36:68539– 68551, 2023. 30

work page 2023

[53] [53]

Understanding and predicting druggability

Peter Schmidtke and Xavier Barril. Understanding and predicting druggability. a high-throughput method for detection of drug binding sites.Journal of medicinal chemistry, 53(15):5858–5867, 2010

work page 2010

[54]

Duxin Sun, Wei Gao, Hongxiang Hu, and Simon Zhou. Why 90% of clinical drug development fails and how to improve it? Acta Pharmaceutica Sinica B, 12(7):3049–3062, 2022

[55]

Kyle Swanson, Parker Walther, Jeremy Leitz, Souhrid Mukherjee, Joseph C Wu, Rabindra V Shivnaraine, and James Zou. ADMET-AI: a machine learning ADMET platform for evaluation of large-scale chemical libraries. Bioinformatics, 40(7):btae416, 2024

[56]

Chai Discovery team, Jacques Boitreaud, Jack Dent, Matthew McPartlon, Joshua Meier, Vinicius Reis, Alex Rogozhnikov, and Kevin Wu. Chai-1: Decoding the molecular interactions of life. bioRxiv, pages 2024–10, 2024

[57]

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2024

[58]

The UniProt Consortium. UniProt: The universal protein knowledgebase. https://www.uniprot.org/, 2025

[59]

Tingzhong Tian, Shuya Li, Ziting Zhang, Lin Chen, Ziheng Zou, Dan Zhao, and Jianyang Zeng. Benchmarking compound activity prediction for real-world drug discovery applications. Communications Chemistry, 7(1):127, 2024

[60]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

[61]

Mario S Valdés-Tresanco, Mario E Valdés-Tresanco, Pedro A Valiente, and Ernesto Moreno. gmx_MMPBSA: a new tool to perform end-state free energy calculations with GROMACS. Journal of chemical theory and computation, 17(10):6281–6291, 2021

[62]

Jessica Vamathevan, Dominic Clark, Paul Czodrowski, Ian Dunham, Edgardo Ferran, George Lee, Bin Li, Anant Madabhushi, Parantu Shah, Michaela Spitzer, et al. Applications of machine learning in drug discovery and development. Nature reviews Drug discovery, 18(6):463–477, 2019

[63]

Ivana Vichentijevikj, Kostadin Mishev, and Monika Simjanoska Misheva. Prompt-to-pill: Multi-agent drug discovery and clinical simulation pipeline. Bioinformatics Advances, 6(1):vbaf323, 2026

[64]

Luis J Walter, Patrick K Quoika, and Martin Zacharias. Structure-based protein assembly simulations including various binding sites and conformations. Journal of Chemical Information and Modeling, 64(8):3465–3476, 2024

[65]

Hanchen Wang, Tianfan Fu, Yuanqi Du, Wenhao Gao, Kexin Huang, Ziming Liu, Payal Chandak, Shengchao Liu, Peter Van Katwyk, Andreea Deac, et al. Scientific discovery in the age of artificial intelligence. Nature, 620(7972):47–60, 2023

[66]

Olivier J Wouters, Martin McKee, and Jeroen Luyten. Estimated research and development investment needed to bring a new medicine to market, 2009-2018. JAMA, 323(9):844–853, 2020

[67]

Yumeng Yan, Huanyu Tao, Jiahua He, and Sheng-You Huang. The HDOCK server for integrated protein–protein docking. Nature protocols, 15(5):1829–1852, 2020

[68]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations, 2022

[69]

Xujun Zhang, Odin Zhang, Chao Shen, Wanglin Qu, Shicheng Chen, Hanqun Cao, Yu Kang, Zhe Wang, Ercheng Wang, Jintu Zhang, et al. Efficient and accurate large library ligand docking with KarmaDock. Nature Computational Science, 3(9):789–804, 2023

[70]

Ziqiao Zhang, Bangyi Zhao, Ailin Xie, Yatao Bian, and Shuigeng Zhou. Activity cliff prediction: Dataset and benchmark. arXiv preprint arXiv:2302.07541, 2023

[71]

Jie Zhu, Jingxiang Wang, Xin Wang, Mingjing Gao, Bingbing Guo, Miaomiao Gao, Jiarui Liu, Yanqiu Yu, Liang Wang, Weikaixin Kong, et al. Prediction of drug efficacy from transcriptional profiles with deep learning. Nature biotechnology, 39(11):1444–1452, 2021

Data Availability

Both the MolBench dataset (CSV format) and associated evaluation code can be acc...

[72]

Predicting the optimization ceiling from scaffold topology

A central finding of this study is that the triazolo-benzodiazepine scaffold imposes a hard upper bound on the achievable QED score. Here we derive this bound analytically. QED is defined as a weighted geometric mean of eight component desirability functions d_i (ref. [9]): QED = exp(Σ_{i=1}^{8} w_i ln d_i ...
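The weighted geometric mean named above can be sketched in a few lines. This is only an illustration: the equal weights and the desirability values are hypothetical placeholders, not the paper's numbers. The helper `ceiling_with_fixed_components` is likewise an assumption, showing one simple way a ceiling can arise (pin some components, set the rest to their maximum of 1). It is not the paper's derivation, which is truncated here.

```python
import math


def weighted_geometric_mean(desirabilities, weights):
    """Return exp(sum(w_i * ln d_i) / sum(w_i)) for d_i in (0, 1]."""
    if len(desirabilities) != len(weights):
        raise ValueError("need one weight per desirability")
    if any(d <= 0 for d in desirabilities):
        raise ValueError("desirabilities must be strictly positive")
    log_sum = sum(w * math.log(d) for d, w in zip(desirabilities, weights))
    return math.exp(log_sum / sum(weights))


def ceiling_with_fixed_components(fixed, weights):
    """Best achievable score when some components are pinned.

    `fixed` maps a component index to its pinned desirability; every
    other component is set to 1.0, the largest value it can take.
    """
    best_case = [fixed.get(i, 1.0) for i in range(len(weights))]
    return weighted_geometric_mean(best_case, weights)


# Hypothetical equal weights for eight components (illustrative only).
WEIGHTS = [1.0] * 8

uniform = weighted_geometric_mean([0.5] * 8, WEIGHTS)
ceiling = ceiling_with_fixed_components({0: 0.25, 1: 0.25}, WEIGHTS)
print(f"uniform={uniform:.4f} ceiling={ceiling:.4f}")
```

Because the score is a geometric mean, a single low component that the scaffold cannot change caps the result no matter how well the other components are optimized.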

[73]

Tanimoto budget exhaustion as a convergence diagnostic

In our main Results we noted that the qualification rate (the fraction of generated molecules satisfying the Tanimoto ≥ 0.40 constraint) declined from 100% (R1–R2) to 57.6% (R5). We propose that this declining rate constitutes a generalizable convergence diagnostic that we term “Tanimoto budget exhaustion...
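The qualification rate described above can be sketched as follows, assuming each molecule is represented by the set of its "on" fingerprint bits. The bit sets below are toy values invented for illustration; only the 0.40 threshold comes from the text. Both function names are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints score 0.
        return 0.0
    return len(a & b) / len(union)


def qualification_rate(reference_bits, candidates, threshold=0.40):
    """Fraction of candidates with Tanimoto >= threshold to the reference."""
    if not candidates:
        return 0.0
    passed = sum(
        1 for bits in candidates if tanimoto(reference_bits, bits) >= threshold
    )
    return passed / len(candidates)


# Toy fingerprints (hypothetical, not taken from the paper).
reference = {1, 2, 3, 4, 5}
round_candidates = [
    {1, 2, 3, 4, 5},  # identical: similarity 1.0
    {1, 2, 3, 6},     # 3 shared of 6 total: 0.5
    {1, 2, 7, 8, 9},  # 2 shared of 8 total: 0.25
    {1, 6, 7},        # 1 shared of 7 total: about 0.14
]
rate = qualification_rate(reference, round_candidates)
print(f"qualification rate: {rate:.1%}")
```

Tracking this rate round by round gives the declining curve the text refers to. A real pipeline would compute the fingerprints with a cheminformatics toolkit rather than hand-written sets.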

[74]

Phase transitions versus gradual improvement in property optimization

Not all QED components improved gradually. The structural alerts (ALERTS) desirability exhibited a discontinuous phase transition: from 0.241 at R0 to 0.842 at R1, with no further change in R2–R5. This occurred because the starting molecule’s butyl ester triggered two Brenk structural alerts...

[75]

The interpretability–efficiency trade-off in generative design

The evaluation question instructed the agent to “propose 3–5 modified molecules with chemical rationale.” Instead, the agent employed REINVENT4 batch generation to produce 23–54 candidates per round and selected the best by QED ranking. This substitution raises a fundamental question about A...

[76]

Systematic blind spot detection in multi-objective optimization

The agent’s failure to detect the AMES mutagenicity deterioration (+180%, from 0.165 to 0.462) despite tracking over 13 ADMET endpoints illustrates a general vulnerability of attention-based monitoring. The agent explicitly tracked CYP3A4, hERG and DILI at each round, all of which improved...
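One way to guard against this kind of blind spot is to check every endpoint against its baseline each round, not only the ones being actively optimized. The sketch below is an assumption, not the paper's mechanism. It treats a higher predicted value as worse for every endpoint, uses a hypothetical 25% alarm threshold, and takes only the AMES values (0.165 and 0.462) from the text. The other endpoint values are invented placeholders.

```python
def relative_change(before, after):
    """Fractional change from `before` to `after` (0.5 means +50%)."""
    if before == 0:
        raise ValueError("baseline must be non-zero")
    return (after - before) / before


def flag_deteriorations(baseline, current, max_increase=0.25):
    """Return endpoints whose predicted risk rose by more than max_increase.

    Every endpoint in `baseline` is checked, including ones the
    optimizer is not explicitly targeting.
    """
    flagged = {}
    for endpoint, before in baseline.items():
        change = relative_change(before, current[endpoint])
        if change > max_increase:
            flagged[endpoint] = change
    return flagged


# AMES values are from the text; the rest are hypothetical placeholders.
baseline = {"AMES": 0.165, "CYP3A4": 0.60, "hERG": 0.50, "DILI": 0.70}
current = {"AMES": 0.462, "CYP3A4": 0.40, "hERG": 0.30, "DILI": 0.55}

alarms = flag_deteriorations(baseline, current)
for endpoint, change in alarms.items():
    print(f"{endpoint}: {change:+.0%}")
```

With these inputs only AMES is flagged, at +180%, which matches the deterioration quoted above.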

[77]

Cost-effectiveness and practical stopping rules

The diminishing returns pattern (Fig. 7H) has direct implications for computational resource allocation. Assuming roughly equal tool-call costs per round, and measuring against the R0-to-R4 improvement (+0.4216) since R4 is the recommended molecule, R1 delivers 83.0%, R1–R2 deliver 88.1%, R1–R3 deliver 94.0%, and R1–R4 ...
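The cumulative shares quoted above can be turned into per-round marginal shares with simple arithmetic. In this sketch the R1–R4 share is set to 100% because the shares are measured against the R0-to-R4 improvement. That value follows from the definition and is not read from the truncated text. The 10% stopping threshold and the function names are hypothetical.

```python
TOTAL_GAIN = 0.4216  # R0-to-R4 improvement, from the text

# Cumulative share of TOTAL_GAIN reached by the end of each round.
# R4 is 1.0 by definition of the baseline, not a value quoted in the text.
CUMULATIVE_SHARE = {"R1": 0.830, "R2": 0.881, "R3": 0.940, "R4": 1.000}


def marginal_shares(cumulative):
    """Share of the total gain contributed by each round on its own."""
    marginal, previous = {}, 0.0
    for round_name, share in cumulative.items():
        marginal[round_name] = round(share - previous, 3)
        previous = share
    return marginal


def first_round_below(marginal, min_share):
    """First round whose marginal share is below min_share, else None."""
    for round_name, share in marginal.items():
        if share < min_share:
            return round_name
    return None


marginal = marginal_shares(CUMULATIVE_SHARE)
absolute_gain_r1 = round(CUMULATIVE_SHARE["R1"] * TOTAL_GAIN, 4)
stop_round = first_round_below(marginal, min_share=0.10)
print(marginal, absolute_gain_r1, stop_round)
```

Under these assumptions R1 alone contributes about 0.35 of the 0.4216 gain, and each later round adds roughly five to six percent of the total.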

[78]

Multi-round iterative optimization as an emergent agent capability

The E2E-Q3 task required the AI agent to execute an iterative closed-loop optimization cycle (Strategize, Generate, Dock, Evaluate) across up to 15 rounds, with autonomous decision-making at each round boundary. This represents a fundamentally different challenge from single-step computat...
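The shape of such a closed loop can be sketched with the four stages passed in as plain callables. Everything here is a hypothetical skeleton: the function name, the patience-based stopping rule, and the toy stand-ins (integers as "molecules") are assumptions made for illustration. Only the four stage names and the 15-round cap come from the text.

```python
def run_closed_loop(seed, strategize, generate, dock, evaluate,
                    max_rounds=15, patience=2):
    """Iterate Strategize -> Generate -> Dock -> Evaluate.

    Stops after max_rounds, or earlier once `patience` consecutive
    rounds fail to improve the best score (an assumed stopping rule).
    """
    best, best_score = seed, evaluate(seed, dock(seed))
    history = [best_score]
    stale_rounds = 0
    for _ in range(max_rounds):
        plan = strategize(best, history)
        candidates = generate(best, plan)
        if not candidates:
            break
        scored = [(evaluate(c, dock(c)), c) for c in candidates]
        round_score, round_best = max(scored, key=lambda pair: pair[0])
        if round_score > best_score:
            best, best_score, stale_rounds = round_best, round_score, 0
        else:
            stale_rounds += 1
        history.append(best_score)
        if stale_rounds >= patience:
            break
    return best, history


# Toy stand-ins: molecules are integers and the optimum is at 5.
best, history = run_closed_loop(
    seed=0,
    strategize=lambda current, past: 1,                  # step size
    generate=lambda current, step: [current + step, current + 2 * step],
    dock=lambda molecule: float(molecule),               # fake docking score
    evaluate=lambda molecule, score: -abs(score - 5.0),  # higher is better
)
print(best, history)
```

In the toy run the loop climbs to the optimum in three rounds, then stops after two rounds without improvement. A real agent would replace the lambdas with tool calls.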

[79]

Long-range planning, self-repair, and emergent medicinal chemistry knowledge

The agent autonomously authored four pipeline versions (v1–v4, totaling 163 KB of Python), progressively diagnosing and recovering from crashes: v1 failed due to NumPy/RDKit incompatibility, v2 succeeded through R1 but crashed on an f-string bug, v3 resumed from R2 with pre-pro...

[80]

Agent versus REINVENT: complementary collaboration rather than competition

The 3:3 tie in round winners and the non-significant pooled comparison (p = 0.104) mask a deeper complementarity. REINVENT excelled at creative molecular recombination: it serendipitously discovered methoxy shortening in R1 (not hypothesized by the agent), generated the F+OH+CH3 m...