arxiv: 2605.21740 · v2 · pith:V3FYH6XJnew · submitted 2026-05-20 · 💻 cs.AI

SMDD-Bench: Can LLMs Solve Real-World Small Molecule Drug Design Tasks?

Pith reviewed 2026-05-22 08:44 UTC · model grok-4.3

classification 💻 cs.AI
keywords small molecule drug design, LLM agents, agentic benchmark, drug discovery, pharmacophore identification, scaffold hopping, lead optimization, fragment assembly

The pith

Even the most advanced LLMs solve only about 40 percent of tasks in a new benchmark for small molecule drug design.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SMDD-Bench to measure how well LLM agents handle actual small molecule drug design work across varied chemistries and targets. The benchmark contains 502 multi-turn tasks in five categories that demand chemical reasoning, 3D intuition, specialized tools, and extended planning. When seven leading models are tested, the strongest one completes just 40.2 percent of the tasks. A reader would conclude that current LLMs lack the integrated skills for autonomous computational drug design. The benchmark is offered as a standard test to encourage progress toward fully automated systems.

Core claim

SMDD-Bench consists of 502 guaranteed-solvable task instances across five types—2D Pharmacophore Identification, Interaction Point Discovery, Scaffold Hopping, Lead Optimization, and Fragment Assembly—spanning wide chemical space and 102 unique protein targets. The central result is that even GPT5.4, the best model evaluated among seven frontier LLMs, solves only 40.2 percent of tasks, revealing shortfalls in chemical and biological reasoning, 3D intuition, tool use, and planning over limited oracle calls.
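As a rough sanity check on the headline figure, the sketch below converts the reported 40.2 percent into an approximate task count. Only the percentage and the total of 502 tasks come from the paper; the count of about 202 solved tasks is an inference made here, not a number the authors state.

```python
TOTAL_TASKS = 502       # from the paper
REPORTED_RATE = 0.402   # from the paper (best model evaluated)

# Inferred, not reported: nearest whole number of solved tasks.
approx_solved = round(TOTAL_TASKS * REPORTED_RATE)


def success_rate(solved: int, total: int) -> float:
    """Fraction of task instances solved."""
    if total <= 0:
        raise ValueError("total must be positive")
    return solved / total


print(approx_solved)
print(f"{success_rate(approx_solved, TOTAL_TASKS):.1%}")
```

Under this reading, roughly 300 of the 502 instances remain unsolved by the strongest model.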

What carries the argument

SMDD-Bench, a multi-turn long-horizon agentic benchmark of 502 task instances that evaluates LLM agents on small molecule drug design requiring chemical reasoning and tool use.

If this is right

  • LLM agents must develop stronger chemical reasoning and 3D spatial intuition to complete these tasks reliably.
  • A standardized benchmark allows systematic tracking of progress toward autonomous drug design systems.
  • Full success on the benchmark would require effective long-horizon planning within limited oracle calls.
  • Current performance levels indicate that fully autonomous computational drug design is not yet achievable with frontier models.
  • The wide coverage of chemical space and targets means gains would apply across many discovery scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid LLM systems paired with dedicated chemistry simulation tools may close the gap faster than LLM scaling alone.
  • Extensions could add experimental feedback or multi-target constraints to test more realistic discovery pipelines.
  • The results point to a need for training methods that embed domain knowledge more deeply into models for scientific tasks.
  • Better performance here might shorten early drug design cycles by automating initial iteration steps.

Load-bearing premise

The 502 task instances are representative of real-world small molecule drug design challenges and are solvable with appropriate chemical and biological reasoning plus tool use.

What would settle it

Evidence that expert human chemists with the same tools and oracles cannot solve most of the tasks, or data showing the tasks do not match typical challenges faced in actual pharmaceutical research.

Figures

Figures reproduced from arXiv: 2605.21740 by Amir Barati Farimani, Hamed Mahdavi, Kathy Wei, Kevin Han, Niloofar Mireshghallah, Renfei Zhang.

Figure 1: Overview of SMDD-Bench’s task types and example reasoning turns.
Figure 2: (a) The distribution of SMDD-Bench tasks across the five task types. Each task type is …
Figure 3: (a) Frequency with which each ADMET and binding affinity property appears as an …
Figure 4: The complete breakdown of task instances into protein targets. There are 102 unique …
Figure 5: The baseline ADMET property values for all optimization objectives and hold-constant …
Figure 6: A histogram of the Tanimoto similarities between all pairs of reference molecules provided …
Figure 7: A breakdown of the task types of SMDD-Bench into the families of the protein targets …
Figure 8: The number of task instances that each LLM agent gets correct as well as the number of …
Figure 9: The number of task instances that each LLM agent gets correct as well as the number of …
Figure 10: The average success rate across all 7 evaluated frontier LLM agents with respect to …
Figure 11: (a) The mean success rate across all 7 agents evaluated on SMDD-Bench plotted against …
Figure 12: We investigate the relationship between the submissions of multiple different LLMs on the …
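
Figure 6 refers to Tanimoto similarities between pairs of reference molecules. As a minimal sketch of how such pairwise values are computed, the code below applies the standard Tanimoto (Jaccard) formula to fingerprints represented as sets of on-bit indices. The toy fingerprints are hypothetical, and the paper's actual fingerprint type and tooling are not specified in the text reproduced here.

```python
from itertools import combinations


def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto similarity: |A intersect B| / |A union B| over on-bit sets."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def pairwise_similarities(fps: list[set[int]]) -> list[float]:
    """Similarity for every unordered pair, in itertools.combinations order."""
    return [tanimoto(x, y) for x, y in combinations(fps, 2)]


# Hypothetical toy fingerprints (sets of on-bit indices), not from the paper.
fps = [{1, 2, 3, 4}, {3, 4, 5}, {7, 8}]
sims = pairwise_similarities(fps)
print(sims)
```

A histogram of such values concentrated near zero would indicate structurally diverse reference molecules.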
Original abstract

LLM agents have incredible potential for scientific discovery applications. However, the performance of LLM agents on real-world, small molecule drug design (SMDD) tasks across diverse chemistries and targets is unclear. Current evaluation methods are either ad hoc, too simple for real-world discovery, limited in scale, or restricted to single-turn question answering. In an effort to standardize the evaluation of LLM agents on small molecule design, we introduce SMDD-Bench, a challenging, multi-turn, long-horizon agentic benchmark consisting of 502 guaranteed-solvable task instances spanning 5 task types: 2D Pharmacophore Identification, Interaction Point Discovery, Scaffold Hopping, Lead Optimization, and Fragment Assembly. SMDD-Bench tasks span a wide region of chemical space and involve 102 unique protein targets. Completely solving the benchmark would require having strong chemical and biological reasoning and 3D intuition, understanding specialized tool use, and displaying planning expertise over a limited number of oracle calls. We benchmark 7 frontier open and closed source LLMs and find even the most performant LLM, GPT5.4, solves only 40.2% of tasks. We hope SMDD-Bench provides a standardized testbed to invigorate the field towards training and evaluating LLM agents for fully autonomous computational drug design. We host a public leaderboard at smddbench.com.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SMDD-Bench, a multi-turn agentic benchmark for LLM agents on small molecule drug design consisting of 502 guaranteed-solvable task instances spanning five categories (2D Pharmacophore Identification, Interaction Point Discovery, Scaffold Hopping, Lead Optimization, and Fragment Assembly) across 102 unique protein targets. The authors evaluate seven frontier open- and closed-source LLMs under constraints requiring chemical/biological reasoning, 3D intuition, tool use, and planning over a limited number of oracle calls. The central empirical result is that the strongest model, GPT5.4, solves only 40.2% of tasks. A public leaderboard is provided to standardize future evaluation.

Significance. If the tasks are representative of real-world challenges and verifiably solvable under the stated constraints, SMDD-Bench would constitute a meaningful advance over prior ad-hoc or single-turn evaluations by offering scale, diversity, and long-horizon agentic structure. The public leaderboard and focus on practical tool-use limits could usefully direct research toward autonomous computational drug design. The work's empirical nature and release of the benchmark are positive contributions to the AI-for-science literature.

major comments (2)
  1. [Task construction and validation] Task construction section: The manuscript asserts that the 502 instances are 'guaranteed-solvable' using chemical and biological reasoning, 3D intuition, and the provided tools within a limited oracle-call budget. No human-expert performance data collected under identical multi-turn, tool-interface, and call-budget conditions is reported. This is load-bearing for the central claim, because without such a baseline it is impossible to determine whether the reported LLM success rates (e.g., 40.2% for GPT5.4) measure reasoning deficits or partial unsolvability of the task set.
  2. [Results and evaluation protocol] Evaluation protocol (results section): Details on how solvability was validated for each of the five task categories and 102 targets, as well as the precise definition of a 'solved' trajectory (including success criteria and oracle-call limits), are insufficient to allow independent verification or reproduction of the performance numbers. This directly affects interpretability of the headline result.
minor comments (2)
  1. [Abstract] Abstract: A short statement of the concrete tools made available to agents and the maximum number of oracle calls permitted would help readers immediately gauge the benchmark's difficulty.
  2. [Throughout] Notation and figures: Ensure consistent use of 'GPT5.4' versus any other model naming conventions throughout the text and tables.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript introducing SMDD-Bench. We address each major comment below and describe the revisions we will incorporate to improve clarity and reproducibility.

Point-by-point responses
  1. Referee: [Task construction and validation] Task construction section: The manuscript asserts that the 502 instances are 'guaranteed-solvable' using chemical and biological reasoning, 3D intuition, and the provided tools within a limited oracle-call budget. No human-expert performance data collected under identical multi-turn, tool-interface, and call-budget conditions is reported. This is load-bearing for the central claim, because without such a baseline it is impossible to determine whether the reported LLM success rates (e.g., 40.2% for GPT5.4) measure reasoning deficits or partial unsolvability of the task set.

    Authors: We agree that human-expert performance data collected under identical conditions would aid interpretation. However, the 'guaranteed-solvable' status is not an assertion but follows from the expert-driven construction process: each task instance was explicitly designed by domain experts so that a finite sequence of tool calls and reasoning steps reaches the solution within the stated oracle budget. We will revise the Task Construction section to provide category-by-category examples of these solution trajectories and the validation steps used during design. Comprehensive human baseline collection under the precise multi-turn interface would require substantial additional resources and is outside the scope of this work, but the expanded construction details should address concerns about partial unsolvability. revision: partial

  2. Referee: [Results and evaluation protocol] Evaluation protocol (results section): Details on how solvability was validated for each of the five task categories and 102 targets, as well as the precise definition of a 'solved' trajectory (including success criteria and oracle-call limits), are insufficient to allow independent verification or reproduction of the performance numbers. This directly affects interpretability of the headline result.

    Authors: We acknowledge that the current description of the evaluation protocol lacks sufficient granularity for full reproducibility. In the revised manuscript we will expand the relevant subsection to specify: (i) the exact validation procedure used to confirm solvability for each task category and target, (ii) the formal definition of a solved trajectory, including quantitative success criteria (e.g., property thresholds or interaction matches), and (iii) the precise per-task oracle-call budgets. These additions will enable independent verification of the reported performance figures. revision: yes
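
The rebuttal refers to per-task oracle-call budgets and quantitative success criteria such as property thresholds. As a minimal sketch of what such a protocol could look like, the code below wraps a scoring function in a call counter and applies a threshold test to the result. The names `BudgetedOracle`, `BudgetExceeded`, and `is_solved`, the toy scoring function, and the threshold are all hypothetical; the paper's actual oracles, budgets, and success criteria are not given in the text reproduced here.

```python
from dataclasses import dataclass, field
from typing import Callable


class BudgetExceeded(RuntimeError):
    """Raised when more oracle calls are requested than the task allows."""


@dataclass
class BudgetedOracle:
    score_fn: Callable[[str], float]  # hypothetical: molecule string -> score
    max_calls: int                    # hypothetical per-task budget
    calls_used: int = 0
    history: list = field(default_factory=list)

    def __call__(self, molecule: str) -> float:
        if self.calls_used >= self.max_calls:
            raise BudgetExceeded(f"budget of {self.max_calls} calls exhausted")
        self.calls_used += 1
        score = self.score_fn(molecule)
        self.history.append((molecule, score))
        return score


def is_solved(final_score: float, threshold: float) -> bool:
    """Assumed success criterion: the submitted score meets a threshold."""
    return final_score >= threshold


# Toy oracle for illustration only: scores a string by its length.
oracle = BudgetedOracle(score_fn=lambda s: float(len(s)), max_calls=2)
first = oracle("CCO")
second = oracle("CCOC")
try:
    oracle("CCOCC")
    over_budget = False
except BudgetExceeded:
    over_budget = True
solved = is_solved(second, threshold=4.0)
```

Recording the call history alongside the counter makes a trajectory auditable after the fact, which is the kind of detail independent verification would need.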

Circularity Check

0 steps flagged

Empirical benchmark evaluation with no derivation chain or circular reduction

Full rationale

The paper introduces SMDD-Bench as an empirical testbed consisting of 502 task instances across five categories and 102 targets. Core results consist of direct performance measurements (e.g., GPT5.4 solving 40.2% of tasks) obtained by running LLMs on the benchmark under stated constraints. No equations, parameter fitting, predictions derived from subsets of data, or self-referential derivations appear in the provided text. The descriptor 'guaranteed-solvable' is attached to the task-construction process itself rather than functioning as a fitted input renamed as a prediction or a self-definition that collapses the evaluation metric. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to justify central claims. The work is therefore self-contained as a benchmark proposal whose validity rests on external task design and measured outcomes rather than any internal reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The headline performance numbers rest on the unverified premise that the constructed tasks are both representative and solvable; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Tasks are guaranteed-solvable with strong chemical and biological reasoning, 3D intuition, specialized tool use, and planning over limited oracle calls.
    Explicitly stated in the abstract as the requirement for completely solving the benchmark.
  • domain assumption The 502 instances span a wide region of chemical space and involve 102 unique protein targets.
    Claimed in the abstract without further validation details.

pith-pipeline@v0.9.0 · 5792 in / 1231 out tokens · 30184 ms · 2026-05-22T08:44:41.750033+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.