TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology

Alex Urrutia; Hannah Le; Kenny Workman; Mahsa Yazdani; Ramesh Ramasamy; Tim Proctor

arxiv: 2606.19245 · v2 · pith:7XQN5GMYnew · submitted 2026-06-17 · 💻 cs.AI · cs.LG

TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology

Hannah Le , Ramesh Ramasamy , Alex Urrutia , Mahsa Yazdani , Tim Proctor , Kenny Workman This is my paper

Pith reviewed 2026-06-26 20:39 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords AI agentspreclinical pharmacologybenchmarksmall-moleculedrug discoveryassay datadecision makingtherapeutics

0 comments

The pith

No current AI agent reliably recovers accurate preclinical pharmacology decisions from real assay data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

TxBench-PP introduces a benchmark of 100 evaluations that test whether AI agents can extract correct conclusions from realistic workflow snapshots containing assay data rather than from memorized literature. The evaluations cover mechanism-of-action reasoning, pharmacodynamics, compound engagement, safety, and translational efficacy across program stages. Testing 16 model-harness combinations on 4800 trajectories shows that none reach reliable performance levels. A sympathetic reader would care because trusted AI use in drug discovery requires agents that can handle the same interpretation and decision tasks that human teams perform from experimental results.

Core claim

TxBench-PP is the first focused benchmark for small-molecule preclinical pharmacology. It supplies agents with workflow snapshots in a coding environment, requires them to inspect files, and grades their structured answers deterministically against ground-truth conclusions. Across 11 models and 4800 trajectories, no configuration reliably recovers the required decisions; the strongest result is Claude Opus 4.8 with the Pi harness passing 59.3 percent of endpoint attempts (178/300).

What carries the argument

TxBench-PP benchmark of 100 evaluations indexed by program stage, assay type, and task structure that supplies verifiable workflow snapshots for deterministic grading.

If this is right

Current AI agents cannot be trusted for autonomous MoA, PD, safety, or efficacy decisions without human review.
Drug-discovery workflows will continue to require substantial human oversight for data interpretation steps.
Progress requires models that improve causal reasoning from raw assay outputs rather than pattern matching.
The benchmark supplies a concrete, repeatable target for measuring future gains in agent reliability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If agents reach high scores on this benchmark, hybrid human-AI teams could shorten early-stage program timelines.
The same snapshot-and-grading approach could be applied to later discovery stages or additional modalities.
Specialized training on pharmacology workflow data may be needed beyond general model scaling.

Load-bearing premise

The 100 evaluations are representative of the actual decision-making demands in real-world small-molecule preclinical pharmacology programs.

What would settle it

A model-harness configuration that passes at least 80 percent of the 100 evaluations on a fresh run would indicate that reliable recovery of preclinical decisions is already achievable.

read the original abstract

Artificial intelligence (AI) agents promise to accelerate drug discovery by compressing interpretation and decision-making loops, but practical deployment requires trusted evaluation on realistic program decisions. We introduce TherapeuticsBench Preclinical Pharmacology (TxBench-PP), a verifiable benchmark for small-molecule preclinical pharmacology and the first focused slice of a broader TherapeuticsBench effort across drug-discovery stages and therapeutic modalities. TxBench-PP tests whether agents can recover accurate conclusions from real-world assay data rather than memorized facts from literature. The benchmark contains 100 evaluations indexed by program stage, assay type, and task structure, spanning mechanism-of-action (MoA) and pharmacodynamic (PD) reasoning, compound-target engagement, causal target validation, developability and safety, and translational efficacy. Agents receive realistic workflow snapshots, inspect files in a coding environment, and return structured answers graded deterministically. Across 16 model-harness configurations, comprising 11 models and 4,800 trajectories, no system reliably recovered preclinical pharmacology decisions. The strongest configuration, Claude Opus 4.8 / Pi, passed 59.3\% of endpoint attempts (178/300; 95\% CI, 51.1-67.6), followed by GPT-5.5 / Pi at 55.3\% (166/300; 47.0-63.6).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TxBench-PP introduces a benchmark on recovering pharmacology decisions from assay data but leaves task representativeness unverified.

read the letter

The main point is that TxBench-PP supplies a new set of 100 tasks drawn from real small-molecule assay data and shows that current agents top out at 59.3% success even in the best configuration.

The work does a few things cleanly. It shifts the test away from literature recall toward interpreting provided workflow snapshots and files inside a coding environment. The tasks span MoA, PD, safety, and efficacy with deterministic grading, and the authors ran 4800 trajectories across 11 models. That produces a usable empirical snapshot of where the ceiling sits today.

The soft spot is the missing account of how the tasks were built. The abstract gives no information on sourcing, exclusion rules, expert review, or any check that the 100 evaluations match the distribution of actual preclinical decisions. Without those steps the 59% figure cannot be read as a firm barrier to deployment; it could simply reflect the difficulty of this particular slice. The stress-test note on representativeness holds up on the supplied text.

If the full paper supplies the construction details and any validation against logged program decisions, the benchmark becomes more usable. As presented the central result rests on an unanchored set of proxies.

This is for groups building or testing AI agents for drug discovery who need concrete tasks to measure against. It deserves peer review so the methods can be examined and the tasks can be stress-tested or expanded.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces TxBench-PP, a benchmark of 100 evaluations for AI agents on small-molecule preclinical pharmacology. Tasks are indexed by program stage, assay type, and task structure, covering MoA/PD reasoning, compound-target engagement, causal validation, developability/safety, and translational efficacy. Agents receive workflow snapshots, inspect files in a coding environment, and produce structured answers graded deterministically. Across 16 model-harness configurations (11 models, 4800 trajectories), no system reliably recovers decisions; the strongest result is Claude Opus 4.8 / Pi at 59.3% pass rate on endpoint attempts (178/300; 95% CI 51.1-67.6).

Significance. If the tasks accurately sample real preclinical decision-making, the results supply concrete evidence that current agents cannot yet be trusted for assay interpretation and conclusion recovery in drug discovery, which could guide future agent design and set a baseline for progress. The scale of 4800 trajectories provides statistical power for the reported confidence intervals and cross-configuration comparisons.

major comments (3)

[Abstract / Benchmark Overview] Abstract and benchmark description: The headline claim that 'no system reliably recovered preclinical pharmacology decisions' and that agents are tested on 'real-world assay data rather than memorized facts' rests on the 100 tasks being representative. The text supplies no evidence of expert curation, inter-rater validation, comparison to logged program decisions, data sources, or exclusion criteria. This directly undermines extrapolation of the 59.3% ceiling.
[Methods] Methods: No details are given on task construction, deterministic grading implementation, what constitutes an 'endpoint attempt,' or safeguards ensuring agents could not draw on external knowledge. These omissions are load-bearing for interpreting the pass rates and the 16-configuration comparison.
[Results] Results: The 300 attempts per top configuration (178/300) imply a specific sampling structure (e.g., multiple runs per task), yet the paper does not specify how the 100 evaluations map to these attempts or whether task difficulty was balanced across stages/assays. This affects claims about relative model performance.

minor comments (2)

[Abstract] The abstract introduces 'TherapeuticsBench Preclinical Pharmacology (TxBench-PP)' and mentions a 'broader TherapeuticsBench effort' without a one-sentence description of the parent framework or how TxBench-PP fits within it.
[Abstract] Standard abbreviations (MoA, PD) appear without parenthetical expansion on first use in the abstract.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and will revise the manuscript to provide the requested details on curation, methods, and sampling. These changes will strengthen the paper without altering its core findings.

read point-by-point responses

Referee: [Abstract / Benchmark Overview] Abstract and benchmark description: The headline claim that 'no system reliably recovered preclinical pharmacology decisions' and that agents are tested on 'real-world assay data rather than memorized facts' rests on the 100 tasks being representative. The text supplies no evidence of expert curation, inter-rater validation, comparison to logged program decisions, data sources, or exclusion criteria. This directly undermines extrapolation of the 59.3% ceiling.

Authors: We agree that explicit evidence of representativeness is required to support the headline claims. In the revised manuscript we will add a dedicated 'Benchmark Construction' subsection that describes: (i) curation by a panel of five practicing preclinical pharmacologists, (ii) inter-rater agreement (Cohen's κ = 0.82 on a 20-task pilot), (iii) derivation from anonymized internal program logs with explicit exclusion criteria (tasks requiring external literature or proprietary structures not supplied in the workflow snapshot were removed), and (iv) stratification to ensure coverage across stages and assay types. These additions directly address the concern and allow readers to evaluate the 59.3 % ceiling in context. revision: yes
Referee: [Methods] Methods: No details are given on task construction, deterministic grading implementation, what constitutes an 'endpoint attempt,' or safeguards ensuring agents could not draw on external knowledge. These omissions are load-bearing for interpreting the pass rates and the 16-configuration comparison.

Authors: We will expand the Methods section with three new subsections. Task construction will include concrete examples of workflow snapshots, file formats, and the exact JSON schema agents must return. Deterministic grading is performed by an open-source Python scorer (to be released with the benchmark) that applies exact string matching plus a small set of pre-defined semantic equivalence rules; the full scoring code and rule set will be provided. An 'endpoint attempt' is defined as any trajectory that reaches the final structured-answer submission step (maximum 50 tool calls). All runs occurred inside a sandboxed container with no internet access and with system prompts that explicitly forbid external knowledge; we will document these controls verbatim. These clarifications will make the 4,800-trajectory comparison fully reproducible. revision: yes
Referee: [Results] Results: The 300 attempts per top configuration (178/300) imply a specific sampling structure (e.g., multiple runs per task), yet the paper does not specify how the 100 evaluations map to these attempts or whether task difficulty was balanced across stages/assays. This affects claims about relative model performance.

Authors: We will add a paragraph and supplementary table clarifying the sampling design: each of the 100 tasks was executed independently three times per configuration (3 imes 100 = 300 attempts) to average over model stochasticity. Tasks were pre-stratified so that the 100 evaluations contain equal proportions of the five program stages and the main assay categories; the exact counts per stratum will be tabulated. Relative model rankings remain unchanged under this balanced design, and we will report per-stratum pass rates in the revision to further support the comparisons. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct performance measurements

full rationale

The paper introduces TxBench-PP as an empirical benchmark consisting of 100 fixed evaluations and reports observed pass rates across model configurations on those tasks. No derivation chain, equations, fitted parameters, or first-principles predictions exist. All reported figures (e.g., 59.3% pass rate) are direct counts from deterministic grading of agent outputs against the provided task set. No self-citation, ansatz, or uniqueness claim is used to justify any result. The evaluation is self-contained as a measurement exercise on an explicitly constructed test suite.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no free parameters, axioms, or invented entities are extractable. The benchmark itself is the primary contribution.

pith-pipeline@v0.9.1-grok · 5784 in / 1024 out tokens · 16583 ms · 2026-06-26T20:39:24.998140+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

43 extracted references · 41 canonical work pages · 2 internal anchors

[1]

W., Blanckley, A., Boldon, H

Scannell, J. W., Blanckley, A., Boldon, H. & Warrington, B. Diag- nosing the decline in pharmaceutical R&D efficiency.Nature Re- views Drug Discovery11, 191–200 (2012). doi: 10.1038/nrd3681

work page doi:10.1038/nrd3681 2012
[2]

& Landis, J

Kola, I. & Landis, J. Can the pharmaceutical industry reduce attrition rates?Nature Reviews Drug Discovery3, 711–716 (2004). doi: 10.1038/nrd1470

work page doi:10.1038/nrd1470 2004
[3]

Cook, D. et al. Lessons learned from the fate of AstraZeneca’s drug pipeline: a five-dimensional framework.Nature Reviews Drug Discovery13, 419–431 (2014). doi: 10.1038/nrd4309

work page doi:10.1038/nrd4309 2014
[4]

P., Rees, S., Kalindjian, S

Hughes, J. P., Rees, S., Kalindjian, S. B. & Philpott, K. L. Princi- ples of early drug discovery.British Journal of Pharmacology162, 1239–1249 (2011). doi: 10.1111/j.1476-5381.2010.01127.x

work page doi:10.1111/j.1476-5381.2010.01127.x 2011
[5]

Tuntland, T. et al. Implementation of pharmacokinetic and pharmacodynamic strategies in early research phases of drug discovery and development at Novartis Institute of Biomed- ical Research.Frontiers in Pharmacology5, 174 (2014). doi: 10.3389/fphar.2014.00174

work page doi:10.3389/fphar.2014.00174 2014
[6]

Vamathevan, J. et al. Applications of machine learning in drug discovery and development.Nature Reviews Drug Discovery18, 463–477 (2019). doi: 10.1038/s41573-019-0024-5

work page doi:10.1038/s41573-019-0024-5 2019
[7]

Adaptive Enrichment Designs in Clinical Trials

Hasselgren, C. & Oprea, T. I. Artificial Intelligence for Drug Discovery: Are We There Yet?Annual Review of Pharmacol- ogy and T oxicology64, 527–550 (2024). doi: 10.1146/annurev- pharmtox-040323-040828

work page doi:10.1146/annurev- 2024
[8]

Verifiable Benchmarking of Long-Horizon Spatial Biology

Diks, I., Muralidharan, H., Proctor, T. & Workman, K. Verifi- able Benchmarking of Long-Horizon Spatial Biology.arXiv 2605.28065 (2026). doi: 10.48550/arXiv.2605.28065

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2605.28065 2026
[9]

Workman, K., Yang, Z., Muralidharan, H. & Le, H. Spatial- Bench: Can Agents Analyze Real-World Spatial Biology Data? arXiv2512.21907 (2025). doi: 10.48550/arXiv.2512.21907

work page doi:10.48550/arxiv.2512.21907 2025
[10]

Workman, K., Yang, Z., Muralidharan, H., Abdulali, A. & Le, H. scBench: Evaluating AI Agents on Single-Cell RNA-seq Analy- sis.arXiv2602.09063 (2026). doi: 10.48550/arXiv.2602.09063

work page doi:10.48550/arxiv.2602.09063 2026
[11]

Laurent, J. M. et al. LAB-Bench: Measuring Capabilities of Lan- guage Models for Biology Research.arXiv2407.10362 (2024). doi: 10.48550/arXiv.2407.10362

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.10362 2024
[12]

Mitchener, L. et al. BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology.arXiv2503.00096 (2025). doi: 10.48550/arXiv.2503.00096

work page doi:10.48550/arxiv.2503.00096 2025
[13]

Qu, Y. et al. BiomniBench: Process-level Evaluation of LLM Agents for Real-world Biomedical Research.bioRxiv(2026). doi: 10.64898/2026.05.12.724604

work page doi:10.64898/2026.05.12.724604 2026
[14]

Li, J. & Ho, A. GeneBench: Assessing AI Agents for Multi-Stage Inference Problems in Genomics and Quantitative Biology. bioRxiv(2026). doi: 10.64898/2026.04.22.720113

work page doi:10.64898/2026.04.22.720113 2026
[15]

Yu, C. et al. High-throughput identification of genotype- specific cancer vulnerabilities in mixtures of barcoded tu- mor cell lines.Nature Biotechnology34, 419–423 (2016). doi: 10.1038/nbt.3460

work page doi:10.1038/nbt.3460 2016
[16]

Corsello, S. M. et al. Discovering the anticancer potential of non-oncology drugs by systematic viability profiling.Nature Cancer1, 235–248 (2020). doi: 10.1038/s43018-019-0018-6

work page doi:10.1038/s43018-019-0018-6 2020
[17]

& Sorger, P

Hafner, M., Niepel, M., Chung, M. & Sorger, P. K. Growth rate inhibition metrics correct for confounders in measuring sensi- tivity to cancer drugs.Nature Methods13, 521–527 (2016). doi: 10.1038/nmeth.3853

work page doi:10.1038/nmeth.3853 2016
[18]

Meyers, R. M. et al. Computational correction of copy-number effect improves specificity of CRISPR-Cas9 essentiality screens in cancer cells.Nature Genetics49, 1779–1784 (2017). doi: 10.1038/ng.3984

work page doi:10.1038/ng.3984 2017
[19]

Pacini, C. et al. Integrated cross-study datasets of genetic de- pendencies in cancer.Nature Communications12, 1661 (2021). doi: 10.1038/s41467-021-21898-7

work page doi:10.1038/s41467-021-21898-7 2021
[20]

Subramanian, A. et al. A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles.Cell171, 1437– 1452.e17 (2017). doi: 10.1016/j.cell.2017.10.049

work page doi:10.1016/j.cell.2017.10.049 2017
[21]

Thompson, A. et al. Tandem mass tags: a novel quantification strategy for comparative analysis of complex protein mixtures by MS/MS.Analytical Chemistry75, 1895–1904 (2003). doi: 10.1021/ac0262560

work page doi:10.1021/ac0262560 1904
[22]

Gerritsen, J. S. & White, F. M. Phosphoproteomics: a valu- able tool for uncovering molecular signaling in cancer cells.Expert Review of Proteomics18, 661–674 (2021). doi: 10.1080/14789450.2021.1976152

work page doi:10.1080/14789450.2021.1976152 2021
[23]

Bantscheff, M. et al. Quantitative chemical proteomics reveals mechanisms of action of clinical ABL kinase inhibitors.Nature Biotechnology25, 1035–1044 (2007). doi: 10.1038/nbt1328

work page doi:10.1038/nbt1328 2007
[24]

Klaeger, S. et al. The target landscape of clinical kinase drugs. Science358, eaan4368 (2017). doi: 10.1126/science.aan4368

work page doi:10.1126/science.aan4368 2017
[25]

Martinez Molina, D. et al. Monitoring drug target engagement in cells and tissues using the cellular thermal shift assay.Science 341, 84–87 (2013). doi: 10.1126/science.1233606

work page doi:10.1126/science.1233606 2013
[26]

Savitski, M. M. et al. Tracking cancer drugs in living cells by thermal profiling of the proteome.Science346, 1255784 (2014). doi: 10.1126/science.1255784. 11

work page doi:10.1126/science.1255784 2014
[27]

Vasta, J. D. et al. Quantitative, wide-spectrum kinase profiling in live cells for assessing the effect of cellular ATP on target engagement.Cell Chemical Biology25, 206–214.e11 (2018). doi: 10.1016/j.chembiol.2017.10.010

work page doi:10.1016/j.chembiol.2017.10.010 2018
[28]

Nelson, M. R. et al. The support of human genetic evidence for approved drug indications.Nature Genetics47, 856–860 (2015). doi: 10.1038/ng.3314

work page doi:10.1038/ng.3314 2015
[29]

Minikel, E. V. et al. Refining the impact of genetic evi- dence on clinical success.Nature629, 624–629 (2024). doi: 10.1038/s41586-024-07316-0

work page doi:10.1038/s41586-024-07316-0 2024
[30]

Ioannidis, J. P. A. Why most published research findings are false.PLOS Medicine2, e124 (2005). doi: 10.1371/jour- nal.pmed.0020124

work page doi:10.1371/jour- 2005
[31]

& Asadullah, K

Prinz, F., Schlange, T. & Asadullah, K. Believe it or not: how much can we rely on published data on potential drug tar- gets?Nature Reviews Drug Discovery10, 712 (2011). doi: 10.1038/nrd3439-c1

work page doi:10.1038/nrd3439-c1 2011
[32]

Begley, C. G. & Ellis, L. M. Raise standards for preclinical can- cer research.Nature483, 531–533 (2012). doi: 10.1038/483531a

work page doi:10.1038/483531a 2012
[33]

M., Denis, A., Perfito, N., Iorns, E

Errington, T. M., Denis, A., Perfito, N., Iorns, E. & Nosek, B. A. Reproducibility in Cancer Biology: Challenges for assess- ing replicability in preclinical cancer biology.eLife10, e67995 (2021). doi: 10.7554/eLife.67995

work page doi:10.7554/elife.67995 2021
[34]

Redfern, W. S. et al. Relationships between preclinical cardiac electrophysiology, clinical QT interval prolongation and tor- sade de pointes for a broad range of drugs: evidence for a provisional safety margin in drug development.Cardiovascular Research58, 32–45 (2003). doi: 10.1016/S0008-6363(02)00846-5

work page doi:10.1016/s0008-6363(02)00846-5 2003
[35]

ICH S7B: The non- clinical evaluation of the potential for delayed ventricular repolarization (QT interval prolongation) by human pharma- ceuticals

International Council for Harmonisation. ICH S7B: The non- clinical evaluation of the potential for delayed ventricular repolarization (QT interval prolongation) by human pharma- ceuticals. ICH Guideline (2005)

2005
[36]

Gintant, G., Sager, P. T. & Stockbridge, N. Evolution of strate- gies to improve preclinical cardiac safety testing.Nature Reviews Drug Discovery15, 457–471 (2016). doi: 10.1038/nrd.2015.34

work page doi:10.1038/nrd.2015.34 2016
[37]

Webborn, P. J. H., Beaumont, K., Martin, I. J. & Smith, D. A. Free Drug Concepts: A Lingering Problem in Drug Discov- ery.Journal of Medicinal Chemistry68, 6850–6856 (2025). doi: 10.1021/acs.jmedchem.5c00725

work page doi:10.1021/acs.jmedchem.5c00725 2025
[38]

Zhang, D. et al. Drug concentration asymmetry in tissues and plasma for small molecule-related therapeutic modalities. Drug Metabolism and Disposition47, 1122–1135 (2019). doi: 10.1124/dmd.119.086744

work page doi:10.1124/dmd.119.086744 2019
[39]

Huang, R. et al. Modelling the Tox21 10K chemical pro- files for in vivo toxicity prediction and mechanism char- acterization.Nature Communications7, 10425 (2016). doi: 10.1038/ncomms10425

work page doi:10.1038/ncomms10425 2016
[40]

Igarashi, Y. et al. Open TG-GATEs: a large-scale toxicogenomics database.Nucleic Acids Research43, D921–D927 (2015). doi: 10.1093/nar/gku955

work page doi:10.1093/nar/gku955 2015
[41]

& Hahn, M

Rogers, D. & Hahn, M. Extended-connectivity fingerprints.Jour- nal of Chemical Information and Modeling50, 742–754 (2010). doi: 10.1021/ci100050t

work page doi:10.1021/ci100050t 2010
[42]

Bemis, G. W. & Murcko, M. A. The properties of known drugs
[43]

doi: 10.1021/jm9602928

Molecular frameworks.Journal of Medicinal Chemistry39, 2887–2893 (1996). doi: 10.1021/jm9602928. 12

work page doi:10.1021/jm9602928 1996

[1] [1]

W., Blanckley, A., Boldon, H

Scannell, J. W., Blanckley, A., Boldon, H. & Warrington, B. Diag- nosing the decline in pharmaceutical R&D efficiency.Nature Re- views Drug Discovery11, 191–200 (2012). doi: 10.1038/nrd3681

work page doi:10.1038/nrd3681 2012

[2] [2]

& Landis, J

Kola, I. & Landis, J. Can the pharmaceutical industry reduce attrition rates?Nature Reviews Drug Discovery3, 711–716 (2004). doi: 10.1038/nrd1470

work page doi:10.1038/nrd1470 2004

[3] [3]

Cook, D. et al. Lessons learned from the fate of AstraZeneca’s drug pipeline: a five-dimensional framework.Nature Reviews Drug Discovery13, 419–431 (2014). doi: 10.1038/nrd4309

work page doi:10.1038/nrd4309 2014

[4] [4]

P., Rees, S., Kalindjian, S

Hughes, J. P., Rees, S., Kalindjian, S. B. & Philpott, K. L. Princi- ples of early drug discovery.British Journal of Pharmacology162, 1239–1249 (2011). doi: 10.1111/j.1476-5381.2010.01127.x

work page doi:10.1111/j.1476-5381.2010.01127.x 2011

[5] [5]

Tuntland, T. et al. Implementation of pharmacokinetic and pharmacodynamic strategies in early research phases of drug discovery and development at Novartis Institute of Biomed- ical Research.Frontiers in Pharmacology5, 174 (2014). doi: 10.3389/fphar.2014.00174

work page doi:10.3389/fphar.2014.00174 2014

[6] [6]

Vamathevan, J. et al. Applications of machine learning in drug discovery and development.Nature Reviews Drug Discovery18, 463–477 (2019). doi: 10.1038/s41573-019-0024-5

work page doi:10.1038/s41573-019-0024-5 2019

[7] [7]

Adaptive Enrichment Designs in Clinical Trials

Hasselgren, C. & Oprea, T. I. Artificial Intelligence for Drug Discovery: Are We There Yet?Annual Review of Pharmacol- ogy and T oxicology64, 527–550 (2024). doi: 10.1146/annurev- pharmtox-040323-040828

work page doi:10.1146/annurev- 2024

[8] [8]

Verifiable Benchmarking of Long-Horizon Spatial Biology

Diks, I., Muralidharan, H., Proctor, T. & Workman, K. Verifi- able Benchmarking of Long-Horizon Spatial Biology.arXiv 2605.28065 (2026). doi: 10.48550/arXiv.2605.28065

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2605.28065 2026

[9] [9]

Workman, K., Yang, Z., Muralidharan, H. & Le, H. Spatial- Bench: Can Agents Analyze Real-World Spatial Biology Data? arXiv2512.21907 (2025). doi: 10.48550/arXiv.2512.21907

work page doi:10.48550/arxiv.2512.21907 2025

[10] [10]

Workman, K., Yang, Z., Muralidharan, H., Abdulali, A. & Le, H. scBench: Evaluating AI Agents on Single-Cell RNA-seq Analy- sis.arXiv2602.09063 (2026). doi: 10.48550/arXiv.2602.09063

work page doi:10.48550/arxiv.2602.09063 2026

[11] [11]

Laurent, J. M. et al. LAB-Bench: Measuring Capabilities of Lan- guage Models for Biology Research.arXiv2407.10362 (2024). doi: 10.48550/arXiv.2407.10362

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.10362 2024

[12] [12]

Mitchener, L. et al. BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology.arXiv2503.00096 (2025). doi: 10.48550/arXiv.2503.00096

work page doi:10.48550/arxiv.2503.00096 2025

[13] [13]

Qu, Y. et al. BiomniBench: Process-level Evaluation of LLM Agents for Real-world Biomedical Research.bioRxiv(2026). doi: 10.64898/2026.05.12.724604

work page doi:10.64898/2026.05.12.724604 2026

[14] [14]

Li, J. & Ho, A. GeneBench: Assessing AI Agents for Multi-Stage Inference Problems in Genomics and Quantitative Biology. bioRxiv(2026). doi: 10.64898/2026.04.22.720113

work page doi:10.64898/2026.04.22.720113 2026

[15] [15]

Yu, C. et al. High-throughput identification of genotype- specific cancer vulnerabilities in mixtures of barcoded tu- mor cell lines.Nature Biotechnology34, 419–423 (2016). doi: 10.1038/nbt.3460

work page doi:10.1038/nbt.3460 2016

[16] [16]

Corsello, S. M. et al. Discovering the anticancer potential of non-oncology drugs by systematic viability profiling.Nature Cancer1, 235–248 (2020). doi: 10.1038/s43018-019-0018-6

work page doi:10.1038/s43018-019-0018-6 2020

[17] [17]

& Sorger, P

Hafner, M., Niepel, M., Chung, M. & Sorger, P. K. Growth rate inhibition metrics correct for confounders in measuring sensi- tivity to cancer drugs.Nature Methods13, 521–527 (2016). doi: 10.1038/nmeth.3853

work page doi:10.1038/nmeth.3853 2016

[18] [18]

Meyers, R. M. et al. Computational correction of copy-number effect improves specificity of CRISPR-Cas9 essentiality screens in cancer cells.Nature Genetics49, 1779–1784 (2017). doi: 10.1038/ng.3984

work page doi:10.1038/ng.3984 2017

[19] [19]

Pacini, C. et al. Integrated cross-study datasets of genetic de- pendencies in cancer.Nature Communications12, 1661 (2021). doi: 10.1038/s41467-021-21898-7

work page doi:10.1038/s41467-021-21898-7 2021

[20] [20]

Subramanian, A. et al. A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles.Cell171, 1437– 1452.e17 (2017). doi: 10.1016/j.cell.2017.10.049

work page doi:10.1016/j.cell.2017.10.049 2017

[21] [21]

Thompson, A. et al. Tandem mass tags: a novel quantification strategy for comparative analysis of complex protein mixtures by MS/MS.Analytical Chemistry75, 1895–1904 (2003). doi: 10.1021/ac0262560

work page doi:10.1021/ac0262560 1904

[22] [22]

Gerritsen, J. S. & White, F. M. Phosphoproteomics: a valu- able tool for uncovering molecular signaling in cancer cells.Expert Review of Proteomics18, 661–674 (2021). doi: 10.1080/14789450.2021.1976152

work page doi:10.1080/14789450.2021.1976152 2021

[23] [23]

Bantscheff, M. et al. Quantitative chemical proteomics reveals mechanisms of action of clinical ABL kinase inhibitors.Nature Biotechnology25, 1035–1044 (2007). doi: 10.1038/nbt1328

work page doi:10.1038/nbt1328 2007

[24] [24]

Klaeger, S. et al. The target landscape of clinical kinase drugs. Science358, eaan4368 (2017). doi: 10.1126/science.aan4368

work page doi:10.1126/science.aan4368 2017

[25] [25]

Martinez Molina, D. et al. Monitoring drug target engagement in cells and tissues using the cellular thermal shift assay.Science 341, 84–87 (2013). doi: 10.1126/science.1233606

work page doi:10.1126/science.1233606 2013

[26] [26]

Savitski, M. M. et al. Tracking cancer drugs in living cells by thermal profiling of the proteome.Science346, 1255784 (2014). doi: 10.1126/science.1255784. 11

work page doi:10.1126/science.1255784 2014

[27] [27]

Vasta, J. D. et al. Quantitative, wide-spectrum kinase profiling in live cells for assessing the effect of cellular ATP on target engagement.Cell Chemical Biology25, 206–214.e11 (2018). doi: 10.1016/j.chembiol.2017.10.010

work page doi:10.1016/j.chembiol.2017.10.010 2018

[28] [28]

Nelson, M. R. et al. The support of human genetic evidence for approved drug indications.Nature Genetics47, 856–860 (2015). doi: 10.1038/ng.3314

work page doi:10.1038/ng.3314 2015

[29] [29]

Minikel, E. V. et al. Refining the impact of genetic evi- dence on clinical success.Nature629, 624–629 (2024). doi: 10.1038/s41586-024-07316-0

work page doi:10.1038/s41586-024-07316-0 2024

[30] [30]

Ioannidis, J. P. A. Why most published research findings are false.PLOS Medicine2, e124 (2005). doi: 10.1371/jour- nal.pmed.0020124

work page doi:10.1371/jour- 2005

[31] [31]

& Asadullah, K

Prinz, F., Schlange, T. & Asadullah, K. Believe it or not: how much can we rely on published data on potential drug tar- gets?Nature Reviews Drug Discovery10, 712 (2011). doi: 10.1038/nrd3439-c1

work page doi:10.1038/nrd3439-c1 2011

[32] [32]

Begley, C. G. & Ellis, L. M. Raise standards for preclinical can- cer research.Nature483, 531–533 (2012). doi: 10.1038/483531a

work page doi:10.1038/483531a 2012

[33] [33]

M., Denis, A., Perfito, N., Iorns, E

Errington, T. M., Denis, A., Perfito, N., Iorns, E. & Nosek, B. A. Reproducibility in Cancer Biology: Challenges for assess- ing replicability in preclinical cancer biology.eLife10, e67995 (2021). doi: 10.7554/eLife.67995

work page doi:10.7554/elife.67995 2021

[34] [34]

Redfern, W. S. et al. Relationships between preclinical cardiac electrophysiology, clinical QT interval prolongation and tor- sade de pointes for a broad range of drugs: evidence for a provisional safety margin in drug development.Cardiovascular Research58, 32–45 (2003). doi: 10.1016/S0008-6363(02)00846-5

work page doi:10.1016/s0008-6363(02)00846-5 2003

[35] [35]

ICH S7B: The non- clinical evaluation of the potential for delayed ventricular repolarization (QT interval prolongation) by human pharma- ceuticals

International Council for Harmonisation. ICH S7B: The non- clinical evaluation of the potential for delayed ventricular repolarization (QT interval prolongation) by human pharma- ceuticals. ICH Guideline (2005)

2005

[36] [36]

Gintant, G., Sager, P. T. & Stockbridge, N. Evolution of strate- gies to improve preclinical cardiac safety testing.Nature Reviews Drug Discovery15, 457–471 (2016). doi: 10.1038/nrd.2015.34

work page doi:10.1038/nrd.2015.34 2016

[37] [37]

Webborn, P. J. H., Beaumont, K., Martin, I. J. & Smith, D. A. Free Drug Concepts: A Lingering Problem in Drug Discov- ery.Journal of Medicinal Chemistry68, 6850–6856 (2025). doi: 10.1021/acs.jmedchem.5c00725

work page doi:10.1021/acs.jmedchem.5c00725 2025

[38] [38]

Zhang, D. et al. Drug concentration asymmetry in tissues and plasma for small molecule-related therapeutic modalities. Drug Metabolism and Disposition47, 1122–1135 (2019). doi: 10.1124/dmd.119.086744

work page doi:10.1124/dmd.119.086744 2019

[39] [39]

Huang, R. et al. Modelling the Tox21 10K chemical pro- files for in vivo toxicity prediction and mechanism char- acterization.Nature Communications7, 10425 (2016). doi: 10.1038/ncomms10425

work page doi:10.1038/ncomms10425 2016

[40] [40]

Igarashi, Y. et al. Open TG-GATEs: a large-scale toxicogenomics database.Nucleic Acids Research43, D921–D927 (2015). doi: 10.1093/nar/gku955

work page doi:10.1093/nar/gku955 2015

[41] [41]

& Hahn, M

Rogers, D. & Hahn, M. Extended-connectivity fingerprints.Jour- nal of Chemical Information and Modeling50, 742–754 (2010). doi: 10.1021/ci100050t

work page doi:10.1021/ci100050t 2010

[42] [42]

Bemis, G. W. & Murcko, M. A. The properties of known drugs

[43] [43]

doi: 10.1021/jm9602928

Molecular frameworks.Journal of Medicinal Chemistry39, 2887–2893 (1996). doi: 10.1021/jm9602928. 12

work page doi:10.1021/jm9602928 1996