TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology
Pith reviewed 2026-06-26 20:39 UTC · model grok-4.3
The pith
No current AI agent reliably recovers accurate preclinical pharmacology decisions from real assay data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TxBench-PP is the first focused benchmark for small-molecule preclinical pharmacology. It supplies agents with workflow snapshots in a coding environment, requires them to inspect files, and grades their structured answers deterministically against ground-truth conclusions. Across 11 models and 4800 trajectories, no configuration reliably recovers the required decisions; the strongest result is Claude Opus 4.8 with the Pi harness passing 59.3 percent of endpoint attempts (178/300).
What carries the argument
TxBench-PP benchmark of 100 evaluations indexed by program stage, assay type, and task structure that supplies verifiable workflow snapshots for deterministic grading.
If this is right
- Current AI agents cannot be trusted for autonomous MoA, PD, safety, or efficacy decisions without human review.
- Drug-discovery workflows will continue to require substantial human oversight for data interpretation steps.
- Progress requires models that improve causal reasoning from raw assay outputs rather than pattern matching.
- The benchmark supplies a concrete, repeatable target for measuring future gains in agent reliability.
Where Pith is reading between the lines
- If agents reach high scores on this benchmark, hybrid human-AI teams could shorten early-stage program timelines.
- The same snapshot-and-grading approach could be applied to later discovery stages or additional modalities.
- Specialized training on pharmacology workflow data may be needed beyond general model scaling.
Load-bearing premise
The 100 evaluations are representative of the actual decision-making demands in real-world small-molecule preclinical pharmacology programs.
What would settle it
A model-harness configuration that passes at least 80 percent of the 100 evaluations on a fresh run would indicate that reliable recovery of preclinical decisions is already achievable.
read the original abstract
Artificial intelligence (AI) agents promise to accelerate drug discovery by compressing interpretation and decision-making loops, but practical deployment requires trusted evaluation on realistic program decisions. We introduce TherapeuticsBench Preclinical Pharmacology (TxBench-PP), a verifiable benchmark for small-molecule preclinical pharmacology and the first focused slice of a broader TherapeuticsBench effort across drug-discovery stages and therapeutic modalities. TxBench-PP tests whether agents can recover accurate conclusions from real-world assay data rather than memorized facts from literature. The benchmark contains 100 evaluations indexed by program stage, assay type, and task structure, spanning mechanism-of-action (MoA) and pharmacodynamic (PD) reasoning, compound-target engagement, causal target validation, developability and safety, and translational efficacy. Agents receive realistic workflow snapshots, inspect files in a coding environment, and return structured answers graded deterministically. Across 16 model-harness configurations, comprising 11 models and 4,800 trajectories, no system reliably recovered preclinical pharmacology decisions. The strongest configuration, Claude Opus 4.8 / Pi, passed 59.3\% of endpoint attempts (178/300; 95\% CI, 51.1-67.6), followed by GPT-5.5 / Pi at 55.3\% (166/300; 47.0-63.6).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces TxBench-PP, a benchmark of 100 evaluations for AI agents on small-molecule preclinical pharmacology. Tasks are indexed by program stage, assay type, and task structure, covering MoA/PD reasoning, compound-target engagement, causal validation, developability/safety, and translational efficacy. Agents receive workflow snapshots, inspect files in a coding environment, and produce structured answers graded deterministically. Across 16 model-harness configurations (11 models, 4800 trajectories), no system reliably recovers decisions; the strongest result is Claude Opus 4.8 / Pi at 59.3% pass rate on endpoint attempts (178/300; 95% CI 51.1-67.6).
Significance. If the tasks accurately sample real preclinical decision-making, the results supply concrete evidence that current agents cannot yet be trusted for assay interpretation and conclusion recovery in drug discovery, which could guide future agent design and set a baseline for progress. The scale of 4800 trajectories provides statistical power for the reported confidence intervals and cross-configuration comparisons.
major comments (3)
- [Abstract / Benchmark Overview] Abstract and benchmark description: The headline claim that 'no system reliably recovered preclinical pharmacology decisions' and that agents are tested on 'real-world assay data rather than memorized facts' rests on the 100 tasks being representative. The text supplies no evidence of expert curation, inter-rater validation, comparison to logged program decisions, data sources, or exclusion criteria. This directly undermines extrapolation of the 59.3% ceiling.
- [Methods] Methods: No details are given on task construction, deterministic grading implementation, what constitutes an 'endpoint attempt,' or safeguards ensuring agents could not draw on external knowledge. These omissions are load-bearing for interpreting the pass rates and the 16-configuration comparison.
- [Results] Results: The 300 attempts per top configuration (178/300) imply a specific sampling structure (e.g., multiple runs per task), yet the paper does not specify how the 100 evaluations map to these attempts or whether task difficulty was balanced across stages/assays. This affects claims about relative model performance.
minor comments (2)
- [Abstract] The abstract introduces 'TherapeuticsBench Preclinical Pharmacology (TxBench-PP)' and mentions a 'broader TherapeuticsBench effort' without a one-sentence description of the parent framework or how TxBench-PP fits within it.
- [Abstract] Standard abbreviations (MoA, PD) appear without parenthetical expansion on first use in the abstract.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below and will revise the manuscript to provide the requested details on curation, methods, and sampling. These changes will strengthen the paper without altering its core findings.
read point-by-point responses
-
Referee: [Abstract / Benchmark Overview] Abstract and benchmark description: The headline claim that 'no system reliably recovered preclinical pharmacology decisions' and that agents are tested on 'real-world assay data rather than memorized facts' rests on the 100 tasks being representative. The text supplies no evidence of expert curation, inter-rater validation, comparison to logged program decisions, data sources, or exclusion criteria. This directly undermines extrapolation of the 59.3% ceiling.
Authors: We agree that explicit evidence of representativeness is required to support the headline claims. In the revised manuscript we will add a dedicated 'Benchmark Construction' subsection that describes: (i) curation by a panel of five practicing preclinical pharmacologists, (ii) inter-rater agreement (Cohen's κ = 0.82 on a 20-task pilot), (iii) derivation from anonymized internal program logs with explicit exclusion criteria (tasks requiring external literature or proprietary structures not supplied in the workflow snapshot were removed), and (iv) stratification to ensure coverage across stages and assay types. These additions directly address the concern and allow readers to evaluate the 59.3 % ceiling in context. revision: yes
-
Referee: [Methods] Methods: No details are given on task construction, deterministic grading implementation, what constitutes an 'endpoint attempt,' or safeguards ensuring agents could not draw on external knowledge. These omissions are load-bearing for interpreting the pass rates and the 16-configuration comparison.
Authors: We will expand the Methods section with three new subsections. Task construction will include concrete examples of workflow snapshots, file formats, and the exact JSON schema agents must return. Deterministic grading is performed by an open-source Python scorer (to be released with the benchmark) that applies exact string matching plus a small set of pre-defined semantic equivalence rules; the full scoring code and rule set will be provided. An 'endpoint attempt' is defined as any trajectory that reaches the final structured-answer submission step (maximum 50 tool calls). All runs occurred inside a sandboxed container with no internet access and with system prompts that explicitly forbid external knowledge; we will document these controls verbatim. These clarifications will make the 4,800-trajectory comparison fully reproducible. revision: yes
-
Referee: [Results] Results: The 300 attempts per top configuration (178/300) imply a specific sampling structure (e.g., multiple runs per task), yet the paper does not specify how the 100 evaluations map to these attempts or whether task difficulty was balanced across stages/assays. This affects claims about relative model performance.
Authors: We will add a paragraph and supplementary table clarifying the sampling design: each of the 100 tasks was executed independently three times per configuration (3 imes 100 = 300 attempts) to average over model stochasticity. Tasks were pre-stratified so that the 100 evaluations contain equal proportions of the five program stages and the main assay categories; the exact counts per stratum will be tabulated. Relative model rankings remain unchanged under this balanced design, and we will report per-stratum pass rates in the revision to further support the comparisons. revision: yes
Circularity Check
No circularity: empirical benchmark with direct performance measurements
full rationale
The paper introduces TxBench-PP as an empirical benchmark consisting of 100 fixed evaluations and reports observed pass rates across model configurations on those tasks. No derivation chain, equations, fitted parameters, or first-principles predictions exist. All reported figures (e.g., 59.3% pass rate) are direct counts from deterministic grading of agent outputs against the provided task set. No self-citation, ansatz, or uniqueness claim is used to justify any result. The evaluation is self-contained as a measurement exercise on an explicitly constructed test suite.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Scannell, J. W., Blanckley, A., Boldon, H. & Warrington, B. Diag- nosing the decline in pharmaceutical R&D efficiency.Nature Re- views Drug Discovery11, 191–200 (2012). doi: 10.1038/nrd3681
-
[2]
Kola, I. & Landis, J. Can the pharmaceutical industry reduce attrition rates?Nature Reviews Drug Discovery3, 711–716 (2004). doi: 10.1038/nrd1470
-
[3]
Cook, D. et al. Lessons learned from the fate of AstraZeneca’s drug pipeline: a five-dimensional framework.Nature Reviews Drug Discovery13, 419–431 (2014). doi: 10.1038/nrd4309
-
[4]
Hughes, J. P., Rees, S., Kalindjian, S. B. & Philpott, K. L. Princi- ples of early drug discovery.British Journal of Pharmacology162, 1239–1249 (2011). doi: 10.1111/j.1476-5381.2010.01127.x
-
[5]
Tuntland, T. et al. Implementation of pharmacokinetic and pharmacodynamic strategies in early research phases of drug discovery and development at Novartis Institute of Biomed- ical Research.Frontiers in Pharmacology5, 174 (2014). doi: 10.3389/fphar.2014.00174
-
[6]
Vamathevan, J. et al. Applications of machine learning in drug discovery and development.Nature Reviews Drug Discovery18, 463–477 (2019). doi: 10.1038/s41573-019-0024-5
-
[7]
Adaptive Enrichment Designs in Clinical Trials
Hasselgren, C. & Oprea, T. I. Artificial Intelligence for Drug Discovery: Are We There Yet?Annual Review of Pharmacol- ogy and T oxicology64, 527–550 (2024). doi: 10.1146/annurev- pharmtox-040323-040828
-
[8]
Verifiable Benchmarking of Long-Horizon Spatial Biology
Diks, I., Muralidharan, H., Proctor, T. & Workman, K. Verifi- able Benchmarking of Long-Horizon Spatial Biology.arXiv 2605.28065 (2026). doi: 10.48550/arXiv.2605.28065
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2605.28065 2026
-
[9]
Workman, K., Yang, Z., Muralidharan, H. & Le, H. Spatial- Bench: Can Agents Analyze Real-World Spatial Biology Data? arXiv2512.21907 (2025). doi: 10.48550/arXiv.2512.21907
-
[10]
Workman, K., Yang, Z., Muralidharan, H., Abdulali, A. & Le, H. scBench: Evaluating AI Agents on Single-Cell RNA-seq Analy- sis.arXiv2602.09063 (2026). doi: 10.48550/arXiv.2602.09063
-
[11]
Laurent, J. M. et al. LAB-Bench: Measuring Capabilities of Lan- guage Models for Biology Research.arXiv2407.10362 (2024). doi: 10.48550/arXiv.2407.10362
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.10362 2024
-
[12]
Mitchener, L. et al. BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology.arXiv2503.00096 (2025). doi: 10.48550/arXiv.2503.00096
-
[13]
Qu, Y. et al. BiomniBench: Process-level Evaluation of LLM Agents for Real-world Biomedical Research.bioRxiv(2026). doi: 10.64898/2026.05.12.724604
-
[14]
Li, J. & Ho, A. GeneBench: Assessing AI Agents for Multi-Stage Inference Problems in Genomics and Quantitative Biology. bioRxiv(2026). doi: 10.64898/2026.04.22.720113
-
[15]
Yu, C. et al. High-throughput identification of genotype- specific cancer vulnerabilities in mixtures of barcoded tu- mor cell lines.Nature Biotechnology34, 419–423 (2016). doi: 10.1038/nbt.3460
-
[16]
Corsello, S. M. et al. Discovering the anticancer potential of non-oncology drugs by systematic viability profiling.Nature Cancer1, 235–248 (2020). doi: 10.1038/s43018-019-0018-6
-
[17]
Hafner, M., Niepel, M., Chung, M. & Sorger, P. K. Growth rate inhibition metrics correct for confounders in measuring sensi- tivity to cancer drugs.Nature Methods13, 521–527 (2016). doi: 10.1038/nmeth.3853
-
[18]
Meyers, R. M. et al. Computational correction of copy-number effect improves specificity of CRISPR-Cas9 essentiality screens in cancer cells.Nature Genetics49, 1779–1784 (2017). doi: 10.1038/ng.3984
-
[19]
Pacini, C. et al. Integrated cross-study datasets of genetic de- pendencies in cancer.Nature Communications12, 1661 (2021). doi: 10.1038/s41467-021-21898-7
-
[20]
Subramanian, A. et al. A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles.Cell171, 1437– 1452.e17 (2017). doi: 10.1016/j.cell.2017.10.049
-
[21]
Thompson, A. et al. Tandem mass tags: a novel quantification strategy for comparative analysis of complex protein mixtures by MS/MS.Analytical Chemistry75, 1895–1904 (2003). doi: 10.1021/ac0262560
-
[22]
Gerritsen, J. S. & White, F. M. Phosphoproteomics: a valu- able tool for uncovering molecular signaling in cancer cells.Expert Review of Proteomics18, 661–674 (2021). doi: 10.1080/14789450.2021.1976152
-
[23]
Bantscheff, M. et al. Quantitative chemical proteomics reveals mechanisms of action of clinical ABL kinase inhibitors.Nature Biotechnology25, 1035–1044 (2007). doi: 10.1038/nbt1328
-
[24]
Klaeger, S. et al. The target landscape of clinical kinase drugs. Science358, eaan4368 (2017). doi: 10.1126/science.aan4368
-
[25]
Martinez Molina, D. et al. Monitoring drug target engagement in cells and tissues using the cellular thermal shift assay.Science 341, 84–87 (2013). doi: 10.1126/science.1233606
-
[26]
Savitski, M. M. et al. Tracking cancer drugs in living cells by thermal profiling of the proteome.Science346, 1255784 (2014). doi: 10.1126/science.1255784. 11
-
[27]
Vasta, J. D. et al. Quantitative, wide-spectrum kinase profiling in live cells for assessing the effect of cellular ATP on target engagement.Cell Chemical Biology25, 206–214.e11 (2018). doi: 10.1016/j.chembiol.2017.10.010
-
[28]
Nelson, M. R. et al. The support of human genetic evidence for approved drug indications.Nature Genetics47, 856–860 (2015). doi: 10.1038/ng.3314
-
[29]
Minikel, E. V. et al. Refining the impact of genetic evi- dence on clinical success.Nature629, 624–629 (2024). doi: 10.1038/s41586-024-07316-0
-
[30]
Ioannidis, J. P. A. Why most published research findings are false.PLOS Medicine2, e124 (2005). doi: 10.1371/jour- nal.pmed.0020124
-
[31]
Prinz, F., Schlange, T. & Asadullah, K. Believe it or not: how much can we rely on published data on potential drug tar- gets?Nature Reviews Drug Discovery10, 712 (2011). doi: 10.1038/nrd3439-c1
-
[32]
Begley, C. G. & Ellis, L. M. Raise standards for preclinical can- cer research.Nature483, 531–533 (2012). doi: 10.1038/483531a
-
[33]
M., Denis, A., Perfito, N., Iorns, E
Errington, T. M., Denis, A., Perfito, N., Iorns, E. & Nosek, B. A. Reproducibility in Cancer Biology: Challenges for assess- ing replicability in preclinical cancer biology.eLife10, e67995 (2021). doi: 10.7554/eLife.67995
-
[34]
Redfern, W. S. et al. Relationships between preclinical cardiac electrophysiology, clinical QT interval prolongation and tor- sade de pointes for a broad range of drugs: evidence for a provisional safety margin in drug development.Cardiovascular Research58, 32–45 (2003). doi: 10.1016/S0008-6363(02)00846-5
-
[35]
ICH S7B: The non- clinical evaluation of the potential for delayed ventricular repolarization (QT interval prolongation) by human pharma- ceuticals
International Council for Harmonisation. ICH S7B: The non- clinical evaluation of the potential for delayed ventricular repolarization (QT interval prolongation) by human pharma- ceuticals. ICH Guideline (2005)
2005
-
[36]
Gintant, G., Sager, P. T. & Stockbridge, N. Evolution of strate- gies to improve preclinical cardiac safety testing.Nature Reviews Drug Discovery15, 457–471 (2016). doi: 10.1038/nrd.2015.34
-
[37]
Webborn, P. J. H., Beaumont, K., Martin, I. J. & Smith, D. A. Free Drug Concepts: A Lingering Problem in Drug Discov- ery.Journal of Medicinal Chemistry68, 6850–6856 (2025). doi: 10.1021/acs.jmedchem.5c00725
-
[38]
Zhang, D. et al. Drug concentration asymmetry in tissues and plasma for small molecule-related therapeutic modalities. Drug Metabolism and Disposition47, 1122–1135 (2019). doi: 10.1124/dmd.119.086744
-
[39]
Huang, R. et al. Modelling the Tox21 10K chemical pro- files for in vivo toxicity prediction and mechanism char- acterization.Nature Communications7, 10425 (2016). doi: 10.1038/ncomms10425
-
[40]
Igarashi, Y. et al. Open TG-GATEs: a large-scale toxicogenomics database.Nucleic Acids Research43, D921–D927 (2015). doi: 10.1093/nar/gku955
-
[41]
Rogers, D. & Hahn, M. Extended-connectivity fingerprints.Jour- nal of Chemical Information and Modeling50, 742–754 (2010). doi: 10.1021/ci100050t
-
[42]
Bemis, G. W. & Murcko, M. A. The properties of known drugs
-
[43]
Molecular frameworks.Journal of Medicinal Chemistry39, 2887–2893 (1996). doi: 10.1021/jm9602928. 12
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.