Verifiable Benchmarking of Long-Horizon Spatial Biology

2026 · cs.AI · arXiv 2605.28065

1 Pith paper cites this work. Polarity classification is still indexing.


abstract

AI agents are increasingly useful for biological data analysis, but existing benchmarks mostly test broad biological knowledge, executable workflows, or localized analysis steps rather than end-to-end scientific reasoning over spatial measurements. We introduce SpatialBench-Long, a benchmark for long-horizon spatial biology in which agents must recover biological claims from raw or near-raw data and calibrated experimental context without prescribed methods. SpatialBench-Long contains 24 evaluations across primary pancreatic ductal adenocarcinoma (PDAC), engineered glioblastoma organoids and in vivo tumors, Cas9 lineage-traced lung adenocarcinoma, and mouse optic nerve aging/intervention systems, spanning CosMx, Visium, Xenium, multiplexed error-robust fluorescence in situ hybridization (MERFISH), single-cell RNA sequencing (scRNA-seq), Slide-seq, Slide-tags, histology, and lineage-recording data. Candidate claims are hardened through reproduction, independent scientist review, and trajectory inspection. Final answers are graded deterministically over controlled vocabularies and symbols with companion rubrics capturing progress through key analysis chokepoints. Across the SpatialBench-Long benchmark, three model-harness pairs tie at 8/72 runs (11.1%): Gemini 3.5 Flash / Pi terminal coding harness, GPT-5.5 / Pi, and GPT-5.5 / OpenAI Codex. SpatialBench-Long tests whether agents can move beyond executing procedural analysis to deriving accurate scientific conclusions from complex spatial measurements.
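
To make "graded deterministically over controlled vocabularies" concrete, here is a minimal sketch of what such a grader could look like, together with the arithmetic behind the reported 8/72 figure. It is not the benchmark's actual grader: the vocabulary terms, the normalization rule, and the function names (`normalize`, `grade`, `pass_rate`) are all hypothetical, chosen only for illustration.

```python
# Hypothetical controlled vocabulary for a single evaluation (not from the paper).
VOCAB = {"increase", "decrease", "no_change"}


def normalize(answer: str) -> str:
    """Lowercase, trim, and turn spaces or hyphens into underscores."""
    return "_".join(answer.strip().lower().replace("-", " ").split())


def grade(answer: str, expected: str, vocab=VOCAB) -> bool:
    """Pass only if the normalized answer is a vocabulary term equal to the key."""
    norm = normalize(answer)
    return norm in vocab and norm == normalize(expected)


def pass_rate(passed: int, total: int) -> str:
    """Format a pass rate as a percentage with one decimal place."""
    return f"{100 * passed / total:.1f}%"


if __name__ == "__main__":
    print(grade(" No-Change ", "no_change"))  # formatting differences are tolerated
    print(grade("maybe", "increase"))         # out-of-vocabulary answers fail
    print(pass_rate(8, 72))                   # the tie score reported in the abstract
```

Because the comparison is an exact match after a fixed normalization, the same answer always receives the same grade, which is the property the abstract describes as deterministic.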

representative citing papers

TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology

cs.AI · 2026-06-17 · unverdicted · novelty 6.0

TxBench-PP benchmark shows leading AI agents achieve at most 59% success on tasks requiring recovery of preclinical pharmacology conclusions from assay data.
