
arxiv: 2605.28065 · v1 · pith:KKSP7VC7new · submitted 2026-05-27 · 💻 cs.AI

Verifiable Benchmarking of Long-Horizon Spatial Biology

classification 💻 cs.AI
keywords analysis · spatial · spatialbench-long · agents · biological · data · across · adenocarcinoma

AI agents are increasingly useful for biological data analysis, but existing benchmarks mostly test broad biological knowledge, executable workflows, or localized analysis steps rather than end-to-end scientific reasoning over spatial measurements. We introduce SpatialBench-Long, a benchmark for long-horizon spatial biology in which agents must recover biological claims from raw or near-raw data and calibrated experimental context without prescribed methods. SpatialBench-Long contains 24 evaluations across primary pancreatic ductal adenocarcinoma (PDAC), engineered glioblastoma organoids and in vivo tumors, Cas9 lineage-traced lung adenocarcinoma, and mouse optic nerve aging/intervention systems, spanning CosMx, Visium, Xenium, multiplexed error-robust fluorescence in situ hybridization (MERFISH), single-cell RNA sequencing (scRNA-seq), Slide-seq, Slide-tags, histology, and lineage-recording data. Candidate claims are hardened through reproduction, independent scientist review, and trajectory inspection. Final answers are graded deterministically over controlled vocabularies and symbols with companion rubrics capturing progress through key analysis chokepoints. Across the SpatialBench-Long benchmark, three model-harness pairs tie at 8/72 runs (11.1%): Gemini 3.5 Flash / Pi terminal coding harness, GPT-5.5 / Pi, and GPT-5.5 / OpenAI Codex. SpatialBench-Long tests whether agents can move beyond executing procedural analysis to deriving accurate scientific conclusions from complex spatial measurements.
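
As a rough illustration of what deterministic grading over a controlled vocabulary could look like, and of how the reported 8/72 figure maps to 11.1%, here is a minimal Python sketch. It is not the benchmark's grader: the function names, the normalization rule, and the example labels are all hypothetical. The total of 72 runs is consistent with 24 evaluations run three times each, though the abstract does not state the number of runs per evaluation.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not matter."""
    return " ".join(answer.strip().lower().split())


def grade(answer: str, expected: str, vocabulary: set[str]) -> bool:
    """Pass only if the answer is a vocabulary term and matches the expected term exactly.

    Hypothetical rule: no partial credit and no fuzzy matching, so the same
    answer always receives the same grade.
    """
    candidate = normalize(answer)
    if candidate not in vocabulary:
        return False  # answers outside the controlled vocabulary always fail
    return candidate == normalize(expected)


def pass_rate(passed: int, total: int) -> float:
    """Percentage of runs that passed."""
    return 100.0 * passed / total


# Hypothetical vocabulary and answers, for illustration only.
vocab = {"label a", "label b", "label c"}
print(grade("  Label  B ", "label b", vocab))   # matches after normalization
print(grade("probably label b", "label b", vocab))  # free text is outside the vocabulary
print(round(pass_rate(8, 72), 1))               # the 8/72 figure from the abstract
```

Restricting answers to a closed vocabulary is what makes exact-match grading workable here: the grader never has to judge whether a free-text paraphrase is close enough.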



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. scBench-Long: Verifiable Benchmarking of Long-Horizon Single-Cell Biology

    q-bio.GN 2026-06 unverdicted novelty 6.0

    scBench-Long is a benchmark with 21 evaluations where the strongest AI model-harness pair succeeds on 25.4% of long-horizon single-cell biology tasks.

  2. TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology

    cs.AI 2026-06 unverdicted novelty 6.0

    TxBench-PP benchmark shows leading AI agents achieve at most 59% success on tasks requiring recovery of preclinical pharmacology conclusions from assay data.