
arxiv: 2606.26563 · v1 · pith:JZJ3UO3Lnew · submitted 2026-06-25 · 🧬 q-bio.GN · cs.AI

scBench-Long: Verifiable Benchmarking of Long-Horizon Single-Cell Biology

Pith reviewed 2026-06-26 02:10 UTC · model grok-4.3

classification 🧬 q-bio.GN cs.AI
keywords single-cell biology · AI agents · long-horizon tasks · verifiable benchmarks · scientific claims · multi-step workflows · raw data analysis · deterministic grading

The pith

On long-horizon single-cell tasks, the strongest AI agent recovers the target scientific claims from raw data in only about 25 percent of runs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents scBench-Long as a benchmark that requires agents to produce specific biological conclusions from raw or near-raw single-cell measurements without given methods. It spans 21 evaluations covering melanoma T-cell reactivity, cross-species development, lung tumor aging, and COVID-19 pathology, using paired sequencing, chromatin profiling, immune repertoires, and validation resources. Results from 1,068 trajectories show the strongest model-harness pair succeeds on 16 of 63 runs. This setup tests whether agents can integrate data, metadata, and auxiliary evidence into supported claims over extended workflows rather than isolated steps. The low pass rate indicates that current systems remain limited in turning measurements into complex, verifiable biology statements.

Core claim

scBench-Long contains 21 evaluations in which agents must recover scientific conclusions from raw single-cell data across melanoma CD8 T-cell reactivity, CD8 RNA+ATAC regulatory inference, human-monkey chimera development, KRAS-driven lung tumor aging, and lethal COVID-19 lung pathology. Tasks draw on paired scRNA/TCR sequencing, RNA and chromatin profiling, cross-species transcriptomics, combinatorial scRNA-seq, single-nucleus RNA-seq, immune repertoires, ortholog maps, ligand-receptor resources, and validation evidence. Candidate claims are reproduced, reviewed, and turned into controlled answer vocabularies with deterministic grading and trajectory rubrics. Across 1,068 completed trajectories, the strongest model-harness pair passes 16/63 runs (25.4%).

What carries the argument

scBench-Long benchmark of 21 evaluations that convert reproduced claims into controlled vocabularies with deterministic grading and trajectory rubrics for long-horizon single-cell workflows.
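
To illustrate what deterministic grading against a controlled answer vocabulary can look like, here is a minimal sketch. The schema (`VocabularyItem`), the `normalize` and `grade` helpers, and the example labels are all hypothetical; the paper's actual rubric format is not shown on this page.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class VocabularyItem:
    """One gradable question with a closed answer set (hypothetical schema)."""
    question_id: str
    allowed: FrozenSet[str]  # canonical labels the agent may choose from
    correct: str             # the reproduced claim, expressed as a canonical label


def normalize(answer: str) -> str:
    """Canonicalize case and whitespace so grading ignores formatting."""
    return " ".join(answer.strip().lower().split())


def grade(item: VocabularyItem, answer: str) -> bool:
    """Pass only an in-vocabulary answer that equals the expected label."""
    label = normalize(answer)
    if label not in item.allowed:
        return False  # out-of-vocabulary answers always fail
    return label == item.correct


# Illustrative item; these labels are invented for the sketch, not taken from the paper.
item = VocabularyItem(
    question_id="example-direction",
    allowed=frozenset({"increased", "decreased", "unchanged"}),
    correct="decreased",
)

print(grade(item, "  Decreased "))  # formatting differences are tolerated
print(grade(item, "lower"))         # free-text synonyms are rejected in this sketch
```

Because the answer space is closed and the comparison is exact after normalization, the same answer always receives the same grade, which is the property the word "deterministic" points at. Whether synonyms should be mapped to canonical labels is a design choice this sketch leaves out.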

If this is right

  • Agents must improve at chaining raw data processing with metadata integration and auxiliary evidence to reach supported claims.
  • Existing benchmarks that test only broad knowledge or local steps miss the full workflow demands of single-cell studies.
  • The controlled vocabularies and rubrics provide a repeatable way to track whether future agents improve on the 25 percent pass rate.
  • Tasks that combine sequencing modalities, cross-species mapping, and validation evidence set concrete targets for multi-step scientific reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark could be extended by adding new evaluations that require agents to handle noisy or incomplete metadata.
  • Low success rates suggest current systems may still need human oversight for turning single-cell measurements into publishable claims.
  • Similar long-horizon setups could be applied to other data-rich fields such as spatial transcriptomics or proteomics to test transfer of the approach.

Load-bearing premise

The 21 evaluations accurately represent the multi-step integration of raw data, metadata, and auxiliary evidence that real single-cell biology research requires without prescribed methods.

What would settle it

A model-harness pair that passes more than half of the 63 runs on the 21 evaluations would show the reported performance gap is not general.
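
The threshold can be made concrete with simple arithmetic on the reported counts. The sketch below assumes the 63 runs are spread evenly over the 21 evaluations, which this page does not state explicitly; the variable names are illustrative.

```python
# Counts reported for the strongest model-harness pair.
passed_runs = 16
total_runs = 63
evaluations = 21

pass_rate_percent = round(100 * passed_runs / total_runs, 1)

# Assumption: runs are spread evenly across evaluations (not stated on this page).
runs_per_evaluation = total_runs / evaluations

# Smallest number of passing runs that is strictly more than half of all runs.
runs_to_exceed_half = total_runs // 2 + 1
additional_passes_needed = runs_to_exceed_half - passed_runs

print(pass_rate_percent)         # 25.4
print(runs_per_evaluation)       # 3.0
print(runs_to_exceed_half)       # 32
print(additional_passes_needed)  # 16
```

On these numbers a pair would need 32 passing runs, twice the reported 16, to cross the halfway mark.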

Original abstract

Single-cell studies require analysts to convert raw measurements into specific biological claims through multi-step workflows and integration of metadata, assay context, and auxiliary evidence. Existing AI-biology benchmarks largely measure broad knowledge, executable workflows, or local analysis steps. We introduce scBench-Long, a benchmark for long-horizon single-cell biology in which agents must recover scientific conclusions from raw or near-raw data without prescribed methods. The benchmark contains 21 evaluations spanning melanoma CD8 T-cell reactivity, CD8 RNA+ATAC regulatory inference, human-monkey chimera development, KRAS-driven lung tumor aging, and lethal COVID-19 lung pathology. Tasks cover paired scRNA/TCR sequencing, RNA and chromatin profiling, cross-species transcriptomics, combinatorial scRNA-seq, single-nucleus RNA-seq, immune repertoires, ortholog maps, ligand-receptor resources, and validation evidence. Candidate claims are reproduced, reviewed, and converted into controlled answer vocabularies with deterministic grading and trajectory rubrics. Across 1,068 completed trajectories, the strongest model-harness pair passes 16/63 runs (25.4%). scBench-Long evaluates whether agents can move beyond local analysis steps and make complex scientific claims that are supported by single-cell data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces scBench-Long, a benchmark of 21 evaluations spanning melanoma CD8 reactivity, RNA+ATAC inference, chimera development, KRAS lung aging, and COVID pathology. Agents must recover scientific claims from raw or near-raw single-cell data (scRNA/TCR, RNA+ATAC, cross-species, etc.) without prescribed methods; candidate claims are reproduced, reviewed, and converted to controlled vocabularies with deterministic grading. Across 1,068 trajectories the strongest model-harness pair passes 16/63 runs (25.4%).

Significance. If the 21 tasks are shown to require genuine multi-step integration of raw data, metadata, and auxiliary evidence without embedded method guidance or narrow grading, the benchmark would supply a falsifiable, verifiable test of long-horizon biological reasoning that existing knowledge or local-analysis benchmarks do not provide.

major comments (3)
  1. [Task construction / abstract] Task-construction section (and abstract): the claim that tasks require 'unprescribed multi-step integration of raw data, metadata, and auxiliary evidence' rests on the 21 evaluations faithfully reproducing real workflows, yet no expert-validation protocol, inter-reviewer agreement statistics, or explicit comparison demonstrating that the tasks cannot be solved by local steps or that the controlled vocabularies exclude valid alternative conclusions is supplied.
  2. [Results / evaluation protocol] Results and evaluation sections: the headline 16/63 (25.4%) pass rate is reported without baselines for random guessing, purely local analysis strategies, or ablations that remove metadata/auxiliary resources, so it is impossible to attribute the failure rate specifically to the long-horizon requirement rather than other factors.
  3. [Grading / trajectory rubrics] Grading and trajectory-rubric description: deterministic grading is asserted via 'controlled answer vocabularies,' but no concrete examples of rubric items, edge-case handling, or evidence that multiple biologically plausible paths are accepted are given, which directly affects whether the 25.4% figure measures the intended capability.
minor comments (2)
  1. [Abstract / introduction] The abstract and introduction use 'reproduced, reviewed' without citing the source publications or review process for each of the 21 tasks; adding a supplementary table with original references would improve traceability.
  2. [Task overview] Figure or table summarizing the 21 tasks should include columns for data modality, required auxiliary resources, and number of steps, to make the 'long-horizon' claim immediately verifiable.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, indicating where we agree revisions are needed and where we provide additional clarification or defense based on the manuscript content.

Point-by-point responses
  1. Referee: [Task construction / abstract] Task-construction section (and abstract): the claim that tasks require 'unprescribed multi-step integration of raw data, metadata, and auxiliary evidence' rests on the 21 evaluations faithfully reproducing real workflows, yet no expert-validation protocol, inter-reviewer agreement statistics, or explicit comparison demonstrating that the tasks cannot be solved by local steps or that the controlled vocabularies exclude valid alternative conclusions is supplied.

    Authors: The 21 evaluations are derived directly from published single-cell studies (e.g., melanoma CD8 reactivity, KRAS lung aging), with candidate claims extracted from the original papers' reported conclusions rather than invented workflows. This ensures fidelity to real multi-step processes involving raw data, metadata, and auxiliary resources such as ortholog maps and ligand-receptor databases. We acknowledge that the submitted manuscript does not include a formal expert-validation protocol or inter-reviewer agreement statistics. In revision we will expand the Task construction section with a step-by-step account of claim extraction and review by the authoring team, plus explicit examples illustrating why local single-step analysis is insufficient for each task category. The controlled vocabularies are constructed to accept equivalent biological statements while excluding unsupported alternatives; we will add a paragraph clarifying this design and noting that alternative conclusions are only accepted if they align with the original paper's evidence. revision: partial

  2. Referee: [Results / evaluation protocol] Results and evaluation sections: the headline 16/63 (25.4%) pass rate is reported without baselines for random guessing, purely local analysis strategies, or ablations that remove metadata/auxiliary resources, so it is impossible to attribute the failure rate specifically to the long-horizon requirement rather than other factors.

    Authors: We agree that baselines and ablations are necessary to isolate the contribution of long-horizon integration. The current results section reports aggregate pass rates across 1,068 trajectories but does not include these controls. In the revised manuscript we will add (i) a random-guessing baseline computed over the controlled vocabularies, (ii) performance of local-analysis-only agents that receive only one data modality at a time, and (iii) ablation runs that withhold metadata or auxiliary evidence. These additions will allow readers to quantify how much of the shortfall behind the 25.4% pass rate is attributable to the requirement for multi-step reasoning. revision: yes

  3. Referee: [Grading / trajectory rubrics] Grading and trajectory-rubric description: deterministic grading is asserted via 'controlled answer vocabularies,' but no concrete examples of rubric items, edge-case handling, or evidence that multiple biologically plausible paths are accepted are given, which directly affects whether the 25.4% figure measures the intended capability.

    Authors: The manuscript states that candidate claims are converted to controlled vocabularies with deterministic grading and trajectory rubrics, but the submitted version indeed omits concrete rubric excerpts. In revision we will append a new subsection under Grading that provides (a) verbatim examples of rubric items for two representative tasks (e.g., CD8 TCR reactivity and cross-species chimera development), (b) explicit edge-case rules (e.g., how partial matches or synonymous terminology are scored), and (c) a statement that any biologically equivalent conclusion supported by the same evidence is accepted, with an illustration of two distinct but valid reasoning paths that both receive full credit. revision: partial
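
The random-guessing baseline proposed in response 2 could be computed along the following lines. This is a sketch under stated assumptions: each evaluation is treated as a list of questions with a fixed number of vocabulary options, guesses are uniform and independent, and an evaluation passes only if every question is answered correctly. The option counts are hypothetical; the paper's real vocabulary sizes are not given on this page.

```python
from fractions import Fraction
from math import prod
from typing import List


def random_guess_pass_probability(option_counts: List[int]) -> Fraction:
    """Chance that uniform guessing answers every question in one evaluation correctly."""
    return prod((Fraction(1, k) for k in option_counts), start=Fraction(1))


def expected_pass_rate(evaluations: List[List[int]]) -> Fraction:
    """Mean pass probability across evaluations under uniform random guessing."""
    probabilities = [random_guess_pass_probability(e) for e in evaluations]
    return sum(probabilities, Fraction(0)) / len(probabilities)


# Hypothetical vocabulary sizes: three evaluations with 1, 2, and 1 questions.
example = [[3], [4, 2], [5]]

baseline = expected_pass_rate(example)
print(baseline, float(baseline))
```

Comparing an observed pass rate against such a baseline shows how far agent performance sits above chance. If partial credit or multi-label answers are allowed, the per-evaluation probability would need a different formula.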

Circularity Check

0 steps flagged

No circularity: benchmark introduction is self-contained with no derivation chain

Full rationale

The paper presents scBench-Long as a new external benchmark consisting of 21 evaluations drawn from published single-cell studies. It does not derive predictions, fit parameters, or claim mathematical results from its own equations. No steps match the enumerated circularity patterns (self-definitional, fitted-input-as-prediction, self-citation load-bearing, etc.). The central claim—that agents must recover claims from raw data without prescribed methods—is supported by the benchmark's construction details rather than reducing to any internal fit or self-reference. This is the expected non-finding for a benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no information available on free parameters, axioms, or invented entities.


