Pith. Machine review for the scientific record.

arxiv: 2604.09554 · v2 · submitted 2026-02-04 · 💻 cs.AI · cs.CL · cs.LG

Recognition: no theorem link

LABBench2: An Improved Benchmark for AI Systems Performing Biology Research

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 07:14 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG
keywords AI benchmark · biology research · language agents · scientific discovery · AI evaluation · autonomous labs · hypothesis generation

The pith

LABBench2 lowers frontier-model accuracy on realistic biology research tasks by 26 to 46 percent relative to LAB-Bench.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LABBench2 as an updated benchmark with nearly 1,900 tasks that evaluate AI systems on realistic biology research activities such as hypothesis generation and experiment planning. It extends the prior LAB-Bench by embedding similar skills in more authentic contexts to better capture the ability to do meaningful scientific work rather than isolated reasoning or recall. Frontier models show clear gains since the original benchmark, yet they suffer consistent accuracy drops of 26 to 46 percent on the new tasks. This gap demonstrates that current systems remain far from reliable performance on the kinds of open-ended, context-rich problems that biologists face daily. The authors release the dataset and evaluation harness to support ongoing tracking and improvement of AI tools for core research functions.
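To make the release concrete, below is a minimal sketch of pulling the task set from the Hugging Face path given in the paper's abstract. The split name, the presence of a `task_family` column, and the other field names are assumptions about the dataset layout, not details confirmed by the paper; check the dataset card before relying on them.

```python
# Minimal sketch: inspect the released LABBench2 tasks from Hugging Face.
# Assumptions (not stated in the paper): the dataset loads with its default
# config, exposes a "train" split, and each record carries a task-family label.
from collections import Counter

from datasets import load_dataset

ds = load_dataset("futurehouse/labbench2", split="train")  # split name assumed

print(f"{len(ds)} tasks loaded")
print("fields:", ds.column_names)

# Tally tasks per family if such a column exists (column name assumed).
if "task_family" in ds.column_names:
    print(Counter(ds["task_family"]).most_common())
```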

Core claim

LABBench2 comprises nearly 1,900 tasks that measure AI performance on realistic biology research capabilities, continuing the approach of LAB-Bench but in more authentic contexts; evaluations of frontier models reveal substantial overall progress alongside model-specific accuracy reductions of 26 to 46 percent across subtasks, indicating persistent limitations in handling useful scientific tasks.
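The headline numbers are simple per-model accuracy differences between the two benchmarks. A minimal sketch of that comparison follows; the scores are illustrative placeholders, not values taken from the paper.

```python
# Per-model accuracy difference between LAB-Bench and LABBench2.
# The scores below are illustrative placeholders, NOT results from the paper.
lab_bench = {"model_a": 0.72, "model_b": 0.65}   # hypothetical LAB-Bench accuracies
labbench2 = {"model_a": 0.41, "model_b": 0.39}   # hypothetical LABBench2 accuracies

for model, old_acc in lab_bench.items():
    delta = labbench2[model] - old_acc
    # The paper reports model-specific differences in the -26% to -46% range.
    print(f"{model}: {delta:+.0%}")
```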

What carries the argument

LABBench2 benchmark of nearly 1,900 tasks placed in realistic biology research contexts, which directly tests the transition from knowledge and reasoning to performing meaningful scientific work.

If this is right

  • AI systems must improve specifically on realistic, multi-step scientific workflows to close the observed performance gaps.
  • Progress on LABBench2 will serve as a continuing signal of readiness for autonomous hypothesis generation and experiment design.
  • Public release of the dataset and harness enables standardized community comparisons and targeted model development.
  • The benchmark highlights the need to shift evaluation focus from isolated reasoning to integrated research functions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the accuracy drops hold across future models, development priorities may shift toward training on simulated or real lab data streams.
  • The benchmark could be extended by adding tasks that require interaction with physical lab equipment or automated protocols.
  • Performance differences across subtasks may point to specific biology domains where current model architectures fall short.

Load-bearing premise

The tasks in LABBench2 accurately capture the capabilities required for real biology research without separate validation against actual lab results or expert performance ratings.

What would settle it

A demonstration that models scoring high on LABBench2 fail to produce successful outcomes in real laboratory experiments or expert-reviewed research projects would show the benchmark does not measure meaningful capabilities.

Figures

Figures reproduced from arXiv: 2604.09554 by Albert Bou, Alex Andonian, Alexandros Sanchez Vassopoulos, Andrew D White, Blake Lash, Conor Igoe, Jacob L Steenwyk, James Braza, Jon M Laurent, Michael Pieler, Samuel G Rodriques, Siddharth Narayanan.

Figure 1
Figure 1: Accuracy score comparison between LAB-Bench and LABBench2 for each of the high-level task families covered by both benchmarks. view at source ↗
Figure 2
Figure 2: Performance comparison of frontier language models on LABBench2 broad task families. Results for base models (without tools) and with web search and code execution tools are indicated with solid or hashed bars, respectively. view at source ↗
Figure 3
Figure 3: Performance on FigQA2 and TableQA2 across the three task modes (image-provided, paper-provided, and retrieval). view at source ↗
Figure 4
Figure 4: SeqQA2 and CloningQA performance. (A) Overall performance breakdown by sequence-input modality (inline, file, or retrieval). Note that GPT 5.2 Pro does not accept appropriate file types with the Response API, artificially deflating file mode performance. (B) Heatmap of SeqQA2 performance by subcategory across models. These results are in the default inject mode. view at source ↗
read the original abstract

Optimism for accelerating scientific discovery with AI continues to grow. Current applications of AI in scientific research range from training dedicated foundation models on scientific data to agentic autonomous hypothesis generation systems to AI-driven autonomous labs. The need to measure progress of AI systems in scientific domains correspondingly must not only accelerate, but increasingly shift focus to more real-world capabilities. Beyond rote knowledge and even just reasoning to actually measuring the ability to perform meaningful work. Prior work introduced the Language Agent Biology Benchmark LAB-Bench as an initial attempt at measuring these abilities. Here we introduce an evolution of that benchmark, LABBench2, for measuring real-world capabilities of AI systems performing useful scientific tasks. LABBench2 comprises nearly 1,900 tasks and is, for the most part, a continuation of LAB-Bench, measuring similar capabilities but in more realistic contexts. We evaluate performance of current frontier models, and show that while abilities measured by LAB-Bench and LABBench2 have improved substantially, LABBench2 provides a meaningful jump in difficulty (model-specific accuracy differences range from -26% to -46% across subtasks) and underscores continued room for performance improvement. LABBench2 continues the legacy of LAB-Bench as a de facto benchmark for AI scientific research capabilities and we hope that it continues to help advance development of AI tools for these core research functions. To facilitate community use and development, we provide the task dataset at https://huggingface.co/datasets/futurehouse/labbench2 and a public eval harness at https://github.com/EdisonScientific/labbench2.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces LABBench2 as an evolution of the authors' prior LAB-Bench benchmark, comprising nearly 1,900 tasks to evaluate frontier AI models on performing useful scientific tasks in biology research within more realistic contexts. It reports substantial performance improvements over prior benchmarks yet a meaningful increase in difficulty, with model-specific accuracy drops ranging from 26% to 46% across subtasks, and releases the dataset on Hugging Face along with a public evaluation harness.

Significance. If the tasks can be shown to validly capture real-world biology research capabilities, LABBench2 could serve as a useful de facto standard for measuring progress toward AI systems that perform meaningful scientific work rather than rote or narrowly defined reasoning. The public release of data and code strengthens reproducibility and community adoption.

major comments (2)
  1. [Abstract] Abstract and introduction: the central claim that LABBench2 measures 'real-world capabilities' and 'useful scientific tasks' in 'more realistic contexts' rests on the unvalidated premise that the ~1,900 tasks mirror genuine lab workflows; no description is provided of expert biologist review, inter-rater reliability, or correlation with actual experimental outcomes, which is load-bearing for interpreting the reported accuracy drops as evidence of improved measurement rather than design artifacts.
  2. [Abstract] Abstract: the reported 'meaningful jump in difficulty' (model-specific accuracy differences of -26% to -46%) is presented without accompanying details on task selection criteria, post-hoc filtering, or ablation of prompt/data-format confounds, making it impossible to determine whether the performance gap reflects capability limits or benchmark construction choices.
minor comments (1)
  1. [Abstract] The abstract refers to 'LAB-Bench' and 'LABBench2' inconsistently in capitalization and hyphenation; standardize throughout.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback. We address each major comment point by point below, clarifying the benchmark's construction while acknowledging areas where additional documentation is warranted.

read point-by-point responses
  1. Referee: [Abstract] Abstract and introduction: the central claim that LABBench2 measures 'real-world capabilities' and 'useful scientific tasks' in 'more realistic contexts' rests on the unvalidated premise that the ~1,900 tasks mirror genuine lab workflows; no description is provided of expert biologist review, inter-rater reliability, or correlation with actual experimental outcomes, which is load-bearing for interpreting the reported accuracy drops as evidence of improved measurement rather than design artifacts.

    Authors: We agree that the manuscript would benefit from more explicit documentation of the task curation process. LABBench2 adapts tasks from the original LAB-Bench by embedding them in realistic multi-step workflows drawn directly from published biology protocols and literature. Domain experts were consulted during development to ensure fidelity to real research practices, though inter-rater reliability statistics were not formally computed. We will add a dedicated subsection in the Methods describing the curation pipeline, including expert input and filtering criteria. Correlation with actual experimental outcomes is beyond the scope of a benchmark paper and would require separate empirical validation studies; we will note this limitation explicitly in the revised discussion. revision: yes

  2. Referee: [Abstract] Abstract: the reported 'meaningful jump in difficulty' (model-specific accuracy differences of -26% to -46%) is presented without accompanying details on task selection criteria, post-hoc filtering, or ablation of prompt/data-format confounds, making it impossible to determine whether the performance gap reflects capability limits or benchmark construction choices.

    Authors: Task selection and adaptation criteria are described in the Methods, where we explain the process of extending LAB-Bench tasks to include realistic elements such as data integration, multi-step planning, and tool use. Post-hoc filtering was limited to removing a small number of ambiguous or unsolvable items (detailed in the supplement). We did not perform prompt-format ablations in the main text because the public evaluation harness enforces a standardized format; however, we will add an appendix with sensitivity checks on prompt variations to address this concern. The consistent accuracy drops across independent frontier models support that the gap arises from increased task realism rather than construction artifacts. revision: partial
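The rebuttal promises an appendix with sensitivity checks on prompt variations. Below is a minimal sketch of what such a check could look like; everything in it is hypothetical scaffolding rather than the authors' harness: `ask_model` stands in for whatever callable queries the model under test, and the templates are illustrative, not the benchmark's actual prompt formats.

```python
# Sketch of a prompt-variation sensitivity check of the kind the rebuttal
# promises. Tasks are assumed to expose question/choices/answer fields.
from statistics import mean
from typing import Callable, Iterable, Mapping

TEMPLATES = [
    "Answer the question.\n{question}\nChoices: {choices}",
    "You are assisting a biology researcher.\n{question}\nOptions: {choices}\nReply with a single letter.",
]


def accuracy_under_template(
    tasks: Iterable[Mapping], template: str, ask_model: Callable[[str], str]
) -> float:
    hits = [
        ask_model(template.format(question=t["question"], choices=t["choices"])).strip()
        == t["answer"]
        for t in tasks
    ]
    return mean(hits)


def prompt_sensitivity(tasks, ask_model) -> float:
    # Spread of accuracy across templates; a large spread would suggest the
    # reported difficulty gap is partly a prompt-format artifact.
    tasks = list(tasks)  # materialise so every template sees the same items
    scores = [accuracy_under_template(tasks, tpl, ask_model) for tpl in TEMPLATES]
    return max(scores) - min(scores)
```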

standing simulated objections not resolved
  • Direct correlation between benchmark scores and success rates in actual wet-lab experiments, which would require resource-intensive longitudinal validation studies outside the scope of this work.

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark extension stands independently

full rationale

The paper introduces LABBench2 as a direct evolution of the authors' prior LAB-Bench, comprising ~1,900 tasks set in more realistic contexts and reporting empirical accuracy drops of 26-46% on frontier models. No equations, fitted parameters, predictions, or derivations appear in the provided text; performance differences are measured directly rather than reduced by construction to prior definitions. The self-citation to LAB-Bench supplies historical context but does not bear the load of the central claims, which rest on new task construction and fresh evaluations. The work is self-contained as an empirical benchmark paper with no load-bearing reductions to inputs or self-citation chains.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the unverified premise that the chosen tasks serve as valid proxies for real biology research work.

axioms (1)
  • domain assumption Selected tasks accurately reflect meaningful real-world biology research capabilities
    Invoked throughout the abstract when claiming the benchmark measures 'real-world capabilities' and 'meaningful work'.

pith-pipeline@v0.9.0 · 5628 in / 1124 out tokens · 50802 ms · 2026-05-16T07:14:41.579721+00:00 · methodology

discussion (0)

