pith. machine review for the scientific record.

arxiv: 2407.10362 · v3 · submitted 2024-07-14 · 💻 cs.AI

Recognition: 3 theorem links


LAB-Bench: Measuring Capabilities of Language Models for Biology Research

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 15:54 UTC · model grok-4.3

classification 💻 cs.AI
keywords: LAB-Bench · language models · biology research · AI benchmark · literature search · molecular cloning · scientific discovery · DNA sequences

The pith

LAB-Bench introduces over 2,400 questions to test AI on practical biology research tasks such as literature search and sequence manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LAB-Bench, a dataset of more than 2,400 multiple-choice questions that measure language models on skills required for actual biology research work. These questions cover recalling and reasoning from literature, interpreting figures, navigating databases, and handling DNA and protein sequences. Frontier models are evaluated on the benchmark and compared directly to human expert biologists. The authors state that consistent high performance on the harder tasks would indicate the AI could function as a useful assistant for researchers in areas like literature search and molecular cloning. This benchmark is offered as a tool to guide the creation of automated research systems.
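
To make the model-versus-expert comparison concrete, here is a minimal sketch of the kind of multiple-choice grading such an evaluation implies, including an abstention option. The abstain marker and the metric names are illustrative assumptions, not details taken from the paper.

    # Sketch: grading multiple-choice answers when a model may abstain.
    # The abstain marker and metric definitions are illustrative assumptions.
    from dataclasses import dataclass

    ABSTAIN = "insufficient information"  # assumed abstention marker

    @dataclass
    class Scores:
        accuracy: float   # correct / all questions
        precision: float  # correct / questions actually answered
        coverage: float   # answered / all questions

    def grade(predictions: list[str], gold: list[str]) -> Scores:
        answered = [(p, g) for p, g in zip(predictions, gold) if p.lower() != ABSTAIN]
        correct = sum(p == g for p, g in answered)
        n = len(gold)
        return Scores(
            accuracy=correct / n,
            precision=correct / len(answered) if answered else 0.0,
            coverage=len(answered) / n,
        )

    # Toy usage: two answered questions (one correct) and one abstention.
    print(grade(["B", "insufficient information", "C"], ["B", "A", "D"]))

Abstention-aware precision and coverage matter here because a model that declines to answer hard questions should not be scored the same way as one that guesses.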

Core claim

LAB-Bench is introduced as a broad dataset of over 2,400 multiple-choice questions for evaluating AI systems on practical biology research capabilities: recall and reasoning over literature, interpretation of figures, access and navigation of databases, and comprehension and manipulation of DNA and protein sequences. The expectation is that consistently high scores on the more difficult tasks would indicate the system could serve as a useful assistant for researchers.

What carries the argument

The LAB-Bench dataset of multiple-choice questions organized into categories for literature, figures, databases, and sequences.

If this is right

  • Models that score high on difficult tasks could assist researchers with literature search.
  • High performance may indicate readiness to help with molecular cloning and protocol planning.
  • The benchmark provides a direct comparison between current AI systems and human biology experts.
  • Ongoing updates to the dataset will track improvements in AI scientific task performance.
  • Developers can use the results to prioritize capabilities needed for automated research tools.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the benchmark holds up, it could steer AI development toward specific lab-support functions.
  • Comparable benchmarks could be built for chemistry or physics using the same task categories.
  • Real validation would require checking whether AI-generated plans succeed in physical experiments.
  • Expanding beyond multiple-choice to open-ended tasks would better test full research workflows.

Load-bearing premise

Success on these multiple-choice questions reflects the practical skills needed for real biology research rather than surface-level pattern matching.

What would settle it

A language model that scores high on LAB-Bench but fails to produce workable results when used to plan an actual molecular cloning experiment would show the benchmark does not capture practical capability.

original abstract

There is widespread optimism that frontier Large Language Models (LLMs) and LLM-augmented systems have the potential to rapidly accelerate scientific discovery across disciplines. Today, many benchmarks exist to measure LLM knowledge and reasoning on textbook-style science questions, but few if any benchmarks are designed to evaluate language model performance on practical tasks required for scientific research, such as literature search, protocol planning, and data analysis. As a step toward building such benchmarks, we introduce the Language Agent Biology Benchmark (LAB-Bench), a broad dataset of over 2,400 multiple choice questions for evaluating AI systems on a range of practical biology research capabilities, including recall and reasoning over literature, interpretation of figures, access and navigation of databases, and comprehension and manipulation of DNA and protein sequences. Importantly, in contrast to previous scientific benchmarks, we expect that an AI system that can achieve consistently high scores on the more difficult LAB-Bench tasks would serve as a useful assistant for researchers in areas such as literature search and molecular cloning. As an initial assessment of the emergent scientific task capabilities of frontier language models, we measure performance of several against our benchmark and report results compared to human expert biology researchers. We will continue to update and expand LAB-Bench over time, and expect it to serve as a useful tool in the development of automated research systems going forward. A public subset of LAB-Bench is available for use at the following URL: https://huggingface.co/datasets/futurehouse/lab-bench
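
The abstract points to a public subset on the Hugging Face Hub; below is a minimal sketch of how it might be loaded with the Hugging Face datasets library. The repository id comes from the abstract's URL, while the config name "LitQA2" is an assumption and may differ from the actual subset names.

    # Sketch: pulling the public LAB-Bench subset from the Hugging Face Hub.
    # Repository id from the abstract; the config name "LitQA2" is an assumption.
    from datasets import load_dataset

    ds = load_dataset("futurehouse/lab-bench", "LitQA2")

    print(ds)                               # available splits and features
    first_split = next(iter(ds.values()))   # whichever split the subset exposes
    print(first_split[0])                   # inspect one question record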

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces LAB-Bench, a dataset of over 2,400 multiple-choice questions targeting practical biology research capabilities such as literature recall and reasoning, figure interpretation, database navigation, and DNA/protein sequence manipulation. It provides initial evaluations of several frontier LLMs, compares them to human expert baselines, and claims that consistently high performance on the harder tasks would indicate an AI system could serve as a useful research assistant for tasks like literature search and molecular cloning. A public subset is released on Hugging Face.

Significance. If the benchmark questions prove to be faithful proxies for real research workflows, the work would address a clear gap between existing textbook-style science benchmarks and applied scientific tasks. The public data release supports reproducibility and future extensions. However, the significance is constrained by the lack of any evidence that benchmark scores transfer to open-ended research performance.

major comments (2)
  1. [Abstract] The central claim that 'an AI system that can achieve consistently high scores on the more difficult LAB-Bench tasks would serve as a useful assistant for researchers in areas such as literature search and molecular cloning' is unsupported. No transfer experiments, correlation studies, or open-ended task evaluations are reported to link MCQ accuracy to actual research utility.
  2. [Evaluation and Results (inferred from abstract and dataset description)] The manuscript provides no details on question validation procedures, inter-rater reliability for the human expert baselines, or controls for potential data leakage from training corpora. These omissions directly affect the interpretability of the reported model-versus-human comparisons.
minor comments (1)
  1. [Abstract] The abstract states 'we will continue to update and expand LAB-Bench over time' without specifying a versioning or maintenance plan; a brief statement on update criteria would improve clarity.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and outline revisions that will be incorporated to improve clarity and rigor.

point-by-point responses
  1. Referee: [Abstract] The central claim that 'an AI system that can achieve consistently high scores on the more difficult LAB-Bench tasks would serve as a useful assistant for researchers in areas such as literature search and molecular cloning' is unsupported. No transfer experiments, correlation studies, or open-ended task evaluations are reported to link MCQ accuracy to actual research utility.

    Authors: We agree that the manuscript provides no direct empirical evidence of transfer from LAB-Bench performance to open-ended research tasks. The claim is presented as a motivating hypothesis grounded in the benchmark's design, which draws tasks from authentic research workflows such as literature navigation and sequence manipulation. To address this concern without overstatement, we will revise the abstract to frame the statement explicitly as a hypothesis for future validation rather than an established implication. We will also expand the discussion section to acknowledge the current lack of transfer studies and to outline planned follow-up work on correlating benchmark scores with real-world research outcomes.
    Revision: yes

  2. Referee: [Evaluation and Results (inferred from abstract and dataset description)] The manuscript provides no details on question validation procedures, inter-rater reliability for the human expert baselines, or controls for potential data leakage from training corpora. These omissions directly affect the interpretability of the reported model-versus-human comparisons.

    Authors: We appreciate this observation and acknowledge that these methodological details were insufficiently described. In the revised manuscript we will add a dedicated subsection on benchmark construction that details the question validation workflow, including multi-expert review processes. We will report inter-rater reliability statistics (e.g., Cohen's kappa or percentage agreement) for the human expert baselines and describe controls implemented to limit data leakage, such as the use of post-2023 literature sources and explicit checks against common pre-training corpora. These additions will enhance transparency and allow readers to better interpret the model-versus-human comparisons.
    Revision: yes
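
As a companion to the inter-rater reliability statistics the simulated rebuttal promises, here is a minimal sketch of Cohen's kappa and raw percentage agreement between two annotators, using scikit-learn. The label vectors are placeholders, not LAB-Bench data.

    # Sketch: inter-rater reliability for two hypothetical expert annotators.
    # Label vectors are illustrative placeholders, not LAB-Bench data.
    from sklearn.metrics import cohen_kappa_score

    expert_a = ["A", "B", "B", "C", "A", "D", "B", "C"]
    expert_b = ["A", "B", "C", "C", "A", "D", "B", "B"]

    kappa = cohen_kappa_score(expert_a, expert_b)
    agreement = sum(a == b for a, b in zip(expert_a, expert_b)) / len(expert_a)

    print(f"Cohen's kappa: {kappa:.2f}")         # chance-corrected agreement
    print(f"Percent agreement: {agreement:.0%}")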

Circularity Check

0 steps flagged

No circularity: LAB-Bench is an independent benchmark introduction with no derived claims reducing to inputs

full rationale

The paper introduces a new dataset of over 2,400 multiple-choice questions targeting practical biology tasks such as literature recall, figure interpretation, database navigation, and sequence manipulation. No equations, fitted parameters, or derivation chains exist in the manuscript. The statement that high scores on difficult tasks would indicate utility as a research assistant is presented explicitly as an expectation rather than a result derived from prior elements of the work. No self-citations are load-bearing for any central premise, and the evaluation protocol is defined directly from the collected questions without reduction to fitted inputs or renamed known results. The work is therefore self-contained as an empirical benchmark release.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on the standard assumption that multiple-choice accuracy is a valid proxy for research capability; no free parameters, invented entities, or non-standard axioms are introduced.

axioms (1)
  • domain assumption: Multiple-choice question performance correlates with the ability to perform open-ended research tasks.
    Invoked when claiming that high scores would indicate a useful research assistant.

pith-pipeline@v0.9.0 · 5607 in / 1214 out tokens · 31446 ms · 2026-05-16T15:54:22.695332+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced · unclear

    Relation between the paper passage and the cited Recognition theorem.

    we introduce the Language Agent Biology Benchmark (LAB-Bench), a broad dataset of over 2,400 multiple choice questions for evaluating AI systems on a range of practical biology research capabilities, including recall and reasoning over literature, interpretation of figures, access and navigation of databases, and comprehension and manipulation of DNA and protein sequences

  • IndisputableMonolith.Foundation.HierarchyEmergence hierarchy_emergence_forces_phi · unclear

    Relation between the paper passage and the cited Recognition theorem.

    we expect that an AI system that can achieve consistently high scores on the more difficult LAB-Bench tasks would serve as a useful assistant for researchers in areas such as literature search and molecular cloning

  • IndisputableMonolith.Foundation.LedgerCanonicality uniform_scaling_forced · unclear

    Relation between the paper passage and the cited Recognition theorem.

    As an initial assessment of the emergent scientific task capabilities of frontier language models, we measure performance of several against our benchmark and report results compared to human expert biology researchers

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction

    cs.LG 2026-05 unverdicted novelty 7.0

    Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.

  2. Jailbroken Frontier Models Retain Their Capabilities

    cs.LG 2026-04 unverdicted novelty 7.0

    Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.

  3. AI scientists produce results without reasoning scientifically

    cs.AI 2026-04 conditional novelty 7.0

    LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.

  4. SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?

    cs.AI 2026-04 unverdicted novelty 7.0

    LLMs predict outcomes of real scientific experiments at 14-26% accuracy, comparable to human experts, but lack calibration on prediction reliability while humans demonstrate strong calibration.

  5. The limits of bio-molecular modeling with large language models : a cross-scale evaluation

    cs.LG 2026-04 unverdicted novelty 7.0

    LLMs perform adequately on bio-molecular classification tasks but remain weak on regression, with hybrid architectures outperforming others on long sequences and fine-tuning hurting generalization.

  6. PolyReal: A Benchmark for Real-World Polymer Science Workflows

    cs.CV 2026-04 unverdicted novelty 7.0

    PolyReal benchmark shows leading MLLMs perform well on polymer knowledge reasoning but drop sharply on practical tasks like lab safety analysis and raw data extraction.

  7. BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics

    cs.AI 2026-01 accept novelty 7.0

    BioAgent Bench is a new evaluation suite that tests AI agents on end-to-end bioinformatics pipelines and finds that frontier models often complete tasks reliably but fail under controlled perturbations like corrupted ...

  8. BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents

    cs.AI 2026-05 conditional novelty 6.0

    BioMedArena releases a standardized toolkit with 147 biomedical benchmarks, 75 tools, and six harnesses that achieve SOTA results on eight tasks with a +15.03 percentage point average lift.

  9. Prescriptive Scaling Laws for Data Constrained Training

    cs.LG 2026-05 unverdicted novelty 6.0

    A one-parameter scaling law models excess loss from data repetition as an additive overfitting penalty, recommending model capacity increases over excessive repetition and showing that strong weight decay reduces the ...

  10. An Independent Safety Evaluation of Kimi K2.5

    cs.CR 2026-04 conditional novelty 6.0

    Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.

  11. LABBench2: An Improved Benchmark for AI Systems Performing Biology Research

    cs.AI 2026-02 unverdicted novelty 6.0

    LABBench2 is a more challenging benchmark than LAB-Bench for assessing AI performance on biology research tasks, with frontier models showing accuracy drops of 26-46% across subtasks.

  12. DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

    cs.CL 2025-06 conditional novelty 6.0

    DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.

  13. DeepER-Med: Advancing Deep Evidence-Based Research in Medicine Through Agentic AI

    cs.AI 2026-04 unverdicted novelty 5.0

    DeepER-Med introduces a three-module agentic AI workflow for evidence-based medical research that outperforms production platforms on a new expert-curated dataset of 100 questions and matches clinical recommendations ...

  14. AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments

    cs.AI 2026-04 unverdicted novelty 5.0

    AgentCE-Bench is a lightweight grid-planning benchmark that controls task horizon via hidden slots H and difficulty via decoy budget B, validated across 13 models for consistent and discriminative evaluation.

  15. Humanity's Last Exam

    cs.LG 2025-01 unverdicted novelty 5.0

    Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.

  16. Useful for Exploration, Risky for Precision: Evaluating AI Tools in Academic Research

    cs.AI 2026-05 unverdicted novelty 4.0

    AI Q&A tools give useful overviews but fail at precise information extraction and source tracing, while literature review tools aid exploration yet lack reproducibility and transparency, making them unsuitable for sys...

  17. Useful for Exploration, Risky for Precision: Evaluating AI Tools in Academic Research

    cs.AI 2026-05 unverdicted novelty 4.0

    AI tools deliver useful overviews for research exploration but prove unreliable for precise information extraction and systematic reviews due to low explainability, reproducibility, and transparency.

  18. Risk Reporting for Developers' Internal AI Model Use

    cs.CY 2026-04 unverdicted novelty 4.0

    A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.
