pith. machine review for the scientific record.

arxiv: 2407.10362 · v3 · submitted 2024-07-14 · 💻 cs.AI

Recognition: 3 theorem links


LAB-Bench: Measuring Capabilities of Language Models for Biology Research

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 15:54 UTC · model grok-4.3

classification 💻 cs.AI
keywords: LAB-Bench · language models · biology research · AI benchmark · literature search · molecular cloning · scientific discovery · DNA sequences

The pith

LAB-Bench introduces over 2,400 questions to test AI on practical biology research tasks such as literature search and sequence manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LAB-Bench, a dataset of more than 2,400 multiple-choice questions that measure language models on skills required for actual biology research work. These questions cover recalling and reasoning from literature, interpreting figures, navigating databases, and handling DNA and protein sequences. Frontier models are evaluated on the benchmark and compared directly to human expert biologists. The authors state that consistent high performance on the harder tasks would indicate the AI could function as a useful assistant for researchers in areas like literature search and molecular cloning. This benchmark is offered as a tool to guide the creation of automated research systems.
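
To make the model-versus-expert comparison concrete, here is a minimal sketch of the kind of multiple-choice grading such an evaluation implies, including an abstention option. The abstain marker and the metric names are illustrative assumptions, not details taken from the paper.

    # Sketch: grading multiple-choice answers when a model may abstain.
    # The abstain marker and metric definitions are illustrative assumptions.
    from dataclasses import dataclass

    ABSTAIN = "insufficient information"  # assumed abstention marker

    @dataclass
    class Scores:
        accuracy: float   # correct / all questions
        precision: float  # correct / questions actually answered
        coverage: float   # answered / all questions

    def grade(predictions: list[str], gold: list[str]) -> Scores:
        answered = [(p, g) for p, g in zip(predictions, gold) if p.lower() != ABSTAIN]
        correct = sum(p == g for p, g in answered)
        n = len(gold)
        return Scores(
            accuracy=correct / n,
            precision=correct / len(answered) if answered else 0.0,
            coverage=len(answered) / n,
        )

    # Toy usage: two answered questions (one correct) and one abstention.
    print(grade(["B", "insufficient information", "C"], ["B", "A", "D"]))

Abstention-aware precision and coverage matter here because a model that declines to answer hard questions should not be scored the same way as one that guesses.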

Core claim

LAB-Bench is introduced as a broad dataset of over 2,400 multiple-choice questions for evaluating AI systems on practical biology research capabilities: recall and reasoning over literature, interpretation of figures, access and navigation of databases, and comprehension and manipulation of DNA and protein sequences. The expectation is that consistently high scores on the more difficult tasks would indicate the system could serve as a useful assistant for researchers.

What carries the argument

The LAB-Bench dataset of multiple-choice questions organized into categories for literature, figures, databases, and sequences.

If this is right

  • Models that score high on difficult tasks could assist researchers with literature search.
  • High performance may indicate readiness to help with molecular cloning and protocol planning.
  • The benchmark provides a direct comparison between current AI systems and human biology experts.
  • Ongoing updates to the dataset will track improvements in AI scientific task performance.
  • Developers can use the results to prioritize capabilities needed for automated research tools.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the benchmark holds up, it could steer AI development toward specific lab-support functions.
  • Comparable benchmarks could be built for chemistry or physics using the same task categories.
  • Real validation would require checking whether AI-generated plans succeed in physical experiments.
  • Expanding beyond multiple-choice to open-ended tasks would better test full research workflows.

Load-bearing premise

Success on these multiple-choice questions reflects the practical skills needed for real biology research rather than surface-level pattern matching.

What would settle it

A language model that scores high on LAB-Bench but fails to produce workable results when used to plan an actual molecular cloning experiment would show the benchmark does not capture practical capability.

original abstract

There is widespread optimism that frontier Large Language Models (LLMs) and LLM-augmented systems have the potential to rapidly accelerate scientific discovery across disciplines. Today, many benchmarks exist to measure LLM knowledge and reasoning on textbook-style science questions, but few if any benchmarks are designed to evaluate language model performance on practical tasks required for scientific research, such as literature search, protocol planning, and data analysis. As a step toward building such benchmarks, we introduce the Language Agent Biology Benchmark (LAB-Bench), a broad dataset of over 2,400 multiple choice questions for evaluating AI systems on a range of practical biology research capabilities, including recall and reasoning over literature, interpretation of figures, access and navigation of databases, and comprehension and manipulation of DNA and protein sequences. Importantly, in contrast to previous scientific benchmarks, we expect that an AI system that can achieve consistently high scores on the more difficult LAB-Bench tasks would serve as a useful assistant for researchers in areas such as literature search and molecular cloning. As an initial assessment of the emergent scientific task capabilities of frontier language models, we measure performance of several against our benchmark and report results compared to human expert biology researchers. We will continue to update and expand LAB-Bench over time, and expect it to serve as a useful tool in the development of automated research systems going forward. A public subset of LAB-Bench is available for use at the following URL: https://huggingface.co/datasets/futurehouse/lab-bench
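
The abstract points to a public subset on the Hugging Face Hub; below is a minimal sketch of how it might be loaded with the Hugging Face datasets library. The repository id comes from the abstract's URL, while the config name "LitQA2" is an assumption and may differ from the actual subset names.

    # Sketch: pulling the public LAB-Bench subset from the Hugging Face Hub.
    # Repository id from the abstract; the config name "LitQA2" is an assumption.
    from datasets import load_dataset

    ds = load_dataset("futurehouse/lab-bench", "LitQA2")

    print(ds)                               # available splits and features
    first_split = next(iter(ds.values()))   # whichever split the subset exposes
    print(first_split[0])                   # inspect one question record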

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces LAB-Bench, a dataset of over 2,400 multiple-choice questions targeting practical biology research capabilities such as literature recall and reasoning, figure interpretation, database navigation, and DNA/protein sequence manipulation. It provides initial evaluations of several frontier LLMs, compares them to human expert baselines, and claims that consistently high performance on the harder tasks would indicate an AI system could serve as a useful research assistant for tasks like literature search and molecular cloning. A public subset is released on Hugging Face.

Significance. If the benchmark questions prove to be faithful proxies for real research workflows, the work would address a clear gap between existing textbook-style science benchmarks and applied scientific tasks. The public data release supports reproducibility and future extensions. However, the significance is constrained by the lack of any evidence that benchmark scores transfer to open-ended research performance.

major comments (2)
  1. [Abstract] The central claim that 'an AI system that can achieve consistently high scores on the more difficult LAB-Bench tasks would serve as a useful assistant for researchers in areas such as literature search and molecular cloning' is unsupported. No transfer experiments, correlation studies, or open-ended task evaluations are reported to link MCQ accuracy to actual research utility.
  2. [Evaluation and Results (inferred from abstract and dataset description)] The manuscript provides no details on question validation procedures, inter-rater reliability for the human expert baselines, or controls for potential data leakage from training corpora. These omissions directly affect the interpretability of the reported model-versus-human comparisons.
minor comments (1)
  1. [Abstract] The abstract states 'we will continue to update and expand LAB-Bench over time' without specifying a versioning or maintenance plan; a brief statement on update criteria would improve clarity.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and outline revisions that will be incorporated to improve clarity and rigor.

point-by-point responses
  1. Referee: [Abstract] The central claim that 'an AI system that can achieve consistently high scores on the more difficult LAB-Bench tasks would serve as a useful assistant for researchers in areas such as literature search and molecular cloning' is unsupported. No transfer experiments, correlation studies, or open-ended task evaluations are reported to link MCQ accuracy to actual research utility.

    Authors: We agree that the manuscript provides no direct empirical evidence of transfer from LAB-Bench performance to open-ended research tasks. The claim is presented as a motivating hypothesis grounded in the benchmark's design, which draws tasks from authentic research workflows such as literature navigation and sequence manipulation. To address this concern without overstatement, we will revise the abstract to frame the statement explicitly as a hypothesis for future validation rather than an established implication. We will also expand the discussion section to acknowledge the current lack of transfer studies and to outline planned follow-up work on correlating benchmark scores with real-world research outcomes.
    Revision: yes

  2. Referee: [Evaluation and Results (inferred from abstract and dataset description)] The manuscript provides no details on question validation procedures, inter-rater reliability for the human expert baselines, or controls for potential data leakage from training corpora. These omissions directly affect the interpretability of the reported model-versus-human comparisons.

    Authors: We appreciate this observation and acknowledge that these methodological details were insufficiently described. In the revised manuscript we will add a dedicated subsection on benchmark construction that details the question validation workflow, including multi-expert review processes. We will report inter-rater reliability statistics (e.g., Cohen's kappa or percentage agreement) for the human expert baselines and describe controls implemented to limit data leakage, such as the use of post-2023 literature sources and explicit checks against common pre-training corpora. These additions will enhance transparency and allow readers to better interpret the model-versus-human comparisons.
    Revision: yes
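
As a companion to the inter-rater reliability statistics the simulated rebuttal promises, here is a minimal sketch of Cohen's kappa and raw percentage agreement between two annotators, using scikit-learn. The label vectors are placeholders, not LAB-Bench data.

    # Sketch: inter-rater reliability for two hypothetical expert annotators.
    # Label vectors are illustrative placeholders, not LAB-Bench data.
    from sklearn.metrics import cohen_kappa_score

    expert_a = ["A", "B", "B", "C", "A", "D", "B", "C"]
    expert_b = ["A", "B", "C", "C", "A", "D", "B", "B"]

    kappa = cohen_kappa_score(expert_a, expert_b)
    agreement = sum(a == b for a, b in zip(expert_a, expert_b)) / len(expert_a)

    print(f"Cohen's kappa: {kappa:.2f}")         # chance-corrected agreement
    print(f"Percent agreement: {agreement:.0%}")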

Circularity Check

0 steps flagged

No circularity: LAB-Bench is an independent benchmark introduction with no derived claims reducing to inputs

full rationale

The paper introduces a new dataset of over 2,400 multiple-choice questions targeting practical biology tasks such as literature recall, figure interpretation, database navigation, and sequence manipulation. No equations, fitted parameters, or derivation chains exist in the manuscript. The statement that high scores on difficult tasks would indicate utility as a research assistant is presented explicitly as an expectation rather than a result derived from prior elements of the work. No self-citations are load-bearing for any central premise, and the evaluation protocol is defined directly from the collected questions without reduction to fitted inputs or renamed known results. The work is therefore self-contained as an empirical benchmark release.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on the standard assumption that multiple-choice accuracy is a valid proxy for research capability; no free parameters, invented entities, or non-standard axioms are introduced.

axioms (1)
  • domain assumption: Multiple-choice question performance correlates with the ability to perform open-ended research tasks.
    Invoked when claiming that high scores would indicate a useful research assistant.

pith-pipeline@v0.9.0 · 5607 in / 1214 out tokens · 31446 ms · 2026-05-16T15:54:22.695332+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced · unclear

    Relation between the paper passage and the cited Recognition theorem.

    we introduce the Language Agent Biology Benchmark (LAB-Bench), a broad dataset of over 2,400 multiple choice questions for evaluating AI systems on a range of practical biology research capabilities, including recall and reasoning over literature, interpretation of figures, access and navigation of databases, and comprehension and manipulation of DNA and protein sequences

  • IndisputableMonolith.Foundation.HierarchyEmergence hierarchy_emergence_forces_phi · unclear

    Relation between the paper passage and the cited Recognition theorem.

    we expect that an AI system that can achieve consistently high scores on the more difficult LAB-Bench tasks would serve as a useful assistant for researchers in areas such as literature search and molecular cloning

  • IndisputableMonolith.Foundation.LedgerCanonicality uniform_scaling_forced · unclear

    Relation between the paper passage and the cited Recognition theorem.

    As an initial assessment of the emergent scientific task capabilities of frontier language models, we measure performance of several against our benchmark and report results compared to human expert biology researchers

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction

    cs.LG 2026-05 unverdicted novelty 7.0

    Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.

  2. Jailbroken Frontier Models Retain Their Capabilities

    cs.LG 2026-04 unverdicted novelty 7.0

    Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.

  3. AI scientists produce results without reasoning scientifically

    cs.AI 2026-04 conditional novelty 7.0

    LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.

  4. SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?

    cs.AI 2026-04 unverdicted novelty 7.0

    LLMs predict outcomes of real scientific experiments at 14-26% accuracy, comparable to human experts, but lack calibration on prediction reliability while humans demonstrate strong calibration.

  5. The limits of bio-molecular modeling with large language models : a cross-scale evaluation

    cs.LG 2026-04 unverdicted novelty 7.0

    LLMs perform adequately on bio-molecular classification tasks but remain weak on regression, with hybrid architectures outperforming others on long sequences and fine-tuning hurting generalization.

  6. PolyReal: A Benchmark for Real-World Polymer Science Workflows

    cs.CV 2026-04 unverdicted novelty 7.0

    PolyReal benchmark shows leading MLLMs perform well on polymer knowledge reasoning but drop sharply on practical tasks like lab safety analysis and raw data extraction.

  7. BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics

    cs.AI 2026-01 accept novelty 7.0

    BioAgent Bench is a new evaluation suite that tests AI agents on end-to-end bioinformatics pipelines and finds that frontier models often complete tasks reliably but fail under controlled perturbations like corrupted ...

  8. BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents

    cs.AI 2026-05 conditional novelty 6.0

    BioMedArena releases a standardized toolkit with 147 biomedical benchmarks, 75 tools, and six harnesses that achieve SOTA results on eight tasks with a +15.03 percentage point average lift.

  9. Prescriptive Scaling Laws for Data Constrained Training

    cs.LG 2026-05 unverdicted novelty 6.0

    A one-parameter scaling law models excess loss from data repetition as an additive overfitting penalty, recommending model capacity increases over excessive repetition and showing that strong weight decay reduces the ...

  10. An Independent Safety Evaluation of Kimi K2.5

    cs.CR 2026-04 conditional novelty 6.0

    Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.

  11. LABBench2: An Improved Benchmark for AI Systems Performing Biology Research

    cs.AI 2026-02 unverdicted novelty 6.0

    LABBench2 is a more challenging benchmark than LAB-Bench for assessing AI performance on biology research tasks, with frontier models showing accuracy drops of 26-46% across subtasks.

  12. DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

    cs.CL 2025-06 conditional novelty 6.0

    DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.

  13. DeepER-Med: Advancing Deep Evidence-Based Research in Medicine Through Agentic AI

    cs.AI 2026-04 unverdicted novelty 5.0

    DeepER-Med introduces a three-module agentic AI workflow for evidence-based medical research that outperforms production platforms on a new expert-curated dataset of 100 questions and matches clinical recommendations ...

  14. AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments

    cs.AI 2026-04 unverdicted novelty 5.0

    AgentCE-Bench is a lightweight grid-planning benchmark that controls task horizon via hidden slots H and difficulty via decoy budget B, validated across 13 models for consistent and discriminative evaluation.

  15. Humanity's Last Exam

    cs.LG 2025-01 unverdicted novelty 5.0

    Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.

  16. Useful for Exploration, Risky for Precision: Evaluating AI Tools in Academic Research

    cs.AI 2026-05 unverdicted novelty 4.0

    AI Q&A tools give useful overviews but fail at precise information extraction and source tracing, while literature review tools aid exploration yet lack reproducibility and transparency, making them unsuitable for sys...

  17. Useful for Exploration, Risky for Precision: Evaluating AI Tools in Academic Research

    cs.AI 2026-05 unverdicted novelty 4.0

    AI tools deliver useful overviews for research exploration but prove unreliable for precise information extraction and systematic reviews due to low explainability, reproducibility, and transparency.

  18. Risk Reporting for Developers' Internal AI Model Use

    cs.CY 2026-04 unverdicted novelty 4.0

    A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.
