Recognition: 3 theorem links
LAB-Bench: Measuring Capabilities of Language Models for Biology Research
Pith reviewed 2026-05-16 15:54 UTC · model grok-4.3
The pith
LAB-Bench introduces over 2,400 questions to test AI on practical biology research tasks such as literature search and sequence manipulation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LAB-Bench is introduced as a broad dataset of over 2,400 multiple-choice questions for evaluating AI systems on practical biology research capabilities: recall and reasoning over literature, interpretation of figures, access and navigation of databases, and comprehension and manipulation of DNA and protein sequences. The authors expect that an AI system scoring consistently high on the more difficult tasks would serve as a useful assistant for researchers.
What carries the argument
The LAB-Bench dataset of multiple-choice questions organized into categories for literature, figures, databases, and sequences.
If this is right
- Models that score high on difficult tasks could assist researchers with literature search.
- High performance may indicate readiness to help with molecular cloning and protocol planning.
- The benchmark provides a direct comparison between current AI systems and human biology experts.
- Ongoing updates to the dataset will track improvements in AI scientific task performance.
- Developers can use the results to prioritize capabilities needed for automated research tools.
Where Pith is reading between the lines
- If the benchmark holds up, it could steer AI development toward specific lab-support functions.
- Comparable benchmarks could be built for chemistry or physics using the same task categories.
- Real validation would require checking whether AI-generated plans succeed in physical experiments.
- Expanding beyond multiple-choice to open-ended tasks would better test full research workflows.
Load-bearing premise
Success on these multiple-choice questions reflects the practical skills needed for real biology research rather than surface-level pattern matching.
What would settle it
A language model that scores high on LAB-Bench but fails to produce workable results when used to plan an actual molecular cloning experiment would show the benchmark does not capture practical capability.
read the original abstract
There is widespread optimism that frontier Large Language Models (LLMs) and LLM-augmented systems have the potential to rapidly accelerate scientific discovery across disciplines. Today, many benchmarks exist to measure LLM knowledge and reasoning on textbook-style science questions, but few if any benchmarks are designed to evaluate language model performance on practical tasks required for scientific research, such as literature search, protocol planning, and data analysis. As a step toward building such benchmarks, we introduce the Language Agent Biology Benchmark (LAB-Bench), a broad dataset of over 2,400 multiple choice questions for evaluating AI systems on a range of practical biology research capabilities, including recall and reasoning over literature, interpretation of figures, access and navigation of databases, and comprehension and manipulation of DNA and protein sequences. Importantly, in contrast to previous scientific benchmarks, we expect that an AI system that can achieve consistently high scores on the more difficult LAB-Bench tasks would serve as a useful assistant for researchers in areas such as literature search and molecular cloning. As an initial assessment of the emergent scientific task capabilities of frontier language models, we measure performance of several against our benchmark and report results compared to human expert biology researchers. We will continue to update and expand LAB-Bench over time, and expect it to serve as a useful tool in the development of automated research systems going forward. A public subset of LAB-Bench is available for use at the following URL: https://huggingface.co/datasets/futurehouse/lab-bench
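The abstract's evaluation setup, multiple-choice accuracy with room for a model to decline to answer, can be sketched as a small scorer. This is an illustrative sketch, not the paper's code; the refusal string and the three metrics (accuracy, coverage, precision) are assumptions modeled on standard reject-option evaluation:

```python
# Hypothetical scorer for an MCQ benchmark that permits refusals.
# accuracy  = correct / all questions
# coverage  = fraction of questions the model chose to answer
# precision = accuracy restricted to the answered subset

def score(predictions, answers, refusal="insufficient information"):
    total = len(answers)
    answered = [(p, a) for p, a in zip(predictions, answers) if p != refusal]
    correct = sum(p == a for p, a in answered)
    return {
        "accuracy": correct / total if total else 0.0,
        "coverage": len(answered) / total if total else 0.0,
        "precision": correct / len(answered) if answered else 0.0,
    }

# Four questions: one refusal, two correct among the three answered.
result = score(["A", "C", "insufficient information", "B"],
               ["A", "B", "D", "B"])
# accuracy 0.5, coverage 0.75, precision ~0.667
```

A scorer of this shape makes the trade-off explicit: a model that refuses when unsure gives up accuracy but can keep precision high, which is the property that matters for a research assistant.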
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LAB-Bench, a dataset of over 2,400 multiple-choice questions targeting practical biology research capabilities such as literature recall and reasoning, figure interpretation, database navigation, and DNA/protein sequence manipulation. It provides initial evaluations of several frontier LLMs, compares them to human expert baselines, and claims that consistently high performance on the harder tasks would indicate an AI system could serve as a useful research assistant for tasks like literature search and molecular cloning. A public subset is released on Hugging Face.
Significance. If the benchmark questions prove to be faithful proxies for real research workflows, the work would address a clear gap between existing textbook-style science benchmarks and applied scientific tasks. The public data release supports reproducibility and future extensions. However, the significance is constrained by the lack of any evidence that benchmark scores transfer to open-ended research performance.
major comments (2)
- [Abstract] The central claim that 'an AI system that can achieve consistently high scores on the more difficult LAB-Bench tasks would serve as a useful assistant for researchers in areas such as literature search and molecular cloning' is unsupported. No transfer experiments, correlation studies, or open-ended task evaluations are reported to link MCQ accuracy to actual research utility.
- [Evaluation and Results (inferred from abstract and dataset description)] The manuscript provides no details on question validation procedures, inter-rater reliability for the human expert baselines, or controls for potential data leakage from training corpora. These omissions directly affect the interpretability of the reported model-versus-human comparisons.
minor comments (1)
- [Abstract] The abstract states 'we will continue to update and expand LAB-Bench over time' without specifying a versioning or maintenance plan; a brief statement on update criteria would improve clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment below and outline revisions that will be incorporated to improve clarity and rigor.
read point-by-point responses
-
Referee: [Abstract] The central claim that 'an AI system that can achieve consistently high scores on the more difficult LAB-Bench tasks would serve as a useful assistant for researchers in areas such as literature search and molecular cloning' is unsupported. No transfer experiments, correlation studies, or open-ended task evaluations are reported to link MCQ accuracy to actual research utility.
Authors: We agree that the manuscript provides no direct empirical evidence of transfer from LAB-Bench performance to open-ended research tasks. The claim is presented as a motivating hypothesis grounded in the benchmark's design, which draws tasks from authentic research workflows such as literature navigation and sequence manipulation. To address this concern without overstatement, we will revise the abstract to frame the statement explicitly as a hypothesis for future validation rather than an established implication. We will also expand the discussion section to acknowledge the current lack of transfer studies and to outline planned follow-up work on correlating benchmark scores with real-world research outcomes. revision: yes
-
Referee: [Evaluation and Results (inferred from abstract and dataset description)] The manuscript provides no details on question validation procedures, inter-rater reliability for the human expert baselines, or controls for potential data leakage from training corpora. These omissions directly affect the interpretability of the reported model-versus-human comparisons.
Authors: We appreciate this observation and acknowledge that these methodological details were insufficiently described. In the revised manuscript we will add a dedicated subsection on benchmark construction that details the question validation workflow, including multi-expert review processes. We will report inter-rater reliability statistics (e.g., Cohen's kappa or percentage agreement) for the human expert baselines and describe controls implemented to limit data leakage, such as the use of post-2023 literature sources and explicit checks against common pre-training corpora. These additions will enhance transparency and allow readers to better interpret the model-versus-human comparisons. revision: yes
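The inter-rater statistic proposed in the response above, Cohen's kappa, is straightforward to compute from two annotators' labels. A minimal generic sketch (not code from the paper; assumes the raters do not agree purely by chance on every item):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters emit the same label at random.
    expected = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Perfect agreement yields kappa = 1.0.
assert cohens_kappa(["A", "B", "A", "C"], ["A", "B", "A", "C"]) == 1.0
```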
Circularity Check
No circularity: LAB-Bench is an independent benchmark introduction with no derived claims reducing to inputs
full rationale
The paper introduces a new dataset of over 2,400 multiple-choice questions targeting practical biology tasks such as literature recall, figure interpretation, database navigation, and sequence manipulation. No equations, fitted parameters, or derivation chains exist in the manuscript. The statement that high scores on difficult tasks would indicate utility as a research assistant is presented explicitly as an expectation rather than a result derived from prior elements of the work. No self-citations are load-bearing for any central premise, and the evaluation protocol is defined directly from the collected questions without reduction to fitted inputs or renamed known results. The work is therefore self-contained as an empirical benchmark release.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Multiple-choice question performance correlates with the ability to perform open-ended research tasks
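Testing this assumption empirically would amount to correlating benchmark scores with open-ended task outcomes across systems. A minimal Spearman rank-correlation sketch, with made-up numbers purely for illustration:

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation (no tie correction) between two score lists."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical: five systems' MCQ scores vs. open-ended success rates.
# Identical orderings give rho = 1.0 (the axiom's best case).
assert spearman_rho([0.3, 0.5, 0.7, 0.8, 0.9],
                    [0.1, 0.2, 0.4, 0.6, 0.7]) == 1.0
```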
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DAlembert.Inevitability.bilinear_family_forced · unclear
unclear: relation between the paper passage and the cited Recognition theorem.
we introduce the Language Agent Biology Benchmark (LAB-Bench), a broad dataset of over 2,400 multiple choice questions for evaluating AI systems on a range of practical biology research capabilities, including recall and reasoning over literature, interpretation of figures, access and navigation of databases, and comprehension and manipulation of DNA and protein sequences
-
IndisputableMonolith.Foundation.HierarchyEmergence.hierarchy_emergence_forces_phi · unclear
unclear: relation between the paper passage and the cited Recognition theorem.
we expect that an AI system that can achieve consistently high scores on the more difficult LAB-Bench tasks would serve as a useful assistant for researchers in areas such as literature search and molecular cloning
-
IndisputableMonolith.Foundation.LedgerCanonicality.uniform_scaling_forced · unclear
unclear: relation between the paper passage and the cited Recognition theorem.
As an initial assessment of the emergent scientific task capabilities of frontier language models, we measure performance of several against our benchmark and report results compared to human expert biology researchers
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 18 Pith papers
-
Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction
Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.
-
Jailbroken Frontier Models Retain Their Capabilities
Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.
-
AI scientists produce results without reasoning scientifically
LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.
-
SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?
LLMs predict outcomes of real scientific experiments at 14-26% accuracy, comparable to human experts, but lack calibration on prediction reliability while humans demonstrate strong calibration.
-
The limits of bio-molecular modeling with large language models : a cross-scale evaluation
LLMs perform adequately on bio-molecular classification tasks but remain weak on regression, with hybrid architectures outperforming others on long sequences and fine-tuning hurting generalization.
-
PolyReal: A Benchmark for Real-World Polymer Science Workflows
PolyReal benchmark shows leading MLLMs perform well on polymer knowledge reasoning but drop sharply on practical tasks like lab safety analysis and raw data extraction.
-
BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics
BioAgent Bench is a new evaluation suite that tests AI agents on end-to-end bioinformatics pipelines and finds that frontier models often complete tasks reliably but fail under controlled perturbations like corrupted ...
-
BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents
BioMedArena releases a standardized toolkit with 147 biomedical benchmarks, 75 tools, and six harnesses that achieve SOTA results on eight tasks with a +15.03 percentage point average lift.
-
Prescriptive Scaling Laws for Data Constrained Training
A one-parameter scaling law models excess loss from data repetition as an additive overfitting penalty, recommending model capacity increases over excessive repetition and showing that strong weight decay reduces the ...
-
An Independent Safety Evaluation of Kimi K2.5
Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.
-
LABBench2: An Improved Benchmark for AI Systems Performing Biology Research
LABBench2 is a more challenging benchmark than LAB-Bench for assessing AI performance on biology research tasks, with frontier models showing accuracy drops of 26-46% across subtasks.
-
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.
-
DeepER-Med: Advancing Deep Evidence-Based Research in Medicine Through Agentic AI
DeepER-Med introduces a three-module agentic AI workflow for evidence-based medical research that outperforms production platforms on a new expert-curated dataset of 100 questions and matches clinical recommendations ...
-
AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments
AgentCE-Bench is a lightweight grid-planning benchmark that controls task horizon via hidden slots H and difficulty via decoy budget B, validated across 13 models for consistent and discriminative evaluation.
-
Humanity's Last Exam
Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
-
Useful for Exploration, Risky for Precision: Evaluating AI Tools in Academic Research
AI Q&A tools give useful overviews but fail at precise information extraction and source tracing, while literature review tools aid exploration yet lack reproducibility and transparency, making them unsuitable for sys...
-
Useful for Exploration, Risky for Precision: Evaluating AI Tools in Academic Research
AI tools deliver useful overviews for research exploration but prove unreliable for precise information extraction and systematic reviews due to low explainability, reproducibility, and transparency.
-
Risk Reporting for Developers' Internal AI Model Use
A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.
Reference graph
Works this paper leans on
-
[1]
Joanna S Amberger, Carol A Bocchini, François Schiettecatte, Alan F Scott, and Ada Hamosh. OMIM.org: Online Mendelian Inheritance in Man (OMIM®), an online catalog of human genes and genetic disorders. Nucleic Acids Research, 43(D1):D789–D798, 2015
work page 2015
-
[2]
Introducing the next generation of claude, March 2024
Anthropic. Introducing the next generation of claude, March 2024. URL https://www.anthropic.com/news/claude-3-family. Accessed: 2024-06-11
work page 2024
-
[3]
Introducing the next generation of claude, March 2024
Anthropic. Introducing the next generation of claude, March 2024. URL https://www.anthropic.com/news/claude-3-5-sonnet. Accessed: 2024-07-11
work page 2024
-
[4]
Lessons from the Trenches on Reproducible Evaluation of Language Models
Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, et al. Lessons from the trenches on reproducible evaluation of language models. arXiv preprint arXiv:2405.14782, 2024
work page Pith review arXiv 2024
-
[5]
Autonomous chemical research with large language models
Daniil A. Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models. Nature, 624(7992):570–578, Dec 2023. ISSN 1476-4687. doi: 10.1038/s41586-023-06792-0. URL https://doi.org/10.1038/s41586-023-06792-0
-
[6]
On the opportunities and risks of foundation models
Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stef...
work page 2022
-
[7]
Sparks of artificial general intelligence: Early experiments with gpt-4, 2023
Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. Sparks of artificial general intelligence: Early experiments with gpt-4, 2023
work page 2023
-
[8]
mirdb: an online database for prediction of functional microrna targets
Yuhao Chen and Xiaowei Wang. mirdb: an online database for prediction of functional microrna targets. Nucleic acids research, 48(D1):D127–D131, 2020
work page 2020
-
[9]
Peter J. A. Cock, Tiago Antao, Jeffrey T. Chang, Brad A. Chapman, Cymon J. Cox, Andrew Dalke, Iddo Friedberg, Thomas Hamelryck, Frank Kauff, Bartek Wilczynski, and Michiel J. L. de Hoon. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11):1422–1423, 03 2009. ISSN 1367-4803. doi: 10.1093/ ...
-
[10]
Ape, a plasmid editor: A freely available dna manipulation and visualization program
M. Wayne Davis and Erik M. Jorgensen. Ape, a plasmid editor: A freely available dna manipulation and visualization program. Frontiers in Bioinformatics, 2, 2022. doi: 10.3389/fbinf.2022.818619. URL https://www.frontiersin.org/articles/10.3389/fbinf.2022.818619
work page 2022
-
[12]
What’s going on with the open llm leaderboard? Hugging Face, June 2023
Clémentine Fourrier, Nathan Habib, Julien Launay, and Thomas Wolf. What’s going on with the open llm leaderboard? Hugging Face, June 2023. URL https://huggingface.co/blog/open-llm-leaderboard-mmlu
work page 2023
-
[13]
Introducing gemini 1.5, google’s next-generation ai model, February 2024
Google. Introducing gemini 1.5, google’s next-generation ai model, February 2024. URL https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/. Accessed: 2024-06-11
work page 2024
-
[14]
Peter W Harrison, M Ridwan Amode, Olanrewaju Austine-Orimoloye, Andrey G Azov, Matthieu Barba, If Barnes, Arne Becker, Ruth Bennett, Andrew Berry, Jyothish Bhai, et al. Ensembl 2024. Nucleic Acids Research, 52(D1):D891–D899, 2024
work page 2024
-
[15]
Machine learning with a reject option: A survey
Kilian Hendrickx, Lorenzo Perini, Dries Van der Plas, Wannes Meert, and Jesse Davis. Machine learning with a reject option: A survey. Machine Learning, 113(5):3073–3110, 2024
work page 2024
-
[16]
Measuring massive multitask language understanding, 2021
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021
work page 2021
-
[17]
Geneturing tests gpt models in genomics
Wenpin Hou and Zhicheng Ji. Geneturing tests gpt models in genomics. BioRxiv, 2023
work page 2023
-
[18]
Leveraging large language models for predictive chemistry
Kevin Maik Jablonka, Philippe Schwaller, Andres Ortega-Guerrero, and Berend Smit. Leveraging large language models for predictive chemistry. Nat. Mach. Intell., 6(2):161–169, February 2024
work page 2024
-
[19]
Samy Jelassi, David Brandfonbrener, Sham M. Kakade, and Eran Malach. Repeat after me: Transformers are better than state space models at copying, 2024
work page 2024
-
[20]
Swe-bench: Can language models resolve real-world github issues?
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024
work page 2024
-
[21]
Qiao Jin, Yifan Yang, Qingyu Chen, and Zhiyong Lu. GeneGPT: augmenting large language models with domain tools for improved access to biomedical information. Bioinformatics, 40(2):btae075, 02 2024. ISSN 1367-4811. doi: 10.1093/bioinformatics/btae075. URL https://doi.org/10.1093/bioinformatics/btae075
-
[22]
Highly accurate protein structure prediction with AlphaFold
John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A A Kohl, Andrew J Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reiman,...
work page 2021
-
[23]
Daniel Martin Katz, Michael James Bommarito, Shang Gao, and Pablo Arredondo. GPT-4 passes the bar exam. SSRN Electron. J., 2023
work page 2023
-
[24]
BioASQ-QA: A manually curated corpus for biomedical question answering
Anastasia Krithara, Anastasios Nentidis, Konstantinos Bougiatiotis, and Georgios Paliouras. BioASQ-QA: A manually curated corpus for biomedical question answering. Sci. Data, 10(1): 170, March 2023
work page 2023
-
[25]
Enrichr: a comprehensive gene set enrichment analysis web server 2016 update
Maxim V Kuleshov, Matthew R Jones, Andrew D Rouillard, Nicolas F Fernandez, Qiaonan Duan, Zichen Wang, Simon Koplev, Sherry L Jenkins, Kathleen M Jagodnik, Alexander Lachmann, et al. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic acids research, 44(W1):W90–W97, 2016
work page 2016
-
[26]
Tiffany H Kung, Morgan Cheatham, Arielle Medenilla, Czarina Sillos, Lorie De Leon, Camille Elepaño, Maria Madriaga, Rimel Aggabao, Giezel Diaz-Candido, James Maningo, and Victor Tseng. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit. Health, 2(2):e0000198, February 2023
work page 2023
-
[27]
Clinvar: public archive of interpretations of clinically relevant variants
Melissa J Landrum, Jennifer M Lee, Mark Benson, Garth Brown, Chen Chao, Shanmuga Chitipiralla, Baoshan Gu, Jennifer Hart, Douglas Hoffman, Jeffrey Hoover, et al. Clinvar: public archive of interpretations of clinically relevant variants. Nucleic acids research, 44(D1): D862–D868, 2016
work page 2016
-
[28]
A structure-informed atlas of human-virus interactions
Gorka Lasso, Sandra V Mayer, Evandro R Winkelmann, Tim Chu, Oliver Elliot, Juan Angel Patino-Galindo, Kernyu Park, Raul Rabadan, Barry Honig, and Sagi D Shapira. A structure-informed atlas of human-virus interactions. Cell, 178(6):1526–1541, 2019
work page 2019
-
[29]
Holistic evaluation of language models
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D Manning, Christopher Ré, Diana Acosta-Navas, Drew A Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu R...
work page 2022
-
[30]
Molecular signatures database (msigdb) 3.0
Arthur Liberzon, Aravind Subramanian, Reid Pinchback, Helga Thorvaldsdóttir, Pablo Tamayo, and Jill P Mesirov. Molecular signatures database (msigdb) 3.0. Bioinformatics, 27(12): 1739–1740, 2011
work page 2011
-
[31]
Jakub Lála, Odhran O’Donoghue, Aleksandar Shtedritski, Sam Cox, Samuel G. Rodriques, and Andrew D. White. Paperqa: Retrieval-augmented generative agent for scientific research, 2023
work page 2023
-
[32]
Augmenting large language models with chemistry tools
Andres M. Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D. White, and Philippe Schwaller. Augmenting large language models with chemistry tools. Nature Machine Intelligence, 6(5):525–535, May 2024. ISSN 2522-5839. doi: 10.1038/s42256-024-00832-8. URL https://doi.org/10.1038/s42256-024-00832-8
-
[33]
Scaling deep learning for materials discovery
Amil Merchant, Simon Batzner, Samuel S Schoenholz, Muratahan Aykol, Gowoon Cheon, and Ekin Dogus Cubuk. Scaling deep learning for materials discovery. Nature, 624(7990):80–85, December 2023
work page 2023
-
[34]
Chat: meta-llama/meta-llama-3-70b-instruct, April 2024
Meta. Chat: meta-llama/meta-llama-3-70b-instruct, April 2024. URL https://docs.anyscale.com/endpoints/text-generation/supported-models/meta-llama-Meta-Llama-3-70B-Instruct. Accessed: 2024-06-11
work page 2024
-
[35]
Introducing meta llama 3: The most capable openly available llm to date, April 2024
Meta. Introducing meta llama 3: The most capable openly available llm to date, April 2024. URL https://ai.meta.com/blog/meta-llama-3/. Accessed: 2024-06-11
work page 2024
-
[36]
Are large language models superhuman chemists?
Adrian Mirza, Nawaf Alampara, Sreekanth Kunchapu, Benedict Emoekabu, Aswanth Krishnan, Mara Wilhelmi, Macjonathan Okereke, Juliane Eberhardt, Amir Mohammad Elahi, Maximilian Greiner, Caroline T. Holick, Tanya Gupta, Mehrdad Asgari, Christina Glaubitz, Lea C. Klepsch, Yannik Köster, Jakob Meyer, Santiago Miret, Tim Hoffmann, Fabian Alexander Kreth, Michael...
work page 2024
-
[37]
Eric Nguyen, Michael Poli, Matthew G. Durrant, Armin W. Thomas, Brian Kang, Jeremy Sullivan, Madelena Y. Ng, Ashley Lewis, Aman Patel, Aaron Lou, Stefano Ermon, Stephen A. Baccus, Tina Hernandez-Boussard, Christopher Ré, Patrick D. Hsu, and Brian L. Hie. Sequence modeling and design from molecular to genome scale with evo. bioRxiv, 2024. doi: 10.1101/20...
work page 2024
-
[38]
scite: A smart citation index that displays the context of citations and classifies their intent using deep learning
Josh M. Nicholson, Milo Mordaunt, Patrice Lopez, Ashish Uppala, Domenic Rosati, Neves P. Rodrigues, Peter Grabitz, and Sean C. Rife. scite: A smart citation index that displays the context of citations and classifies their intent using deep learning. Quantitative Science Studies, 2(3):882–898, 11 2021. ISSN 2641-3337. doi: 10.1162/qss_a_00146. URL https:/...
-
[39]
Proteingym: Large-scale benchmarks for protein fitness prediction and design
Pascal Notin, Aaron Kollasch, Daniel Ritter, Lood van Niekerk, Steffanie Paul, Han Spinner, Nathan Rollins, Ada Shaw, Rose Orenbuch, Ruben Weitzman, Jonathan Frazer, Mafalda Dias, Dinko Franceschi, Yarin Gal, and Debora Marks. Proteingym: Large-scale benchmarks for protein fitness prediction and design. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. H...
work page 2023
-
[40]
OpenAI. Hello gpt-4o, May 2024. URL https://openai.com/index/hello-gpt-4o/. Accessed: 2024-06-11
work page 2024
-
[41]
OpenAI. Gpt-4 and gpt-4 turbo, 2024. URL https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo. Accessed: 2024-06-11
work page 2024
-
[42]
OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Bern...
work page 2024
-
[43]
Janet Piñero, Àlex Bravo, Núria Queralt-Rosinach, Alba Gutiérrez-Sacristán, Jordi Deu-Pons, Emilio Centeno, Javier García-García, Ferran Sanz, and Laura I Furlong. Disgenet: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic acids research, page gkw943, 2016
work page 2016
-
[44]
Ai-assisted coding: Experiments with gpt-4, 2023
Russell A Poldrack, Thomas Lu, and Gašper Beguš. Ai-assisted coding: Experiments with gpt-4, 2023
work page 2023
-
[45]
The hugo gene nomenclature committee (hgnc)
Sue Povey, Ruth Lovering, Elspeth Bruford, Mathew Wright, Michael Lush, and Hester Wain. The hugo gene nomenclature committee (hgnc). Human genetics, 109:678–680, 2001
work page 2001
-
[46]
David Rein. Can good benchmarks contain mistakes? https://wp.nyu.edu/arg/can-good-benchmarks-contain-mistakes/, 2024. Accessed: 2024-05-20
work page 2024
-
[47]
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark, 2023. URL https://arxiv.org/abs/2311.12022
work page Pith review arXiv 2023
-
[48]
Biollmbench: A comprehensive benchmarking of large language models in bioinformatics
Varuni Sarwal, Viorel Munteanu, Timur Suhodolschi, Dumitru Ciorba, Eleazar Eskin, Wei Wang, and Serghei Mangul. Biollmbench: A comprehensive benchmarking of large language models in bioinformatics. bioRxiv, 2023. doi: 10.1101/2023.12.19.572483. URL https://www.biorxiv.org/content/early/2023/12/20/2023.12.19.572483
-
[49]
Large language models encode clinical knowledge
Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Abubakr Babiker, Nathanael Schärli, Aakanksha Chowdhery, Philip Mansfield, Dina Demner-Fushman, Blaise Agüera Y Arcas, Dale Webster, Greg S Corrado,...
work page 2023
-
[50]
The mammalian phenotype ontology: enabling robust annotation and comparative analysis
Cynthia L Smith and Janan T Eppig. The mammalian phenotype ontology: enabling robust annotation and comparative analysis. Wiley Interdisciplinary Reviews: Systems Biology and Medicine, 1(3):390–399, 2009
work page 2009
-
[51]
Beyond the imitation game: Quantifying and extrapolating the capabilities of language models
Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Ask...
work page 2022
-
[52]
Delocalized, asynchronous, closed-loop discovery of organic laser emitters
Felix Strieth-Kalthoff, Han Hao, Vandana Rathore, Joshua Derasp, Théophile Gaudin, Nicholas H Angello, Martin Seifrid, Ekaterina Trushina, Mason Guy, Junliang Liu, Xun Tang, Masashi Mamada, Wesley Wang, Tuul Tsagaantsooj, Cyrille Lavigne, Robert Pollice, Tony C Wu, Kazuhiro Hotta, Leticia Bodo, Shangyu Li, Mohammad Haddadnia, Agnieszka Wołos, Rafał Roszak...
work page 2024
-
[53]
Galactica: A large language model for science
Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. Galactica: A large language model for science. arXiv, 2022
work page 2022
-
[54]
De novo design of protein structure and function with RFdiffusion
Joseph L Watson, David Juergens, Nathaniel R Bennett, Brian L Trippe, Jason Yim, Helen E Eisenach, Woody Ahern, Andrew J Borst, Robert J Ragotte, Lukas F Milles, Basile I M Wicky, Nikita Hanikel, Samuel J Pellock, Alexis Courbet, William Sheffler, Jue Wang, Preetham Venkatesh, Isaac Sappington, Susana Vázquez Torres, Anna Lauko, Valentin De Bortoli, Emile...
work page 2023
-
[55]
Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs
Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms. arXiv preprint arXiv:2306.13063, 2023
work page Pith review arXiv 2023
-
[56]
Gtrd: a database of transcription factor binding sites identified by chip-seq experiments
Ivan Yevshin, Ruslan Sharipov, Tagir Valeev, Alexander Kel, and Fedor Kolpakov. Gtrd: a database of transcription factor binding sites identified by chip-seq experiments. Nucleic acids research, page gkw951, 2016
work page 2016
-
[57]
Large language models are not robust multiple choice selectors
Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. Large language models are not robust multiple choice selectors. In The Twelfth International Conference on Learning Representations, 2023
work page 2023
-
[58]
Benchmarking large language models for molecule prediction tasks, 2024
Zhiqiang Zhong, Kuangyu Zhou, and Davide Mottin. Benchmarking large language models for molecule prediction tasks, 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.