hub Mixed citations

LAB-Bench: Measuring Capabilities of Language Models for Biology Research

Jon M. Laurent, Joseph D. Janizek, Michael Ruzo, Michaela M. Hinks, Michael J. Hammerling, Siddharth Narayanan · 2024 · cs.AI · arXiv 2407.10362

Mixed citation behavior. Most common role is background (67%).

25 Pith papers citing it

Background 67% of classified citations

open full Pith review browse 25 citing papers arXiv PDF

abstract

There is widespread optimism that frontier Large Language Models (LLMs) and LLM-augmented systems have the potential to rapidly accelerate scientific discovery across disciplines. Today, many benchmarks exist to measure LLM knowledge and reasoning on textbook-style science questions, but few if any benchmarks are designed to evaluate language model performance on practical tasks required for scientific research, such as literature search, protocol planning, and data analysis. As a step toward building such benchmarks, we introduce the Language Agent Biology Benchmark (LAB-Bench), a broad dataset of over 2,400 multiple choice questions for evaluating AI systems on a range of practical biology research capabilities, including recall and reasoning over literature, interpretation of figures, access and navigation of databases, and comprehension and manipulation of DNA and protein sequences. Importantly, in contrast to previous scientific benchmarks, we expect that an AI system that can achieve consistently high scores on the more difficult LAB-Bench tasks would serve as a useful assistant for researchers in areas such as literature search and molecular cloning. As an initial assessment of the emergent scientific task capabilities of frontier language models, we measure performance of several against our benchmark and report results compared to human expert biology researchers. We will continue to update and expand LAB-Bench over time, and expect it to serve as a useful tool in the development of automated research systems going forward. A public subset of LAB-Bench is available for use at the following URL: https://huggingface.co/datasets/futurehouse/lab-bench

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 6 dataset 3

citation-polarity summary

background 6 use dataset 3

representative citing papers

BioXArena: Benchmarking LLM Agents on Multi-Modal Biomedical Machine Learning Tasks

cs.CE · 2026-05-15 · unverdicted · novelty 7.0

BioXArena benchmarks LLM agents on generating end-to-end ML pipelines for 76 multi-modal biomedical tasks, with MLEvolve plus Gemini-3.1-Pro scoring highest at 0.666.

Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.

Jailbroken Frontier Models Retain Their Capabilities

cs.LG · 2026-04-30 · unverdicted · novelty 7.0

Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.

AI scientists produce results without reasoning scientifically

cs.AI · 2026-04-20 · conditional · novelty 7.0

LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.

SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?

cs.AI · 2026-04-12 · unverdicted · novelty 7.0

LLMs predict outcomes of real scientific experiments at 14-26% accuracy, comparable to human experts, but lack calibration on prediction reliability while humans demonstrate strong calibration.

The limits of bio-molecular modeling with large language models : a cross-scale evaluation

cs.LG · 2026-04-03 · unverdicted · novelty 7.0

LLMs perform adequately on bio-molecular classification tasks but remain weak on regression, with hybrid architectures outperforming others on long sequences and fine-tuning hurting generalization.

PolyReal: A Benchmark for Real-World Polymer Science Workflows

cs.CV · 2026-04-03 · unverdicted · novelty 7.0

PolyReal benchmark shows leading MLLMs perform well on polymer knowledge reasoning but drop sharply on practical tasks like lab safety analysis and raw data extraction.

BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics

cs.AI · 2026-01-29 · accept · novelty 7.0

BioAgent Bench is a new evaluation suite that tests AI agents on end-to-end bioinformatics pipelines and finds that frontier models often complete tasks reliably but fail under controlled perturbations like corrupted inputs or prompt bloat.

FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics

cs.LG · 2026-05-17 · accept · novelty 6.0

FML-Bench shows that a simple greedy hill-climber performs nearly as well as complex tree-search agents on ML research tasks, with an adaptive strategy that switches exploration modes outperforming all tested agents.

LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

LEAPBench shows trajectory scoring changes best-model rankings on 53% of tasks, LLMs do not beat Bayesian optimization, and domain-aware prompting underperforms domain-agnostic on biology tasks aligned with published literature.

BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents

cs.AI · 2026-05-07 · conditional · novelty 6.0

BioMedArena releases a standardized toolkit with 147 biomedical benchmarks, 75 tools, and six harnesses that achieve SOTA results on eight tasks with a +15.03 percentage point average lift.

Prescriptive Scaling Laws for Data Constrained Training

cs.LG · 2026-05-02 · unverdicted · novelty 6.0

A one-parameter scaling law models excess loss from data repetition as an additive overfitting penalty, recommending model capacity increases over excessive repetition and showing that strong weight decay reduces the penalty coefficient by ~70%.

An Independent Safety Evaluation of Kimi K2.5

cs.CR · 2026-04-03 · conditional · novelty 6.0

Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.

LABBench2: An Improved Benchmark for AI Systems Performing Biology Research

cs.AI · 2026-02-04 · unverdicted · novelty 6.0

LABBench2 is a more challenging benchmark than LAB-Bench for assessing AI performance on biology research tasks, with frontier models showing accuracy drops of 26-46% across subtasks.

RExBench: Can coding agents autonomously implement AI research extensions?

cs.CL · 2025-06-27 · unverdicted · novelty 6.0

RExBench is a new benchmark showing that LLM coding agents fail to autonomously implement most realistic research extensions to prior AI papers.

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

cs.CL · 2025-06-13 · conditional · novelty 6.0

DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.

Benchmarking Misuse Mitigation Against Covert Adversaries

cs.CR · 2025-06-06 · unverdicted · novelty 6.0

Develops the BSD data generation pipeline and two new datasets to evaluate decomposition attacks as effective misuse enablers and stateful defenses as a countermeasure in language model safety.

DeepER-Med: Advancing Deep Evidence-Based Research in Medicine Through Agentic AI

cs.AI · 2026-04-16 · unverdicted · novelty 5.0

DeepER-Med introduces a three-module agentic AI workflow for evidence-based medical research that outperforms production platforms on a new expert-curated dataset of 100 questions and matches clinical recommendations in seven of eight real-world cases.

AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments

cs.AI · 2026-04-07 · unverdicted · novelty 5.0

AgentCE-Bench is a lightweight grid-planning benchmark that controls task horizon via hidden slots H and difficulty via decoy budget B, validated across 13 models for consistent and discriminative evaluation.

Humanity's Last Exam

cs.LG · 2025-01-24 · unverdicted · novelty 5.0

Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.

AI for Auto-Research: Roadmap & User Guide

cs.AI · 2026-05-18 · unverdicted · novelty 4.0

The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.

Useful for Exploration, Risky for Precision: Evaluating AI Tools in Academic Research

cs.AI · 2026-05-11 · unverdicted · novelty 4.0 · 2 refs

AI tools deliver useful overviews for research exploration but prove unreliable for precise information extraction and systematic reviews due to low explainability, reproducibility, and transparency.

Risk Reporting for Developers' Internal AI Model Use

cs.CY · 2026-04-27 · unverdicted · novelty 4.0

A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.

Evolving Roles of LLMs in Scientific Innovation: Assistant, Collaborator, Scientist, and Evaluator

cs.DL · 2025-07-16 · unverdicted · novelty 4.0

The paper proposes a four-role framework for LLMs in scientific innovation and reviews methods, benchmarks, and limitations across Assistant, Collaborator, Scientist, and Evaluator roles.

citing papers explorer

Showing 25 of 25 citing papers.

BioXArena: Benchmarking LLM Agents on Multi-Modal Biomedical Machine Learning Tasks cs.CE · 2026-05-15 · unverdicted · none · ref 7 · internal anchor
BioXArena benchmarks LLM agents on generating end-to-end ML pipelines for 76 multi-modal biomedical tasks, with MLEvolve plus Gemini-3.1-Pro scoring highest at 0.666.
Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction cs.LG · 2026-05-13 · unverdicted · none · ref 14 · internal anchor
Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.
Jailbroken Frontier Models Retain Their Capabilities cs.LG · 2026-04-30 · unverdicted · none · ref 15 · internal anchor
Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.
AI scientists produce results without reasoning scientifically cs.AI · 2026-04-20 · conditional · none · ref 3 · internal anchor
LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.
SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences? cs.AI · 2026-04-12 · unverdicted · none · ref 26 · internal anchor
LLMs predict outcomes of real scientific experiments at 14-26% accuracy, comparable to human experts, but lack calibration on prediction reliability while humans demonstrate strong calibration.
The limits of bio-molecular modeling with large language models : a cross-scale evaluation cs.LG · 2026-04-03 · unverdicted · none · ref 27 · internal anchor
LLMs perform adequately on bio-molecular classification tasks but remain weak on regression, with hybrid architectures outperforming others on long sequences and fine-tuning hurting generalization.
PolyReal: A Benchmark for Real-World Polymer Science Workflows cs.CV · 2026-04-03 · unverdicted · none · ref 26 · internal anchor
PolyReal benchmark shows leading MLLMs perform well on polymer knowledge reasoning but drop sharply on practical tasks like lab safety analysis and raw data extraction.
BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics cs.AI · 2026-01-29 · accept · none · ref 1 · internal anchor
BioAgent Bench is a new evaluation suite that tests AI agents on end-to-end bioinformatics pipelines and finds that frontier models often complete tasks reliably but fail under controlled perturbations like corrupted inputs or prompt bloat.
FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics cs.LG · 2026-05-17 · accept · none · ref 26 · internal anchor
FML-Bench shows that a simple greedy hill-climber performs nearly as well as complex tree-search agents on ML research tasks, with an adaptive strategy that switches exploration modes outperforming all tested agents.
LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design cs.LG · 2026-05-14 · unverdicted · none · ref 7 · internal anchor
LEAPBench shows trajectory scoring changes best-model rankings on 53% of tasks, LLMs do not beat Bayesian optimization, and domain-aware prompting underperforms domain-agnostic on biology tasks aligned with published literature.
BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents cs.AI · 2026-05-07 · conditional · none · ref 21 · internal anchor
BioMedArena releases a standardized toolkit with 147 biomedical benchmarks, 75 tools, and six harnesses that achieve SOTA results on eight tasks with a +15.03 percentage point average lift.
Prescriptive Scaling Laws for Data Constrained Training cs.LG · 2026-05-02 · unverdicted · none · ref 27 · internal anchor
A one-parameter scaling law models excess loss from data repetition as an additive overfitting penalty, recommending model capacity increases over excessive repetition and showing that strong weight decay reduces the penalty coefficient by ~70%.
An Independent Safety Evaluation of Kimi K2.5 cs.CR · 2026-04-03 · conditional · none · ref 22 · internal anchor
Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.
LABBench2: An Improved Benchmark for AI Systems Performing Biology Research cs.AI · 2026-02-04 · unverdicted · none · ref 13 · internal anchor
LABBench2 is a more challenging benchmark than LAB-Bench for assessing AI performance on biology research tasks, with frontier models showing accuracy drops of 26-46% across subtasks.
RExBench: Can coding agents autonomously implement AI research extensions? cs.CL · 2025-06-27 · unverdicted · none · ref 27 · internal anchor
RExBench is a new benchmark showing that LLM coding agents fail to autonomously implement most realistic research extensions to prior AI papers.
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents cs.CL · 2025-06-13 · conditional · none · ref 12 · internal anchor
DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.
Benchmarking Misuse Mitigation Against Covert Adversaries cs.CR · 2025-06-06 · unverdicted · none · ref 45 · internal anchor
Develops the BSD data generation pipeline and two new datasets to evaluate decomposition attacks as effective misuse enablers and stateful defenses as a countermeasure in language model safety.
DeepER-Med: Advancing Deep Evidence-Based Research in Medicine Through Agentic AI cs.AI · 2026-04-16 · unverdicted · none · ref 14 · internal anchor
DeepER-Med introduces a three-module agentic AI workflow for evidence-based medical research that outperforms production platforms on a new expert-curated dataset of 100 questions and matches clinical recommendations in seven of eight real-world cases.
AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments cs.AI · 2026-04-07 · unverdicted · none · ref 9 · internal anchor
AgentCE-Bench is a lightweight grid-planning benchmark that controls task horizon via hidden slots H and difficulty via decoy budget B, validated across 13 models for consistent and discriminative evaluation.
Humanity's Last Exam cs.LG · 2025-01-24 · unverdicted · none · ref 32 · internal anchor
Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
AI for Auto-Research: Roadmap & User Guide cs.AI · 2026-05-18 · unverdicted · none · ref 96 · internal anchor
The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.
Useful for Exploration, Risky for Precision: Evaluating AI Tools in Academic Research cs.AI · 2026-05-11 · unverdicted · none · ref 19 · 2 links · internal anchor
AI tools deliver useful overviews for research exploration but prove unreliable for precise information extraction and systematic reviews due to low explainability, reproducibility, and transparency.
Risk Reporting for Developers' Internal AI Model Use cs.CY · 2026-04-27 · unverdicted · none · ref 24 · internal anchor
A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.
Evolving Roles of LLMs in Scientific Innovation: Assistant, Collaborator, Scientist, and Evaluator cs.DL · 2025-07-16 · unverdicted · none · ref 82 · internal anchor
The paper proposes a four-role framework for LLMs in scientific innovation and reviews methods, benchmarks, and limitations across Assistant, Collaborator, Scientist, and Evaluator roles.
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities cs.CL · 2025-07-07 · unverdicted · none · ref 45 · internal anchor
Gemini 2.5 Pro and Flash models are presented as achieving frontier performance in reasoning, coding, and long-context multimodal tasks while spanning a cost-capability Pareto curve.

LAB-Bench: Measuring Capabilities of Language Models for Biology Research

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer