BioXArena benchmarks LLM agents on generating end-to-end ML pipelines for 76 multi-modal biomedical tasks, with MLEvolve plus Gemini-3.1-Pro scoring highest at 0.666.
hub Mixed citations
LAB-Bench: Measuring Capabilities of Language Models for Biology Research
Mixed citation behavior. Most common role is background (67%).
abstract
There is widespread optimism that frontier Large Language Models (LLMs) and LLM-augmented systems have the potential to rapidly accelerate scientific discovery across disciplines. Today, many benchmarks exist to measure LLM knowledge and reasoning on textbook-style science questions, but few if any benchmarks are designed to evaluate language model performance on practical tasks required for scientific research, such as literature search, protocol planning, and data analysis. As a step toward building such benchmarks, we introduce the Language Agent Biology Benchmark (LAB-Bench), a broad dataset of over 2,400 multiple choice questions for evaluating AI systems on a range of practical biology research capabilities, including recall and reasoning over literature, interpretation of figures, access and navigation of databases, and comprehension and manipulation of DNA and protein sequences. Importantly, in contrast to previous scientific benchmarks, we expect that an AI system that can achieve consistently high scores on the more difficult LAB-Bench tasks would serve as a useful assistant for researchers in areas such as literature search and molecular cloning. As an initial assessment of the emergent scientific task capabilities of frontier language models, we measure performance of several against our benchmark and report results compared to human expert biology researchers. We will continue to update and expand LAB-Bench over time, and expect it to serve as a useful tool in the development of automated research systems going forward. A public subset of LAB-Bench is available for use at the following URL: https://huggingface.co/datasets/futurehouse/lab-bench
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.
Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.
LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.
LLMs predict outcomes of real scientific experiments at 14-26% accuracy, comparable to human experts, but lack calibration on prediction reliability while humans demonstrate strong calibration.
LLMs perform adequately on bio-molecular classification tasks but remain weak on regression, with hybrid architectures outperforming others on long sequences and fine-tuning hurting generalization.
PolyReal benchmark shows leading MLLMs perform well on polymer knowledge reasoning but drop sharply on practical tasks like lab safety analysis and raw data extraction.
BioAgent Bench is a new evaluation suite that tests AI agents on end-to-end bioinformatics pipelines and finds that frontier models often complete tasks reliably but fail under controlled perturbations like corrupted inputs or prompt bloat.
FML-Bench shows that a simple greedy hill-climber performs nearly as well as complex tree-search agents on ML research tasks, with an adaptive strategy that switches exploration modes outperforming all tested agents.
LEAPBench shows trajectory scoring changes best-model rankings on 53% of tasks, LLMs do not beat Bayesian optimization, and domain-aware prompting underperforms domain-agnostic on biology tasks aligned with published literature.
BioMedArena releases a standardized toolkit with 147 biomedical benchmarks, 75 tools, and six harnesses that achieve SOTA results on eight tasks with a +15.03 percentage point average lift.
A one-parameter scaling law models excess loss from data repetition as an additive overfitting penalty, recommending model capacity increases over excessive repetition and showing that strong weight decay reduces the penalty coefficient by ~70%.
Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.
LABBench2 is a more challenging benchmark than LAB-Bench for assessing AI performance on biology research tasks, with frontier models showing accuracy drops of 26-46% across subtasks.
RExBench is a new benchmark showing that LLM coding agents fail to autonomously implement most realistic research extensions to prior AI papers.
DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.
Develops the BSD data generation pipeline and two new datasets to evaluate decomposition attacks as effective misuse enablers and stateful defenses as a countermeasure in language model safety.
DeepER-Med introduces a three-module agentic AI workflow for evidence-based medical research that outperforms production platforms on a new expert-curated dataset of 100 questions and matches clinical recommendations in seven of eight real-world cases.
AgentCE-Bench is a lightweight grid-planning benchmark that controls task horizon via hidden slots H and difficulty via decoy budget B, validated across 13 models for consistent and discriminative evaluation.
Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.
AI tools deliver useful overviews for research exploration but prove unreliable for precise information extraction and systematic reviews due to low explainability, reproducibility, and transparency.
A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.
The paper proposes a four-role framework for LLMs in scientific innovation and reviews methods, benchmarks, and limitations across Assistant, Collaborator, Scientist, and Evaluator roles.
citing papers explorer
-
BioXArena: Benchmarking LLM Agents on Multi-Modal Biomedical Machine Learning Tasks
BioXArena benchmarks LLM agents on generating end-to-end ML pipelines for 76 multi-modal biomedical tasks, with MLEvolve plus Gemini-3.1-Pro scoring highest at 0.666.
-
Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction
Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.
-
Jailbroken Frontier Models Retain Their Capabilities
Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.
-
AI scientists produce results without reasoning scientifically
LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.
-
SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?
LLMs predict outcomes of real scientific experiments at 14-26% accuracy, comparable to human experts, but lack calibration on prediction reliability while humans demonstrate strong calibration.
-
The limits of bio-molecular modeling with large language models : a cross-scale evaluation
LLMs perform adequately on bio-molecular classification tasks but remain weak on regression, with hybrid architectures outperforming others on long sequences and fine-tuning hurting generalization.
-
PolyReal: A Benchmark for Real-World Polymer Science Workflows
PolyReal benchmark shows leading MLLMs perform well on polymer knowledge reasoning but drop sharply on practical tasks like lab safety analysis and raw data extraction.
-
BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics
BioAgent Bench is a new evaluation suite that tests AI agents on end-to-end bioinformatics pipelines and finds that frontier models often complete tasks reliably but fail under controlled perturbations like corrupted inputs or prompt bloat.
-
FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics
FML-Bench shows that a simple greedy hill-climber performs nearly as well as complex tree-search agents on ML research tasks, with an adaptive strategy that switches exploration modes outperforming all tested agents.
-
LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design
LEAPBench shows trajectory scoring changes best-model rankings on 53% of tasks, LLMs do not beat Bayesian optimization, and domain-aware prompting underperforms domain-agnostic on biology tasks aligned with published literature.
-
BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents
BioMedArena releases a standardized toolkit with 147 biomedical benchmarks, 75 tools, and six harnesses that achieve SOTA results on eight tasks with a +15.03 percentage point average lift.
-
Prescriptive Scaling Laws for Data Constrained Training
A one-parameter scaling law models excess loss from data repetition as an additive overfitting penalty, recommending model capacity increases over excessive repetition and showing that strong weight decay reduces the penalty coefficient by ~70%.
-
An Independent Safety Evaluation of Kimi K2.5
Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.
-
LABBench2: An Improved Benchmark for AI Systems Performing Biology Research
LABBench2 is a more challenging benchmark than LAB-Bench for assessing AI performance on biology research tasks, with frontier models showing accuracy drops of 26-46% across subtasks.
-
RExBench: Can coding agents autonomously implement AI research extensions?
RExBench is a new benchmark showing that LLM coding agents fail to autonomously implement most realistic research extensions to prior AI papers.
-
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.
-
Benchmarking Misuse Mitigation Against Covert Adversaries
Develops the BSD data generation pipeline and two new datasets to evaluate decomposition attacks as effective misuse enablers and stateful defenses as a countermeasure in language model safety.
-
DeepER-Med: Advancing Deep Evidence-Based Research in Medicine Through Agentic AI
DeepER-Med introduces a three-module agentic AI workflow for evidence-based medical research that outperforms production platforms on a new expert-curated dataset of 100 questions and matches clinical recommendations in seven of eight real-world cases.
-
AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments
AgentCE-Bench is a lightweight grid-planning benchmark that controls task horizon via hidden slots H and difficulty via decoy budget B, validated across 13 models for consistent and discriminative evaluation.
-
Humanity's Last Exam
Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
-
AI for Auto-Research: Roadmap & User Guide
The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.
-
Useful for Exploration, Risky for Precision: Evaluating AI Tools in Academic Research
AI tools deliver useful overviews for research exploration but prove unreliable for precise information extraction and systematic reviews due to low explainability, reproducibility, and transparency.
-
Risk Reporting for Developers' Internal AI Model Use
A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.
-
Evolving Roles of LLMs in Scientific Innovation: Assistant, Collaborator, Scientist, and Evaluator
The paper proposes a four-role framework for LLMs in scientific innovation and reviews methods, benchmarks, and limitations across Assistant, Collaborator, Scientist, and Evaluator roles.
-
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Gemini 2.5 Pro and Flash models are presented as achieving frontier performance in reasoning, coding, and long-context multimodal tasks while spanning a cost-capability Pareto curve.