LAB-Bench provides over 2,400 multiple-choice questions to measure LLM performance on real biology research tasks like literature recall, figure reading, database access, and sequence manipulation, with initial results compared against human expert biologists.
Large language models are not robust multiple choice selectors
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
background 1
citation-polarity summary
years
2024 2roles
background 1polarities
background 1representative citing papers
MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.
citing papers explorer
-
LAB-Bench: Measuring Capabilities of Language Models for Biology Research
LAB-Bench provides over 2,400 multiple-choice questions to measure LLM performance on real biology research tasks like literature recall, figure reading, database access, and sequence manipulation, with initial results compared against human expert biologists.
-
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.