Large language models are not robust multiple choice selectors

Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, Minlie Huang · 2023

2 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

LAB-Bench: Measuring Capabilities of Language Models for Biology Research

cs.AI · 2024-07-14 · accept · novelty 8.0

LAB-Bench provides over 2,400 multiple-choice questions to measure LLM performance on real biology research tasks like literature recall, figure reading, database access, and sequence manipulation, with initial results compared against human expert biologists.

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

cs.CL · 2024-06-03 · conditional · novelty 6.0

MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.

citing papers explorer

Showing 2 of 2 citing papers.

LAB-Bench: Measuring Capabilities of Language Models for Biology Research
cs.AI · 2024-07-14 · accept · none · ref 57

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
cs.CL · 2024-06-03 · conditional · none · ref 47
