Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.
Title resolution pending
3 Pith papers cite this work. Polarity classification is still indexing.
3
Pith papers citing it
citation-role summary
background 1
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
LT2 introduces looped transformers with linear-time attention (linear, sparse, and hybrid variants) that match or exceed standard looped transformer quality at linear complexity, including a converted 1.4B model competitive with larger industry models.
DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
citing papers explorer
-
Measuring Massive Multitask Language Understanding
Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.