Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.
4 Pith papers cite this work.
citing papers explorer
- Measuring Massive Multitask Language Understanding
  Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.
- SNLP: Layer-Parallel Inference via Structured Newton Corrections
  SNLP enables layer-parallel Transformer inference by replacing sequential layer execution with structured Newton corrections and SNLP-aware training regularization, yielding up to 2.3x wall-clock speedup on 0.5B models while improving perplexity.
- Learning World Graphs to Accelerate Hierarchical Reinforcement Learning
  A two-stage framework learns a world graph of pivotal states task-agnostically via joint training of a latent model and curiosity-driven policy, then uses the graph to accelerate hierarchical RL on maze tasks.
- Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training
  Proposes low-rank orthogonalization and derives low-rank Muon and MSGD variants that outperform standard Muon on GPT-2 and LLaMA pretraining while providing iteration complexity bounds.