Measuring Massive Multitask Language Understanding

Andy Zou, Collin Burns, Dan Hendrycks, Dawn Song, Mantas Mazeika, Steven Basart · 2020 · cs.CY · arXiv 2009.03300

Mixed citation behavior. The most common role is background (45% of classified citations).

Cited by 392 Pith papers.

abstract

We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.

citation-role summary

background 30 · dataset 29 · method 5 · baseline 3
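
As a quick check on how the role counts relate to the 45% figure quoted for background, the sketch below converts the counts into percentage shares. It assumes the percentage is taken over the 67 citations that received a role in this summary (30 + 29 + 5 + 3); the page does not state the denominator explicitly. The names role_counts and role_shares are illustrative and not part of any Pith tooling.

```python
# Role counts copied from the citation-role summary above.
role_counts = {"background": 30, "dataset": 29, "method": 5, "baseline": 3}


def role_shares(counts):
    """Return each role's share of the classified citations, in percent."""
    total = sum(counts.values())
    return {role: 100.0 * n / total for role, n in counts.items()}


shares = role_shares(role_counts)
top_role = max(shares, key=shares.get)
total_classified = sum(role_counts.values())

# Assumption: the denominator is the number of role-classified citations.
print(f"{top_role}: {shares[top_role]:.0f}% of {total_classified} classified citations")
# prints "background: 45% of 67 classified citations"
```

Under that assumption, background (30 of 67) and dataset (29 of 67) are nearly tied, which fits the "mixed citation behavior" label.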

citation-polarity summary

background 30 · use dataset 27 · use method 5 · baseline 3 · unclear 2

representative citing papers

Unsteady Metrics and Benchmarking Cultures of AI Model Builders

cs.AI · 2026-05-13 · accept · novelty 8.0

AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.

EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data

econ.EM · 2026-05-13 · accept · novelty 8.0

EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.

Crafting Reversible SFT Behaviors in Large Language Models

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.

ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

cs.SE · 2026-01-31 · accept · novelty 8.0 · 2 refs

MCP-Atlas is a new benchmark with 1000 tasks on production MCP servers that uses claim-level scoring to evaluate LLM agents on realistic multi-step tool-use competency.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

cs.HC · 2024-05-13 · conditional · novelty 8.0

AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth of its original level in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences across models.

Toward Calibrated, Fair, and accurate Deepfake Detection

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.

From Talking Words to Sharing Thoughts: Scalable Multi-LLM Aggregation via Structured Message Passing

cs.GT · 2026-05-29 · unverdicted · novelty 7.0

A bipartite factor graph with a message-passing protocol and asymmetric damping aggregates multi-LLM predictions, cutting token use by 97% and API calls by 6x while outperforming baselines on MMLU, MMLU-Pro, GPQA, and MedMCQA.

Beyond Static Dialogues: Benchmarking Realistic, Heterogeneous, and Evolving Long-Term Memory

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

RHELM is a benchmark for LLM long-term memory with dynamic profiles, heterogeneous sources, and 27 memory characteristics that reveals weaknesses in existing models for multi-source aggregation and contextual reasoning.

ReactBench: A Cause-Driven Benchmark for Multimodal Hallucination via Systematic Evaluation

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

ReactBench is a new benchmark with four cause-targeted tasks that uses adversarial images, hallucination-inducing queries, and Chain-of-Thought analysis to expose specific failure modes in current multimodal large language models.

K-FinHallu: A Hallucination Detection Benchmark for Multi-Turn RAG in Korean Finance

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

K-FinHallu is the first multi-turn Korean financial RAG hallucination benchmark; frontier LLMs struggle especially on justified abstention while an 8B fine-tuned model reaches competitive performance.

ConMoE: Expert-Pool Consolidation via Prototype Reassignment for MoE Compression

cs.AI · 2026-05-28 · unverdicted · novelty 7.0

ConMoE consolidates MoE experts into a smaller prototype pool via deterministic remapping based on contribution and replaceability, matching or beating pruning/merging baselines at 25-50% reduction on three models.

Can LLMs Use Linguistic Uncertainty Markers to Reliably Reflect Intrinsic Confidence?

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

LLMs struggle to associate epistemic markers with stable internal confidence levels across distributions, even under model-centric interpretations, while maintaining somewhat consistent marker rankings.

SiDP: Memory-Efficient Data Parallelism for Offline LLM Inference

cs.DC · 2026-05-27 · unverdicted · novelty 7.0

SiDP distributes model weights across a DP group with WaS and CaS modes to increase KV cache capacity by up to 1.8x and end-to-end throughput by up to 1.5x over vLLM on H20/H200/B200 GPUs for offline LLM inference.

Self-Policy Distillation via Capability-Selective Subspace Projection

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

Self-Policy Distillation extracts a capability subspace from model gradients on correctness tokens, projects KV activations into it for self-generation, and fine-tunes LLMs to achieve up to 13-16% gains over baselines without external signals.

X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation

cs.LG · 2026-05-20 · conditional · novelty 7.0

X-Token proposes projection-guided P-KL and H-KL losses to fix uncommon-token suppression and over-conservative matching in logit-based cross-tokenizer distillation, yielding gains over GOLD on Llama-3.2-1B.

RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

RxEval benchmark shows frontier LLMs reach at most 46.10% exact match on prescription-level medication, dose, and route selection from real patient trajectories.

Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation

cs.CL · 2026-05-14 · unverdicted · novelty 7.0

New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.

Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

TFlow enables multi-agent LLMs to collaborate via transient low-rank LoRA perturbations derived from sender activations, yielding up to 8.5 accuracy gains and 83% token reduction versus text-based baselines on Qwen3-4B models.

Inducing Artificial Uncertainty in Language Models

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.

Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.

Beyond Parameter Aggregation: Semantic Consensus for Federated Fine-Tuning of LLMs

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Semantic consensus on model outputs for public prompts enables federated LLM fine-tuning that matches parameter-aggregation baselines with orders-of-magnitude lower communication.

citing papers explorer

Task-Aware Calibration: Provably Optimal Decoding in LLMs

cs.LG · 2026-05-11 · unverdicted · none · ref 19 · internal anchor

Task calibration aligns LLM distributions in latent task spaces to make MBR decoding provably optimal and improve generation quality.

Temporally Extended Mixture-of-Experts Models

cs.LG · 2026-04-22 · unverdicted · none · ref 18 · internal anchor

Temporally extended MoE layers using the option-critic framework with deliberation costs cut switching rates below 5% while retaining most capability on MATH, MMLU, and MMMLU.

Kimi Linear: An Expressive, Efficient Attention Architecture

cs.CL · 2025-10-30 · unverdicted · none · ref 37 · internal anchor

Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

Efficient Pre-Training with Token Superposition

cs.CL · 2026-05-07 · unverdicted · none · ref 24 · 2 links · internal anchor

Token-Superposition Training combines multiple tokens into bags for multi-hot cross-entropy pre-training followed by a recovery phase, yielding up to 2.5x reduction in training time at 10B scale under equal-loss conditions.

Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models

cs.CL · 2026-05-12 · unverdicted · none · ref 18 · internal anchor

Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models.
