super hub Mixed citations

Measuring Massive Multitask Language Understanding

Andy Zou, Collin Burns, Dan Hendrycks, Dawn Song, Mantas Mazeika, Steven Basart · 2020 · cs.CY · arXiv 2009.03300

Mixed citation behavior. Most common role is background (45%).

533 Pith papers citing it

Background 45% of classified citations

open full Pith review browse 533 citing papers more from Andy Zou arXiv PDF

abstract

We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 31 dataset 30 method 5 baseline 3

citation-polarity summary

background 31 use dataset 28 use method 5 baseline 3 unclear 2

claims ledger

abstract We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models

authors

Andy Zou Collin Burns Dan Hendrycks Dawn Song Mantas Mazeika Steven Basart

co-cited works

representative citing papers

Bad company corrupts good morals: Understanding and Measuring Narrative-Induced Moral Reasoning Degradation in LLMs

cs.CY · 2026-06-27 · unverdicted · novelty 8.0

Negative narrative immersion causes 12-31% drops in LLM moral accuracy and produces structured shifts that appear in downstream applications.

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

Sumi: Open Uniform Diffusion Language Model from Scratch

cs.CL · 2026-06-17 · unverdicted · novelty 8.0

Sumi is an openly released 7B parameter uniform diffusion language model pretrained from scratch on 1.5T tokens that matches autoregressive models on several benchmarks.

Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

cs.AI · 2026-05-15 · unverdicted · novelty 8.0 · 2 refs

Presents the first fully open pipeline for clinical LLMs by unifying eight public QA datasets with three clinician-vetted synthetic extensions and applying it to five base models to achieve benchmark gains while maintaining auditability.

Unsteady Metrics and Benchmarking Cultures of AI Model Builders

cs.AI · 2026-05-13 · accept · novelty 8.0

AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.

EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data

econ.EM · 2026-05-13 · accept · novelty 8.0

EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.

Crafting Reversible SFT Behaviors in Large Language Models

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.

ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

cs.SE · 2026-01-31 · accept · novelty 8.0 · 2 refs

MCP-Atlas is a new benchmark with 1000 tasks on production MCP servers that uses claim-level scoring to evaluate LLM agents on realistic multi-step tool-use competency.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

cs.HC · 2024-05-13 · conditional · novelty 8.0

AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences across models.

Will Scaling Improve Social Simulation with LLMs?

cs.CL · 2026-07-02 · conditional · novelty 7.0

Scaling improves LLM social simulation fidelity in most opinion and behavior tasks but not for human cognitive bias calibration or low-resource domains.

Meta-Benchmarks for Financial-Services LLM Evaluation

cs.AI · 2026-07-02 · unverdicted · novelty 7.0

A meta-benchmarking framework organizes 452 LLM benchmarks into 41 O*NET Generalized Work Activities and 38 BIAN domains, using discrimination-coverage-recency weights to scale K-factors in an Elo tournament for comparable financial-services scores.

ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving

cs.DC · 2026-07-01 · unverdicted · novelty 7.0 · 2 refs

ELDR reduces median TPOT by 5.9-13.9% in PD-disaggregated MoE serving via expert signatures from prefill, K-means partitioning, and locality-band routing with KV-co-indexed signature cache.

FlipGuard: Defending Large Language Models Against Quantization-Conditioned Backdoor Attacks

cs.CR · 2026-06-27 · unverdicted · novelty 7.0

FlipGuard perturbs LLM weights prior to quantization to neutralize quantization-conditioned backdoor attacks, evaluated via the Defense Effectiveness Ratio on multiple models and quantization schemes.

Agentic Abstention: Do Agents Know When to Stop Instead of Act?

cs.AI · 2026-06-27 · unverdicted · novelty 7.0

LLM agents often fail to abstain at the right time in uncertain multi-turn tasks, and the CONVOLVE context engineering method raises timely abstention rates on WebShop from 26.7 to 57.4 without parameter updates.

GeMoE: Gating Entropy is All You Need for Uncertainty-aware Adaptive Routing in MoE-based Large Vision-Language Models

cs.CV · 2026-06-24 · unverdicted · novelty 7.0

GeMoE adaptively sets the number of experts per token via gating entropy, retaining 99.5% of static-routing performance while raising average sparsity by 36.5%.

Beyond Entropy: Learning from Token-Level Distributional Deviations for LLM Reasoning

cs.AI · 2026-06-18 · unverdicted · novelty 7.0

ICT framework applies JS divergence to token logits to select critical tokens for selective RLVR updates, claiming 4.58% average pass@4 gains on Qwen2.5 models across seven reasoning benchmarks.

SIGMA: Skill-Incidence Graphs for Compositional Multi-Agent Design

cs.MA · 2026-06-18 · unverdicted · novelty 7.0

SIGMA introduces skill-incidence graphs to compose agents from reusable skills, yielding higher average performance and robustness than topology-only baselines on reasoning and coding benchmarks.

Comparing Linear Probes with Mahalanobis Cosine Similarity

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

For balanced Gaussian class projections, OOD AUROC is a linear function of MCS to the reference probe because both are sigmoid-shaped functions of the probe SNR on test data.

Predicting Mergeability of Parameter-Efficient Fine-Tuning Updates

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

MergeProbe forecasts LoRA adapter mergeability from first-few-percent training signals and outperforms interference-aware baselines on retention while adding low overhead on a five-domain benchmark.

DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning Models

cs.CL · 2026-06-17 · unverdicted · novelty 7.0

Block-size curriculum learning trains an 8B diffusion model to achieve competitive reasoning performance on math and code benchmarks by transitioning from small to large training block sizes.

Beyond Prediction: Tail-Aware Scheduling for LLM Inference

cs.LG · 2026-06-16 · unverdicted · novelty 7.0

Presents a distribution-aware scheduling framework for LLM inference that reduces P99 TTLT by 35-50% and TTFT by 34-47% versus SRPT with perfect length knowledge using statistical signals instead of predictions.

citing papers explorer

Showing 33 of 533 citing papers.

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism cs.CL · 2024-01-05 · unverdicted · none · ref 109 · internal anchor
DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
Think in English, Answer in Korean: Efficient Adaptation of Multilingual Tool-Using Agents cs.AI · 2026-06-30 · unverdicted · none · ref 22 · internal anchor
LuckyStar 111B adapts Cohere's Command A model with four scaling techniques to improve tool-use, math reasoning, and NL2SQL in Korean-English while preserving general instruction following.
Customized Generative AI Agent for Transportation Engineering Practice: A Development and Continued Pre-training Guideline cs.AI · 2026-06-27 · unverdicted · none · ref 15 · internal anchor
A framework is described for adapting six LLMs to transportation engineering via LoRA-based continued pretraining on domain documents, with two models showing strongest results on BLEU-4 and ROUGE metrics.
Token-Operations-Oriented Inference Optimization Techniques for Large Models cs.SE · 2026-06-18 · unverdicted · none · ref 4 · internal anchor
The paper introduces a four-layer technical architecture for token-operations-oriented inference optimization in large models and reviews key technologies and industry status at each layer.
Mellum2 Technical Report cs.CL · 2026-05-29 · unverdicted · none · ref 26 · internal anchor
Mellum 2 is a 12B MoE model with 2.5B active parameters, trained on 10.6T tokens with MoE, GQA, SWA, and MTP, then post-trained into Instruct and Thinking variants, claimed competitive with 4B-14B models at 2.5B compute.
Apertus LLM Family Expansion via Distillation and Quantization cs.LG · 2026-05-27 · unverdicted · none · ref 6 · internal anchor
Distillation and quantization expand the Apertus 8B LLM into a family of models up to 4B parameters with claimed strong accuracy and cost efficiency.
OpenCompass: A Universal Evaluation Platform for Large Language Models cs.CL · 2026-05-19 · unverdicted · none · ref 7 · 2 links · internal anchor
OpenCompass is presented as a one-stop, scalable, high-concurrency LLM evaluation platform with modular architecture supporting multiple domains and evaluator types.
Phoenix-VL 1.5 Medium Technical Report cs.CL · 2026-05-11 · unverdicted · none · ref 9 · internal anchor
Phoenix-VL 1.5 Medium is a 123B-parameter natively multimodal model that reaches state-of-the-art results on Singapore multimodal, legal, and policy benchmarks after localized training on 1T+ tokens while staying competitive on global benchmarks.
Development and Preliminary Evaluation of a Domain-Specific Large Language Model for Tuberculosis Care in South Africa cs.CL · 2026-03-28 · unverdicted · none · ref 28 · internal anchor
A domain-specific LLM for TB care in South Africa, created by fine-tuning BioMistral-7B with QLoRA and GraphRAG on local guidelines, shows improved contextual alignment over the base model.
When control meets large language models: From words to dynamics eess.SY · 2026-02-03 · unverdicted · none · ref 218 · internal anchor
The paper proposes a bidirectional continuum between LLMs and control systems, covering LLM-assisted controller design, control-based LLM steering, and state-space modeling of LLMs.
LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems cs.LG · 2026-01-20 · unverdicted · none · ref 64 · internal anchor
A survey taxonomy of LLMs identifies three scaling crises and six efficiency paradigms while tracing the shift from generation to tool-using agents.
Survey on Evaluation of LLM-based Agents cs.AI · 2025-03-20 · unverdicted · none · ref 3 · internal anchor
A survey of evaluation methods for LLM-based agents from five perspectives, identifying trends toward realistic benchmarks and gaps in safety, cost-efficiency, and robustness.
Gemma 2: Improving Open Language Models at a Practical Size cs.CL · 2024-07-31 · conditional · none · ref 80 · internal anchor
Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.
The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey cs.AI · 2024-04-17 · unverdicted · none · ref 11 · internal anchor
A survey of emerging AI agent architectures that organizes single and multi-agent designs around reasoning, planning, tool use, communication, and reflection phases.
Retrieval-Augmented Generation for Large Language Models: A Survey cs.CL · 2023-12-18 · unverdicted · none · ref 146 · internal anchor
A survey of RAG paradigms, components, benchmarks, and challenges for improving LLMs on knowledge-intensive tasks.
From GPT-3 to GPT-5: Mapping their capabilities, scope, limitations, and consequences cs.AI · 2026-04-11 · unverdicted · none · ref 18 · internal anchor
The GPT family has shifted from scaled text predictors to aligned multimodal tool-oriented systems, with persistent limitations like hallucination and prompt sensitivity remaining unchanged.
The nextAI Solution to the NeurIPS 2023 LLM Efficiency Challenge cs.LG · 2026-04-10 · unverdicted · none · ref 14 · internal anchor
A competition entry achieved efficient fine-tuning of LLaMa2 70B on one GPU in 24 hours with competitive QA benchmark performance.
EXAONE 4.5 Technical Report cs.CL · 2026-04-09 · unverdicted · none · ref 22 · internal anchor
EXAONE 4.5 is a new open-weight multimodal model that matches general benchmarks and outperforms similar-scale models on document understanding and Korean contextual reasoning.
SAGE Celer 2.6 Technical Card cs.CL · 2026-03-24 · unverdicted · none · ref 6 · internal anchor
SAGE Celer 2.6 is a new line of language models with inverse reasoning training, integrated vision, and strong performance on math, coding, and South Asian language benchmarks.
Bridging Brains and Machines: A Unified Frontier in Neuroscience, Artificial Intelligence, and Neuromorphic Systems q-bio.NC · 2025-07-14 · unverdicted · none · ref 161 · internal anchor
A position and survey paper that identifies convergence between neuroscience, AGI, and neuromorphic computing and outlines four key integration challenges.
Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning cs.CL · 2025-02-05 · unverdicted · none · ref 63 · internal anchor
Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.
Small Language Models (SLMs) Can Still Pack a Punch: A survey (updated 2026) cs.CL · 2025-01-03 · unverdicted · none · ref 50 · internal anchor
A literature survey of Small Language Models (1-8B parameters) that can perform comparably or better than larger models, covering general-purpose and task-specific approaches plus creation techniques.
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey cs.CR · 2024-09-26 · unverdicted · none · ref 53 · internal anchor
Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.
A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio cs.CL · 2024-09-10 · unverdicted · none · ref 6 · internal anchor
Empirical practice of continual pre-training Llama-3 models with optimized additional language mixture ratios to enhance Chinese capabilities, showing gains in benchmarks and domains like math and coding.
River-LLM: Large Language Model Seamless Exit Based on KV Share cs.CL · 2026-04-20 · unreviewed · ref 9 · internal anchor
MemoSight: Unifying Context Compression and Multi Token Prediction for Reasoning Acceleration cs.AI · 2026-04-16 · unreviewed · ref 1 · internal anchor
Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima cs.LG · 2026-04-10 · unreviewed · ref 13 · internal anchor
Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding cs.CV · 2026-04-09 · unreviewed · ref 16 · internal anchor
Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space cs.CL · 2026-03-15 · unreviewed · ref 12 · internal anchor
Towards Efficient Large Language Reasoning Models via Extreme-Ratio Chain-of-Thought Compression cs.LG · 2026-02-09 · unreviewed · ref 14 · internal anchor
Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse cs.CL · 2026-02-01 · unreviewed · ref 17 · internal anchor
HE-SNR: Uncovering Latent Logic via Entropy for Guiding Mid-Training on SWE-bench cs.LG · 2026-01-28 · unreviewed · ref 8 · internal anchor
DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning cs.AI · 2025-11-04 · unreviewed · ref 9 · internal anchor

Measuring Massive Multitask Language Understanding

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer