arXiv: 2009.03300 · v3 · submitted 2020-09-07 · 💻 cs.CY · cs.AI · cs.CL · cs.LG

Recognition: 2 theorem links · Lean Theorem

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt


Pith reviewed 2026-05-10 12:39 UTC · model grok-4.3

classification 💻 cs.CY · cs.AI · cs.CL · cs.LG

keywords language models · multitask evaluation · world knowledge · problem solving · benchmarks · GPT-3 · model capabilities


The pith

Current language models, including the largest GPT-3, still require substantial improvements to reach expert-level accuracy on a new 57-task test of knowledge and problem solving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes a new test covering 57 tasks from fields such as elementary mathematics, US history, computer science, and law to measure a text model's multitask accuracy. Models must show extensive world knowledge and problem-solving ability to score highly on the test. Most recent models perform near random chance, while the very largest GPT-3 model improves by almost 20 percentage points on average. Yet the best models remain well below expert levels on every single task. The test also reveals lopsided results, frequent failure to recognize errors, and near-random accuracy on topics like morality and law.

Core claim

The paper establishes that a test with 57 tasks is needed to assess models' extensive world knowledge and problem solving ability, and that even the most advanced models fall short of expert performance across all these tasks, with particular weaknesses in socially important domains.

What carries the argument

A new test consisting of 57 multiple-choice tasks covering subjects from elementary mathematics to professional levels in areas such as history, computer science, and law.

If this is right

Models exhibit lopsided performance across the different tasks.
Models frequently do not know when they are wrong.
Models achieve near-random accuracy on socially important subjects such as morality and law.
The test can be used to analyze models across many tasks and identify important shortcomings.
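
The second point above, that models often do not know when they are wrong, is a statement about calibration. One common way to quantify it is expected calibration error (ECE), sketched below in Python. This is an illustration under stated assumptions and not necessarily the metric the paper reports: the function name, the ten equal-width confidence bins, and the toy inputs are all hypothetical.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE: weighted average of |mean confidence - accuracy| over confidence bins.

    confidences: the model's probability for its chosen answer, each in [0, 1].
    correct:     booleans, True where the chosen answer was right.
    The equal-width binning is an assumption of this sketch.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        raise ValueError("need at least one prediction")

    # Assign every prediction to a confidence bin; 1.0 goes in the last bin.
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, 1.0 if ok else 0.0))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


# Toy example (hypothetical data): 90% confidence but only 1 of 4 answers right.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, False, False, False]
)
```

A model that is confident and wrong most of the time, as in the toy example, gets a large ECE; a model whose confidence tracks its accuracy gets a value near zero.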

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This kind of broad test allows tracking of how model performance changes as models increase in size.
Task-by-task results could help focus additional training on areas where models are weakest.
The approach provides a consistent way to compare models on a shared set of academic and professional questions.

Load-bearing premise

The 57 chosen tasks and their expert-level thresholds accurately capture extensive world knowledge and problem solving ability without selection bias or overly narrow definitions of expertise.

What would settle it

A model that attains expert-level accuracy on all 57 tasks would show that current best models do not need further substantial improvements.

Original abstract

We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MMLU gives a practical, broad benchmark that shows even GPT-3 still falls well short of expert performance across dozens of domains.


The paper's main contribution is the MMLU test set itself: 57 tasks drawn from real exams and professional tests in areas like math, history, law, and computer science. The authors evaluate a range of models, including GPT-3, and show that while the largest version beats random chance by roughly 20 points on average, it remains far below the expert thresholds they set on every single task. The dataset is released, the sourcing is documented, and the results are consistent enough to be useful for tracking progress. That combination has made it a standard reference point for later work on model capabilities.

What stands out is the scale and the direct comparison to human expert levels rather than just accuracy on narrow tasks. The measurements line up with the data the authors provide, and there is no obvious circularity in how the scores are computed.

Task selection is the clearest soft spot. The 57 subjects are a reasonable spread but still reflect what was easy to source from existing tests, so they may over- or under-weight certain kinds of knowledge. The expert thresholds are also somewhat arbitrary, though the paper is clear that they are meant as rough targets rather than precise cutoffs. These issues affect interpretation more than the raw accuracy numbers.

The paper is aimed at researchers building or evaluating large language models who need a single number that reflects breadth rather than depth in one domain. Anyone working on scaling or safety evaluations would get value from the baseline it sets. It is solid enough to deserve a serious referee, even if later papers have refined how the benchmark is used.

Referee Report

0 major / 3 minor

Summary. The paper proposes the Massive Multitask Language Understanding (MMLU) benchmark consisting of 57 multiple-choice tasks drawn from academic and professional domains such as mathematics, history, computer science, and law. The authors evaluate a range of language models and report that most achieve near-random accuracy of approximately 25%, while the largest GPT-3 model reaches an average of 43.9% (a nearly 20-point gain over random). All evaluated models remain substantially below the stated expert-level accuracy of roughly 89% on every task, with notably weak performance on morality and law; models also exhibit lopsided subject performance and poor calibration regarding their own errors. The benchmark is positioned as a tool for measuring breadth of world knowledge and problem-solving ability.
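
The gaps quoted in the summary can be checked with a few lines of arithmetic. The Python sketch below uses only the three figures stated above; treating "roughly 89%" as the point value 89.0, and the variable and function names, are assumptions made for illustration.

```python
# Figures quoted in the referee summary (percent accuracy).
RANDOM_CHANCE = 25.0   # "approximately 25%"
GPT3_AVERAGE = 43.9    # largest GPT-3 model, average accuracy
EXPERT_LEVEL = 89.0    # "roughly 89%", treated as a point value (assumption)


def points_between(higher, lower):
    """Difference in percentage points, rounded to one decimal place."""
    return round(higher - lower, 1)


# Gain of the largest GPT-3 model over random guessing.
gain_over_random = points_between(GPT3_AVERAGE, RANDOM_CHANCE)

# Remaining distance to the stated expert level.
gap_to_expert = points_between(EXPERT_LEVEL, GPT3_AVERAGE)

# Share of the chance-to-expert range that the model has covered, in percent.
share_closed = round(
    100 * (GPT3_AVERAGE - RANDOM_CHANCE) / (EXPERT_LEVEL - RANDOM_CHANCE), 1
)

print(gain_over_random, gap_to_expert, share_closed)
```

Under these assumptions the gain over chance is 18.9 points (the "nearly 20-point" figure), the remaining gap to the expert level is 45.1 points, and the model has covered about 29.5% of the range between chance and expert accuracy.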

Significance. If the reported measurements hold, this work supplies a valuable, broad-coverage benchmark that enables systematic tracking of language-model progress across many domains simultaneously. Notable strengths include the careful sourcing of questions from real exams and textbooks, the public release of the full dataset for reproducibility, the consistent evaluation protocol applied to multiple model families, and the inclusion of clear random-chance baselines. These elements allow the community to replicate and extend the results, and the empirical gaps documented have already shaped subsequent scaling and evaluation research.

minor comments (3)

[Section 3] Additional quantitative detail on how expert-level accuracy thresholds were estimated for each task (e.g., number of experts, agreement statistics) would help readers evaluate the size of the reported gaps to expert performance.
[Table 1, Section 4] The random baseline is uniformly listed near 25%; explicitly confirming that every task uses four options and noting any exceptions would remove minor ambiguity.
[Section 5] The discussion of lopsided performance and poor self-knowledge would be strengthened by reporting per-subject standard deviations or statistical tests of imbalance.
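
The second and third comments both ask for simple statistics. A minimal Python sketch of what they might look like follows: the chance baseline for a k-option question is 1/k, and the spread across subjects can be summarized with a standard deviation. The function names and the per-subject accuracies are hypothetical placeholders, not values from the paper.

```python
from statistics import mean, pstdev


def random_baseline(num_options):
    """Expected accuracy of uniform random guessing on a k-option question."""
    if num_options < 2:
        raise ValueError("a multiple-choice question needs at least two options")
    return 1.0 / num_options


def subject_spread(accuracy_by_subject):
    """Summarize how uneven accuracy is across subjects.

    Uses the population standard deviation because the subjects are treated
    as the full set of interest, not a sample (an assumption of this sketch).
    """
    values = list(accuracy_by_subject.values())
    return {
        "mean": mean(values),
        "std": pstdev(values),
        "min": min(values),
        "max": max(values),
    }


# Hypothetical per-subject accuracies, for illustration only.
example = {"subject_a": 0.30, "subject_b": 0.50, "subject_c": 0.70}
summary = subject_spread(example)
four_option_chance = random_baseline(4)
```

With four options the baseline is 0.25, matching the "near 25%" figure in the comment; any task with a different number of options would need its own baseline.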

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our work and for recommending minor revision. The referee's summary accurately captures the MMLU benchmark, its construction from real academic and professional sources, the evaluation results across model families, and the documented gaps relative to expert performance. We are pleased that the strengths of careful sourcing, public data release, consistent protocols, and random baselines are highlighted. As the report lists no specific major comments or requested changes, we have no points requiring detailed rebuttal or disagreement.

Circularity Check

0 steps flagged

No significant circularity identified


The paper constructs a new benchmark (MMLU) with 57 tasks drawn from existing exams and reports direct empirical accuracy measurements for language models against random-chance baselines and stated expert thresholds. No equations, derivations, or first-principles predictions appear; the central claims consist solely of observed performance numbers on the released test set. Self-citations, if present, are incidental background references and do not serve as load-bearing justification for any result. The evaluation chain is therefore self-contained and externally verifiable via the dataset.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, invented entities, or non-standard axioms; the work rests on the standard assumption that multiple-choice accuracy on curated exam questions measures relevant knowledge.

pith-pipeline@v0.9.0 · 5472 in / 955 out tokens · 34780 ms · 2026-05-10T12:39:02.575757+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/PhiForcing.lean · phi_equation · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.

IndisputableMonolith/Foundation/DimensionForcing.lean · dimension_forced · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Unsteady Metrics and Benchmarking Cultures of AI Model Builders
cs.AI 2026-05 accept novelty 8.0

AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.
HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts
cs.LG 2026-05 unverdicted novelty 8.0

HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-wei...
REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
cs.CL 2026-05 unverdicted novelty 8.0

REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
Crafting Reversible SFT Behaviors in Large Language Models
cs.LG 2026-05 unverdicted novelty 8.0

LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.
ArgBench: Benchmarking LLMs on Computational Argumentation Tasks
cs.CL 2026-04 unverdicted novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
Large Language Diffusion Models
cs.CL 2025-02 unverdicted novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation
cs.LG 2026-05 unverdicted novelty 7.0

RxEval benchmark shows frontier LLMs reach at most 46.10% exact match on prescription-level medication, dose, and route selection from real patient trajectories.
Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation
cs.CL 2026-05 unverdicted novelty 7.0

New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.
Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights
cs.CL 2026-05 unverdicted novelty 7.0

TFlow enables multi-agent LLMs to collaborate via transient low-rank LoRA perturbations derived from sender activations, yielding up to 8.5 accuracy gains and 83% token reduction versus text-based baselines on Qwen3-4...
Inducing Artificial Uncertainty in Language Models
cs.CL 2026-05 unverdicted novelty 7.0

Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.
Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation
cs.AI 2026-05 unverdicted novelty 7.0

PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
cs.CL 2026-05 unverdicted novelty 7.0

TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.
Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models
cs.CL 2026-05 unverdicted novelty 7.0

dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.
Beyond Parameter Aggregation: Semantic Consensus for Federated Fine-Tuning of LLMs
cs.LG 2026-05 unverdicted novelty 7.0

Semantic consensus on model outputs for public prompts enables federated LLM fine-tuning that matches parameter-aggregation baselines with orders-of-magnitude lower communication.
Task-Aware Calibration: Provably Optimal Decoding in LLMs
cs.LG 2026-05 unverdicted novelty 7.0

Task calibration aligns LLM distributions in latent task spaces to make MBR decoding provably optimal and improve generation quality.
Skill Description Deception Attack against Task Routing in Internet of Agents
cs.MA 2026-05 conditional novelty 7.0

Malicious agents can deceive LLM-based task routers in Internet of Agents systems by generating fake skill descriptions, achieving up to 98% success rate across nine domains.
EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild
cs.AI 2026-05 conditional novelty 7.0

EpiGraph creates a heterogeneous epilepsy knowledge graph that boosts LLM performance on clinical reasoning tasks by 30-41% in pharmacogenomics when used with Graph-RAG.
EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild
cs.AI 2026-05 unverdicted novelty 7.0

EpiGraph is a new epilepsy knowledge graph with 24,324 entities and 32,009 triplets that improves LLM performance on clinical tasks by up to 41% when used in Graph-RAG.
LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models
cs.LG 2026-05 unverdicted novelty 7.0

LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.
BadDLM: Backdooring Diffusion Language Models with Diverse Targets
cs.CR 2026-05 unverdicted novelty 7.0

BadDLM implants effective backdoors in diffusion language models across concept, attribute, alignment, and payload targets by exploiting denoising dynamics while preserving clean performance.
DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules
cs.AI 2026-05 unverdicted novelty 7.0

DiagnosticIQ benchmark shows frontier LLMs perform similarly on standard rule-to-action tasks but lose substantial accuracy under distractor expansion and condition inversion, pointing to calibration as the key deploy...
Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models
cs.CL 2026-05 unverdicted novelty 7.0

Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding bette...
Mitigating Many-shot Jailbreak Attacks with One Single Demonstration
cs.CR 2026-05 conditional novelty 7.0

A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.
Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic
cs.LG 2026-05 unverdicted novelty 7.0

Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.
Dataset Watermarking for Closed LLMs with Provable Detection
cs.LG 2026-05 unverdicted novelty 7.0

A new watermarking method for closed LLMs boosts random word-pair co-occurrences via rephrasing and detects the signal statistically in outputs, working reliably even when the watermarked data is only 1% of fine-tunin...
Nsanku: Evaluating Zero-Shot Translation Performance of LLMs for Ghanaian Languages
cs.CL 2026-05 accept novelty 7.0

Nsanku benchmark shows current LLMs achieve only modest zero-shot translation scores on 43 Ghanaian languages, with no model reaching both high average performance and high cross-language consistency.
SplitZip: Ultra Fast Lossless KV Compression for Disaggregated LLM Serving
cs.DC 2026-05 unverdicted novelty 7.0

SplitZip is a new GPU-friendly lossless compressor for KV cache tensors that exploits exponent redundancy to achieve over 600 GB/s compression throughput and up to 1.32x faster transfers in disaggregated LLM serving.
SciEval: A Benchmark for Automatic Evaluation of K-12 Science Instructional Materials
cs.AI 2026-04 unverdicted novelty 7.0

SciEval is a new benchmark of expert-annotated K-12 science lessons for LLM-based automatic evaluation, where zero-shot models perform poorly but fine-tuning yields up to 11% gains.
Analysis and Explainability of LLMs Via Evolutionary Methods
cs.NE 2026-04 unverdicted novelty 7.0

Evolutionary trees from LLM weights recover ground-truth training topologies and identify key datasets and layers through phenotypic analysis.
Green Shielding: A User-Centric Approach Towards Trustworthy AI
cs.CL 2026-04 unverdicted novelty 7.0

Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like t...
Breaking the Secret: Economic Interventions for Combating Collusion in Embodied Multi-Agent Systems
cs.CR 2026-04 unverdicted novelty 7.0

A mutagenic incentive mechanism reshapes payoffs in embodied MAS to induce strategic defection from collusion, achieving performance comparable to non-collusion baselines in simulations and real-world tests.
Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders
cs.LG 2026-04 unverdicted novelty 7.0

Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.
How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them
cs.CL 2026-04 unverdicted novelty 7.0

Subword tokenization impairs phonological knowledge encoding in LMs, but an IPA-based fine-tuning method restores it with minimal impact on other capabilities.
Pushing the Boundaries of Multiple Choice Evaluation to One Hundred Options
cs.CL 2026-04 unverdicted novelty 7.0

Scaling multiple-choice questions to 100 options on a Korean error detection task shows that LLM performance on conventional benchmarks overstates true competence due to shortcut strategies.
AgileLog: A Forkable Shared Log for Agents on Data Streams
cs.DC 2026-04 unverdicted novelty 7.0

AgileLog introduces forkable shared logs with cheap forking and isolation to support AI agents on data streams.
Awakening Dormant Experts: Counterfactual Routing to Mitigate MoE Hallucinations
cs.LG 2026-04 unverdicted novelty 7.0

Counterfactual Routing awakens dormant experts in MoE models via layer-wise perturbation and a new CEI metric, raising factual accuracy 3.1% on average across TruthfulQA, FACTOR, and TriviaQA without extra inference cost.
Retrieval as Generation: A Unified Framework with Self-Triggered Information Planning
cs.CL 2026-04 unverdicted novelty 7.0

GRIP integrates retrieval into autoregressive generation through self-triggered control tokens for dynamic query planning, outperforming RAG baselines on QA benchmarks with fewer parameters than GPT-4o.
EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks
cs.CV 2026-04 unverdicted novelty 7.0

EgoTL provides a new egocentric dataset with think-aloud chains and metric labels that benchmarks VLMs on long-horizon tasks and improves their planning, reasoning, and spatial grounding after finetuning.
GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing
cs.CV 2026-04 unverdicted novelty 7.0

GeoMMBench reveals deficiencies in current multimodal LLMs for geoscience tasks while GeoMMAgent demonstrates that tool-integrated agents achieve significantly higher performance.
Every Response Counts: Quantifying Uncertainty of LLM-based Multi-Agent Systems through Tensor Decomposition
cs.LG 2026-04 unverdicted novelty 7.0

MATU quantifies uncertainty in LLM multi-agent systems by turning reasoning trajectories into embedding matrices, stacking runs into a tensor, and decomposing it to separate sources of variability.
Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives
cs.CL 2026-04 unverdicted novelty 7.0

Social dynamics in LLM collectives cause representative agents to make less accurate decisions as peer pressure increases through larger adversarial groups, more capable peers, longer arguments, and persuasive styles.
FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks
cs.CL 2026-04 unverdicted novelty 7.0

FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.
DeonticBench: A Benchmark for Reasoning over Rules
cs.CL 2026-04 unverdicted novelty 7.0

DEONTICBENCH is a new benchmark of 6,232 deontic reasoning tasks from U.S. legal domains where frontier LLMs reach only ~45% accuracy and symbolic Prolog assistance plus RL training still fail to solve tasks reliably.
PolyReal: A Benchmark for Real-World Polymer Science Workflows
cs.CV 2026-04 unverdicted novelty 7.0

PolyReal benchmark shows leading MLLMs perform well on polymer knowledge reasoning but drop sharply on practical tasks like lab safety analysis and raw data extraction.
Bridging the Missing-Modality Gap: Improving Text-Only Calibration of Vision Language Models
cs.CL 2026-04 unverdicted novelty 7.0

A new Latent Imagination Module uses cross-attention to predict latent visual embeddings from text, improving accuracy and calibration of vision-language models on text-only inputs.
A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network
cs.AR 2026-03 unverdicted novelty 7.0

SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.
Path-Constrained Mixture-of-Experts
cs.LG 2026-03 unverdicted novelty 7.0

PathMoE constrains expert paths in MoE models by sharing router parameters across layer blocks, yielding more concentrated paths, better performance on perplexity and tasks, and no need for auxiliary losses.
KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context
cs.CL 2026-03 conditional novelty 7.0

KMMMU benchmark demonstrates that leading multimodal models achieve at most 52.42% accuracy on hard Korean exam questions, highlighting limitations in non-English multimodal understanding.
PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses
cs.CL 2026-03 unverdicted novelty 7.0

PEEM is a multi-criteria LLM-based evaluator for prompts and responses that aligns with standard accuracy while enabling zero-shot prompt optimization via feedback.
EvoESAP: Non-Uniform Expert Pruning for Sparse MoE
cs.LG 2026-03 conditional novelty 7.0

EvoESAP uses evolutionary search guided by a speculative-decoding-inspired ESAP metric to discover non-uniform layer-wise sparsity allocations for MoE expert pruning, improving generation accuracy up to 19.6% at 50% sparsity.
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
cs.CL 2025-11 unverdicted novelty 7.0

Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.
Moshi: a speech-text foundation model for real-time dialogue
eess.AS 2024-09 accept novelty 7.0

Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
Refusal in Language Models Is Mediated by a Single Direction
cs.LG 2024-06 accept novelty 7.0

Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
cs.CL 2024-05 unverdicted novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
cs.CL 2024-02 unverdicted novelty 7.0

LongRoPE extends LLM context windows to 2048k tokens via search for non-uniform positional interpolation, progressive fine-tuning from 256k, and short-context readjustment.
Self-Rewarding Language Models
cs.CL 2024-01 conditional novelty 7.0

Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
WizardLM: Empowering large pre-trained language models to follow complex instructions
cs.CL 2023-04 conditional novelty 7.0

WizardLM uses LLM-driven iterative rewriting to generate complex instruction data and fine-tunes LLaMA to reach over 90% of ChatGPT capacity on 17 of 29 evaluated skills.
Capabilities of GPT-4 on Medical Challenge Problems
cs.CL 2023-03 unverdicted novelty 7.0

GPT-4 exceeds the USMLE passing score by more than 20 points and outperforms both GPT-3.5 and the medically fine-tuned Med-PaLM on the MultiMedQA benchmarks.
LEMON: Learning Executable Multi-Agent Orchestration via Counterfactual Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 6.0

LEMON trains an LLM orchestrator with counterfactual-augmented GRPO to produce deployable multi-agent specifications that reach state-of-the-art results on six reasoning and coding benchmarks.
Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection
cs.AI 2026-05 unverdicted novelty 6.0

MSIFR stops faulty LLM generations early via staged rule-based checks, reducing token consumption 11-78% with no accuracy loss.

Reference graph

Works this paper leans on

300 extracted references · 200 canonical work pages · cited by 197 Pith papers · 9 internal anchors

[1]

M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents (extended abstract). J. Artif. Intell. Res., 47: 0 253--279, 2013

2013
[2]

Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi. Piqa: Reasoning about physical commonsense in natural language, 2019

2019
[3]

Y. Bisk, A. Holtzman, J. Thomason, J. Andreas, Y. Bengio, J. Chai, M. Lapata, A. Lazaridou, J. May, A. Nisnevich, N. Pinto, and J. Turian. Experience grounds language, 2020

2020
[4]

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amod...

2020
[5]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. ArXiv, abs/1803.05457, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[6]

Clark, O

P. Clark, O. Etzioni, D. Khashabi, T. Khot, B. D. Mishra, K. Richardson, A. Sabharwal, C. Schoenick, O. Tafjord, N. Tandon, S. Bhakthavatsalam, D. Groeneveld, M. Guerquin, and M. Schmitz. From 'f' to 'a' on the n.y. regents science exams: An overview of the aristo project. ArXiv, abs/1909.01958, 2019

work page arXiv 1909
[7]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. ArXiv, abs/1810.04805, 2019

work page internal anchor Pith review Pith/arXiv arXiv 2019
[8]

R. Geirhos, J.-H. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann. Shortcut learning in deep neural networks, 2020
[9]

C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. ICML, 2017
[10]

D. Hendrycks, M. Mazeika, and T. Dietterich. Deep anomaly detection with outlier exposure. ICLR, 2019a
[11]

D. Hendrycks, K. Zhao, S. Basart, J. Steinhardt, and D. Song. Natural adversarial examples. ArXiv, abs/1907.07174, 2019b
[12]

D. Hendrycks, C. Burns, S. Basart, A. Critch, J. Li, D. Song, and J. Steinhardt. Aligning AI with shared human values, 2020
[13]

L. Huang, R. L. Bras, C. Bhagavatula, and Y. Choi. Cosmos QA: Machine reading comprehension with contextual commonsense reasoning, 2019
[14]

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models, 2020
[15]

D. Khashabi, T. Khot, A. Sabharwal, O. Tafjord, P. Clark, and H. Hajishirzi. UnifiedQA: Crossing format boundaries with a single QA system, 2020
[16]

T. Khot, P. Clark, M. Guerquin, P. Jansen, and A. Sabharwal. QASC: A dataset for question answering via sentence composition, 2019
[17]

A. Kumar, P. Liang, and T. Ma. Verified uncertainty calibration, 2019
[18]

G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy. RACE: Large-scale reading comprehension dataset from examinations, 2017
[19]

Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. ArXiv, abs/1909.11942, 2020
[20]

Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692, 2019
[21]

T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. In EMNLP, 2018
[22]

Y. Ovadia, E. Fertig, J. Ren, Z. Nado, D. Sculley, S. Nowozin, J. V. Dillon, B. Lakshminarayanan, and J. Snoek. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. NeurIPS, 2019
[23]

F. Petroni, T. Rocktäschel, P. Lewis, A. Bakhtin, Y. Wu, A. H. Miller, and S. Riedel. Language models as knowledge bases?, 2019
[24]

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners, 2019
[25]

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2019
[26]

M. Richardson, C. J. Burges, and E. Renshaw. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 193–203, Seattle, Washington, USA, Oct. 2013. Association for Computational Linguistics
[27]

A. B. Sai, A. K. Mohankumar, and M. M. Khapra. A survey of evaluation metrics used for NLG systems, 2020
[28]

A. Turing. Computing machinery and intelligence, 1950
[29]

A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding, 2018
[30]

A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems, 2019
[31]

R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. HellaSwag: Can a machine really finish your sentence?, 2019
[32]

R. Zellers, A. Holtzman, E. Clark, L. Qin, A. Farhadi, and Y. Choi. Evaluating machines by their real-world language use, 2020
[33]

Realizable and unrealizable specifications of reactive systems
[34]

Diederik P. Kingma and Jimmy Ba. Adam: ..., 2014
[35]

Generative Adverarial Metric [sic]
[36]

Feature Denoising for Improving Adversarial Robustness
[37]

Intriguing properties of neural networks
[38]

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. ArXiv
[39]

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, 2012
[40]

Striving for Simplicity: The All Convolutional Net. CoRR
[41]

Adversarial Logit Pairing
[42]

Evaluating and Understanding the Robustness of Adversarial Logit Pairing
[43]

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Man... Concrete problems in ...
[44]

Google street view: Capturing the world at street level. Computer
[45]

Nicomachean Ethics
[46]

General Purpose Intelligence: Arguing the Orthogonality Thesis
[47]

Army of None: Autonomous Weapons and the Future of War
[48]

Synthesizing robust adversarial examples
[49]

Adversarial Transformation Networks: Learning to Generate Adversarial Examples. CoRR
[50]

The Arcade Learning Environment: An Evaluation Platform for General Agents (Extended Abstract). J. Artif. Intell. Res., 47
[51]

Emily M. Bender and Alexander Koller. Climbing towards ..., July 2020. doi:10.18653/v1/2020.acl-main.463
[52]

An Introduction to the Principles of Morals and Legislation
[53]

Exploiting weakly-labeled Web images to improve object classification: a domain adaptation approach
[54]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv
[55]

MixMatch: A holistic approach to semi-supervised learning
[56]

Big but Imperceptible Adversarial Perturbations via Semantic Manipulation. CoRR
[57]

Support Vector Machines Under Adversarial Label Noise
[58]

PIQA: Reasoning about Physical Commonsense in Natural Language. arXiv:1911.11641
[59]

PIQA: Reasoning about Physical Commonsense in Natural Language
[60]

Experience Grounds Language. arXiv:2004.10151
[61]

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
[62]

Superintelligence: Paths, Dangers, Strategies
[63]

Adversarial Filters of Dataset Biases
[64]

Wieland Brendel and Matthias Bethge. Approximating ..., 2018
[65]

Adversarial patch
[66]

Unrestricted Adversarial Examples. CoRR
[67]

Language Models are Few-Shot Learners. arXiv:2005.14165
[68]

Language Models are Few-Shot Learners. ArXiv
[69]

Learning to rank using gradient descent
[70]

Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods
[71]

Adversarial examples are not easily detected: Bypassing ten detection methods. Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security
[72]

Towards evaluating the robustness of neural networks. 2017 IEEE Symposium on Security and Privacy (SP)
[73]

Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien, 2010
[74]

Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer, 2002
[75]

Dual Path Networks
[76]

Deep Reinforcement Learning from Human Preferences
[77]

Learning multiple layers of features from tiny images
[78]

Describing Textures in the Wild
[79]

Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollar. Microsoft ..., 2014
[80]

Certified adversarial robustness via randomized smoothing
