The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
Pith reviewed 2026-05-15 19:51 UTC · model grok-4.3
The pith
The WMDP benchmark publicly measures hazardous knowledge in LLMs, and the RMU unlearning method reduces performance on it while preserving general capabilities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is twofold. First, the publicly released WMDP benchmark provides an open proxy for measuring hazardous knowledge in biosecurity, cybersecurity, and chemical security. Second, the RMU unlearning technique, which operates by controlling model representations, can selectively reduce model performance on these questions while maintaining general capabilities in biology and computer science.
What carries the argument
RMU, a representation-control unlearning method that targets and suppresses specific hazardous knowledge in model activations.
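This summary does not spell out RMU's objective, so the following is only a minimal sketch of what a representation-control loss can look like. It assumes (an assumption, not something stated here) that activations on hazardous text are pushed toward a fixed, scaled random vector while activations on benign text are held close to those of a frozen copy of the model. Every function name and parameter below (`make_control_vector`, `representation_control_loss`, `alpha`, `scale`) is hypothetical and is not taken from the authors' code.

```python
import random

def mean_squared_distance(acts, targets):
    """Average, over tokens, of the squared L2 distance between vectors."""
    total = 0.0
    for a, t in zip(acts, targets):
        total += sum((ai - ti) ** 2 for ai, ti in zip(a, t))
    return total / len(acts)

def make_control_vector(dim, scale, seed=0):
    """Random direction normalized to unit length, then multiplied by `scale`.
    This is a hypothetical steering target for hazardous-text activations."""
    rng = random.Random(seed)
    raw = [rng.random() for _ in range(dim)]
    norm = sum(x * x for x in raw) ** 0.5
    return [scale * x / norm for x in raw]

def representation_control_loss(forget_acts, retain_acts,
                                frozen_retain_acts, control, alpha):
    """Forget term: pull hazardous-text activations toward `control`.
    Retain term: keep benign-text activations near the frozen model's.
    `alpha` (assumed hyperparameter) weights the retain term."""
    forget_targets = [control] * len(forget_acts)
    forget_loss = mean_squared_distance(forget_acts, forget_targets)
    retain_loss = mean_squared_distance(retain_acts, frozen_retain_acts)
    return forget_loss + alpha * retain_loss
```

In a real training loop the activations would come from a chosen layer of the model being updated and of its frozen copy; plain lists of floats stand in for them here so that the arithmetic is easy to follow.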
Load-bearing premise
Performance on WMDP questions reliably indicates real-world ability to assist in developing weapons of mass destruction.
What would settle it
A model that scores near random on WMDP but still supplies accurate step-by-step guidance for producing a biological weapon would show the benchmark fails to capture actual hazardous capability.
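"Near random" can be made concrete. The sketch below assumes a fixed number of answer choices per question (four by default; the number of choices is not stated in this summary, so it is left as a parameter) and uses a normal approximation to the binomial distribution. The function names and the 1.96 threshold are illustrative choices, not part of the paper.

```python
import math

def chance_accuracy(num_choices):
    """Expected accuracy of uniform random guessing."""
    return 1.0 / num_choices

def z_score_vs_chance(num_correct, num_questions, num_choices=4):
    """Standardized gap between observed accuracy and chance accuracy."""
    p0 = chance_accuracy(num_choices)
    observed = num_correct / num_questions
    standard_error = math.sqrt(p0 * (1.0 - p0) / num_questions)
    return (observed - p0) / standard_error

def is_near_random(num_correct, num_questions, num_choices=4, z_threshold=1.96):
    """True when accuracy is statistically indistinguishable from guessing
    at the chosen threshold (assumed two-sided, roughly the 5% level)."""
    z = z_score_vs_chance(num_correct, num_questions, num_choices)
    return abs(z) < z_threshold
```

If each of the 3,668 questions has four choices, chance performance corresponds to 917 correct answers. Passing this check says nothing about open-ended capability, which is exactly the gap the scenario above describes.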
Original abstract
The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing further research into mitigating risk. Furthermore, they focus on only a few, highly specific pathways for malicious use. To fill these gaps, we publicly release the Weapons of Mass Destruction Proxy (WMDP) benchmark, a dataset of 3,668 multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP was developed by a consortium of academics and technical consultants, and was stringently filtered to eliminate sensitive information prior to public release. WMDP serves two roles: first, as an evaluation for hazardous knowledge in LLMs, and second, as a benchmark for unlearning methods to remove such hazardous knowledge. To guide progress on unlearning, we develop RMU, a state-of-the-art unlearning method based on controlling model representations. RMU reduces model performance on WMDP while maintaining general capabilities in areas such as biology and computer science, suggesting that unlearning may be a concrete path towards reducing malicious use from LLMs. We release our benchmark and code publicly at https://wmdp.ai
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the publicly released WMDP benchmark, a dataset of 3,668 multiple-choice questions spanning biosecurity, cybersecurity, and chemical security, developed as a proxy for hazardous knowledge in LLMs. It proposes RMU, a representation-control unlearning method, and reports that RMU lowers WMDP accuracy while preserving general capabilities in biology and computer science, with the benchmark and code released at https://wmdp.ai.
Significance. If the empirical claims hold, the work is significant because it supplies the first open benchmark for hazardous capabilities, directly addressing the limitation that existing evaluations are private. The public release of both the dataset and RMU code enables reproducible research on unlearning, and the suggestion that targeted representation control can reduce proxy risk without broad capability loss provides a concrete, testable direction for AI safety.
major comments (2)
- [§3] §3 (Benchmark Construction and Filtering): The stringently filtered MCQ items remove sensitive details by design, yet the central claim that WMDP serves as a proxy for real-world malicious use rests on the untested assumption that reduced MCQ accuracy implies reduced ability to synthesize or apply hazardous knowledge in open-ended settings. No ablation or external validation (e.g., expert red-teaming on retained items) is provided to show that the retained questions remain sufficient for misuse.
- [§4] §4 (RMU Experiments): The abstract and results claim that RMU reduces WMDP performance while maintaining general biology/CS capabilities, but no quantitative tables, error bars, baseline comparisons (e.g., gradient ascent, fine-tuning), or statistical tests are referenced in the provided summary; without these, the preservation claim cannot be evaluated for robustness or effect size.
minor comments (1)
- Add explicit section numbers and equation references throughout for easier navigation; several cross-references in the methods appear to rely on implicit numbering.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review. We address each major comment point by point below, indicating where revisions will be incorporated.
-
Referee: [§3] §3 (Benchmark Construction and Filtering): The stringently filtered MCQ items remove sensitive details by design, yet the central claim that WMDP serves as a proxy for real-world malicious use rests on the untested assumption that reduced MCQ accuracy implies reduced ability to synthesize or apply hazardous knowledge in open-ended settings. No ablation or external validation (e.g., expert red-teaming on retained items) is provided to show that the retained questions remain sufficient for misuse.
Authors: We agree that WMDP functions as a proxy and that direct validation linking MCQ accuracy to open-ended synthesis capabilities would strengthen the work. However, performing such validations (e.g., expert red-teaming on retained items) would require testing actual hazardous knowledge application, which raises insurmountable ethical and safety barriers. The questions were developed and filtered in consultation with domain experts precisely to retain proxy relevance while eliminating actionable details. The public release at wmdp.ai is intended to enable safe, community-driven follow-up studies. We will add an expanded limitations subsection discussing the proxy assumption and outlining directions for future validation. revision: partial
-
Referee: [§4] §4 (RMU Experiments): The abstract and results claim that RMU reduces WMDP performance while maintaining general biology/CS capabilities, but no quantitative tables, error bars, baseline comparisons (e.g., gradient ascent, fine-tuning), or statistical tests are referenced in the provided summary; without these, the preservation claim cannot be evaluated for robustness or effect size.
Authors: The full manuscript (Section 4 and associated tables) already contains the requested details: accuracy tables showing WMDP drops (with standard deviations across 3–5 seeds), direct comparisons to gradient ascent and fine-tuning baselines, MMLU biology/CS subset scores demonstrating capability preservation, and statistical tests (paired t-tests) confirming significance. These results are referenced in the results narrative. We will revise the abstract and introduction summary to explicitly cite the relevant tables and effect sizes so the quantitative support is immediately visible. revision: yes
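The paired t-test mentioned in the rebuttal can be sketched as follows. This is a generic textbook computation, not the authors' analysis code; it assumes one accuracy value per seed before and after unlearning, with seeds matched between the two lists. The function name `paired_t_statistic` is illustrative.

```python
import math
import statistics

def paired_t_statistic(before, after):
    """Return (t, degrees_of_freedom) for matched per-seed scores.
    A positive t means scores were higher before than after."""
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    if len(before) < 2:
        raise ValueError("need at least two paired observations")
    diffs = [b - a for b, a in zip(before, after)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

With only 3 to 5 seeds the test has 2 to 4 degrees of freedom, so the critical value is large and reporting the per-seed differences alongside the t statistic helps readers judge effect size.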
Circularity Check
No significant circularity: empirical benchmark release and experimental results
The paper releases a public MCQ benchmark (WMDP) and evaluates an unlearning method (RMU) via direct performance measurements on held-out test sets and general capability benchmarks. No mathematical derivation chain exists; results are reported from experiments rather than fitted parameters or self-referential definitions. Central claims rest on observed accuracy drops on WMDP items and maintained scores on MMLU-style biology/CS questions, which are independently verifiable and not reduced to the inputs by construction. Self-citations, if present, are not load-bearing for the empirical findings.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: WMDP questions serve as a valid proxy for hazardous knowledge without leaking sensitive information after stringent filtering.
Forward citations
Cited by 29 Pith papers
-
Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models
Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.
-
Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning
NPO enables stable unlearning of 50%+ training data in LLMs on TOFU by making collapse exponentially slower than gradient ascent, preserving sensible outputs where prior methods fail.
-
Inducing Artificial Uncertainty in Language Models
Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.
-
Erase Persona, Forget Lore: Benchmarking Multimodal Copyright Unlearning in Large Vision Language Models
CoVUBench is the first benchmark framework for evaluating multimodal copyright unlearning in LVLMs via synthetic data, systematic variations, and a dual protocol for forgetting efficacy and utility preservation.
-
Jailbroken Frontier Models Retain Their Capabilities
Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.
-
Is your algorithm unlearning or untraining?
Machine unlearning conflates reversing the influence of specific training examples (untraining) with removing the full underlying distribution or behavior (unlearning).
-
State Contamination in Memory-Augmented LLM Agents
Toxic context can be laundered into memory summaries that stay below toxicity thresholds while still driving higher downstream toxicity in LLM agents compared to neutral baselines.
-
Robust LLM Unlearning Against Relearning Attacks: The Minor Components in Representations Matter
Targeting minor components in LLM representations during unlearning yields substantially better resistance to relearning attacks than prior methods.
-
Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies
A separable expert architecture uses base models, LoRA adapters, and deletable per-user proxies to enable privacy-preserving personalization and deterministic unlearning in LLMs.
-
CAP: Controllable Alignment Prompting for Unlearning in LLMs
CAP is a reinforcement-learning-driven prompt optimization framework that suppresses target knowledge in LLMs while preserving general capabilities, enabling reversible unlearning without any parameter updates.
-
CAP: Controllable Alignment Prompting for Unlearning in LLMs
CAP optimizes prompts via reinforcement learning to selectively unlearn target knowledge in LLMs while preserving general capabilities, without any parameter updates and with reversible revocation.
-
CAP: Controllable Alignment Prompting for Unlearning in LLMs
CAP enables reversible unlearning of targeted knowledge in LLMs through optimized prompts generated via reinforcement learning, without any parameter updates.
-
Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks
Different LLM jailbreak techniques achieve similar harmful compliance but lead to distinct behavioral side effects and mechanistic changes.
-
WIN-U: Woodbury-Informed Newton-Unlearning as a retain-free Machine Unlearning Framework
WIN-U delivers a retain-free unlearning update that approximates the gold-standard retrained model via a Woodbury-informed Newton step using only forget-set curvature information.
-
Latent Instruction Representation Alignment: defending against jailbreaks, backdoors and undesired knowledge in LLMs
LIRA aligns latent instruction representations in LLMs to defend against jailbreaks, backdoors, and undesired knowledge, blocking over 99% of PEZ attacks and achieving optimal WMDP forgetting.
-
Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts
Swiss-Bench 003 extends an existing Swiss LLM assessment with two new dimensions and evaluates ten models on 808 items, finding high self-graded reliability scores but low adversarial security scores.
-
Efficient machine unlearning with minimax optimality
ULS provides minimax-optimal estimation of remaining-data parameters in machine unlearning with limited access and decomposes error into oracle plus unlearning cost terms.
-
Chain-of-Authorization: Embedding authorization into large language models
LLMs fine-tuned to output authorization trajectories as a prerequisite for responses achieve high rejection rates for unauthorized prompts while preserving utility in allowed scenarios.
-
SafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyond
SafeSci creates a large objective benchmark and training resource that reveals safety weaknesses in current LLMs for science and demonstrates measurable improvement through targeted fine-tuning.
-
The Impact of Off-Policy Training Data on Probe Generalisation
Off-policy training data for LLM behavior probes causes significant generalization failures especially for intent-based behaviors like deception, and performance on coerced incentivised data correlates with real on-po...
-
Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in LLM Unlearning
Downgrading optimizers to lower-information variants during LLM unlearning yields more robust forgetting on MUSE and WMDP benchmarks by converging to harder-to-perturb loss basins.
-
OFMU: Optimization-Driven Framework for Machine Unlearning
A penalty-based bi-level optimization framework for machine unlearning that decorrelates forget and retention gradients via inner maximization and restores utility via outer minimization, with convergence guarantees a...
-
Benchmarking Misuse Mitigation Against Covert Adversaries
Develops the BSD data generation pipeline and two new datasets to evaluate decomposition attacks as effective misuse enablers and stateful defenses as a countermeasure in language model safety.
-
Do Linear Probes Generalize Better in Persona Coordinates?
Probes on persona principal components from contrastive prompts generalize better than raw activation probes for harmful behaviors across 10 datasets.
-
Do Linear Probes Generalize Better in Persona Coordinates?
Persona axes derived from contrastive prompts and PCA yield linear probes that generalize better than raw-activation probes across 10 datasets for deception and sycophancy.
-
Revisiting the Past: Data Unlearning with Model State History
MSA performs data unlearning in LLMs by arithmetic operations on prior model checkpoints to remove targeted datapoint influence, with experiments showing competitive or better results than existing unlearning methods.
-
Humanity's Last Exam
Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
-
Risk Reporting for Developers' Internal AI Model Use
A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.
-
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Gemini 2.5 Pro and Flash models are presented as achieving frontier performance in reasoning, coding, and long-context multimodal tasks while spanning a cost-capability Pareto curve.
Reference graph
Works this paper leans on
-
[1]
doi: 10.48550/arXiv.2402.10058. Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, and Dylan Hadfield-Menell. Eight methods to evaluate robust unlearning in llms, 2024. Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C. Lipton, and J. Zico Kolter. Tofu: A task of fictitious unlearning for llms, 2024. Mantas Mazeika, Long Phan, Xuwang Yin, And...
-
[2]
Overview. How is this work intended to reduce existential risks from advanced AI systems? Answer: This work aims to mitigate existential risks posed by the malicious use of LLMs in developing bioweapons and cyber weapons. WMDP serves both as a metric for evaluating the presence of hazardous knowledge, and as a benchmark for testing unlearning methods. We ...
-
[3]
Direct Effects. If this work directly reduces existential risks, what are the main hazards, vulnerabilities, or failure modes that it directly affects? Answer: WMDP increases the barrier of entry for malicious actors to cause catastrophic harm. It decreases access to models with hazardous biological or cyber capabilities, reducing the number of malicio...
-
[4]
Diffuse Effects. If this work reduces existential risks indirectly or diffusely, what are the main contributing factors that it affects? Answer: Unlearning on WMDP reduces the risks of language-model-aided cyberattacks, particularly from low-skilled malicious actors. Cyberattacks, particularly on critical infrastructure, could be catastrophic. They are ...
-
[5]
What’s at Stake? What is a future scenario in which this research direction could prevent the sudden, large-scale loss of life? If not applicable, what is a future scenario in which this research direction could be highly beneficial? Answer: This directly reduces x-risks associated with the malicious use of language models in developing weapons of mass destructi...
-
[6]
Result Fragility. Do the findings rest on strong theoretical assumptions; are they not demonstrated using leading-edge tasks or models; or are the findings highly sensitive to hyperparameters? □
-
[7]
Problem Difficulty. Is it implausible that any practical system could ever markedly outperform humans at this task? ⊠
-
[8]
Human Unreliability. Does this approach strongly depend on handcrafted features, expert supervision, or human reliability? □
-
[9]
Competitive Pressures. Does work towards this approach strongly trade off against raw intelligence, other general capabilities, or economic utility? □ E.2 Safety-Capabilities Balance In this section, please analyze how this work relates to general capabilities and how it affects the balance between safety and hazards from general capabilities
-
[10]
Overview. How does this improve safety more than it improves general capabilities? Answer: Unlearning does not improve general capabilities; rather, it removes specific model capabilities while improving inherent model safety
-
[11]
Red Teaming. What is a way in which this hastens general capabilities or the onset of x-risks? Answer: Although WMDP is constructed as a benchmark for measuring and reducing inherent model hazards, it may inadvertently serve as a roadmap for malicious use, hastening the onset of x-risks by lowering the barrier for causing catastrophe. To reduce these risk...
-
[12]
General Tasks. Does this work advance progress on tasks that have been previously considered the subject of usual capabilities research? □
-
[13]
General Goals. Does this improve or facilitate research towards general prediction, classification, state estimation, efficiency, scalability, generation, data compression, executing clear instructions, helpfulness, informativeness, reasoning, planning, researching, optimization, (self-)supervised learning, sequential decision making, recursive self-i...
-
[14]
Correlation with General Aptitude. Is the analyzed capability known to be highly predicted by general cognitive ability or educational attainment? □
-
[15]
Safety via Capabilities. Does this advance safety along with, or as a consequence of, advancing other capabilities or the study of AI? □ E.3 Elaborations and Other Considerations
-
[16]
Other. What clarifications or uncertainties about this work and x-risk are worth mentioning? Answer: While unlearning is an important intervention for reducing model hazards, unlearning may reduce the defensive, or beneficial, applications in those areas. Unlearning should be complemented with other interventions that reduce risk (Appendix D).