The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

Adam A. Hunt; Adam Khoja; Alexander Pan; Alexandr Wang; Alex Levinson; Alice Gatti; Andrew B. Liu; Andy Zou; Anjali Gopal; Ann-Kathrin Dombrowski

arXiv: 2403.03218 · v7 · pith:GBIXPOE6 · submitted 2024-03-05 · cs.LG · cs.AI · cs.CL · cs.CY

Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel Herbert-Voss, Cort B. Breuer, Samuel Marks, Oam Patel, Andy Zou, Mantas Mazeika, Zifan Wang, Palash Oswal, Weiran Lin, Adam A. Hunt, Justin Tienken-Harder, Kevin Y. Shih, Kemper Talley, John Guan, Russell Kaplan, Ian Steneker, David Campbell, Brad Jokubaitis, Alex Levinson, Jean Wang, William Qian, Kallol Krishna Karmakar, Steven Basart, Stephen Fitz, Mindy Levine, Ponnurangam Kumaraguru, Uday Tupakula, Vijay Varadharajan, Ruoyu Wang, Yan Shoshitaishvili, Jimmy Ba, Kevin M. Esvelt, Alexandr Wang, Dan Hendrycks

Reviewed by Pith · 2026-05-15 19:51 UTC · grok-4.3 · pith:GBIXPOE6

classification: cs.LG, cs.AI, cs.CL, cs.CY

keywords: WMDP benchmark, unlearning, hazardous knowledge, LLM safety, biosecurity, cybersecurity, chemical security, malicious use

The pith

The WMDP benchmark publicly measures hazardous knowledge in LLMs, and the RMU unlearning method reduces performance on it while preserving general capabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper releases the Weapons of Mass Destruction Proxy (WMDP) benchmark, a public set of 3,668 multiple-choice questions on biosecurity, cybersecurity, and chemical security. The questions serve as a proxy for how much hazardous knowledge a large language model holds. The authors also introduce RMU, an unlearning method that works by controlling internal model representations. In their tests, RMU lowers scores on WMDP questions while maintaining general capabilities in areas such as biology and computer science. The results suggest that unlearning may be one workable route to reducing the risk that LLMs assist weapons development.
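Because WMDP is a multiple-choice benchmark, the headline number is an accuracy, and it only means something relative to the random-guessing baseline. The following is a minimal sketch of that scoring, not the authors' evaluation harness. The `MCQItem` schema and both function names are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class MCQItem:
    """One multiple-choice question (hypothetical schema, not the WMDP file format)."""
    question: str
    choices: Tuple[str, ...]
    answer: int  # index into `choices` of the correct option


def accuracy(items: Sequence[MCQItem], predictions: Sequence[int]) -> float:
    """Fraction of items whose predicted choice index matches the answer key."""
    if not items:
        raise ValueError("need at least one item")
    if len(items) != len(predictions):
        raise ValueError("one prediction per item is required")
    correct = sum(1 for item, pred in zip(items, predictions) if pred == item.answer)
    return correct / len(items)


def chance_accuracy(items: Sequence[MCQItem]) -> float:
    """Expected accuracy of uniform random guessing over each item's choices."""
    if not items:
        raise ValueError("need at least one item")
    return sum(1.0 / len(item.choices) for item in items) / len(items)
```

Reporting `accuracy` next to `chance_accuracy` makes it easier to see whether an unlearned model has dropped to guessing level or has merely lost a few points.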

Core claim

The central claim has two parts. First, the publicly released WMDP benchmark provides an open proxy for measuring hazardous knowledge in biosecurity, cybersecurity, and chemical security. Second, the RMU unlearning technique, which operates by controlling model representations, can selectively reduce model performance on these questions while maintaining general capabilities in areas such as biology and computer science.

What carries the argument

RMU, a representation-control unlearning method that targets and suppresses specific hazardous knowledge in model activations.
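This page describes RMU only as controlling model representations. One common way to write such an objective is a two-term loss: a forget term pushes activations on hazardous text toward a scaled, fixed random direction, and a retain term keeps activations on benign text close to those of a frozen copy of the model. The sketch below illustrates that idea in plain Python on toy vectors. It is an assumption-laden illustration, not the authors' implementation: the function names, the steering scale `c`, and the retain weight `alpha` are all hypothetical here.

```python
import math
import random
from typing import List, Sequence

Vector = Sequence[float]


def random_unit_vector(dim: int, seed: int = 0) -> List[float]:
    """Sample entries uniformly from [0, 1) and rescale to unit L2 norm."""
    rng = random.Random(seed)
    raw = [rng.random() for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in raw))
    if norm == 0.0:
        raise ValueError("degenerate sample; use dim >= 1 or another seed")
    return [x / norm for x in raw]


def mean_squared_distance(acts: Sequence[Vector], targets: Sequence[Vector]) -> float:
    """Squared L2 distance between paired vectors, averaged over tokens."""
    if not acts or len(acts) != len(targets):
        raise ValueError("need at least one activation and one target per activation")
    total = 0.0
    for act, target in zip(acts, targets):
        total += sum((a - t) ** 2 for a, t in zip(act, target))
    return total / len(acts)


def rmu_style_loss(
    forget_acts: Sequence[Vector],         # updated model, hazardous text
    retain_acts: Sequence[Vector],         # updated model, benign text
    frozen_retain_acts: Sequence[Vector],  # frozen model, same benign text
    control: Vector,                       # fixed random direction
    c: float,                              # steering scale (assumed hyperparameter)
    alpha: float,                          # retain weight (assumed hyperparameter)
) -> float:
    """Forget term steers toward c * control; retain term anchors to the frozen model."""
    scaled_control = [c * x for x in control]
    forget_loss = mean_squared_distance(forget_acts, [scaled_control] * len(forget_acts))
    retain_loss = mean_squared_distance(retain_acts, frozen_retain_acts)
    return forget_loss + alpha * retain_loss
```

In a real training loop the activations would come from a chosen layer of the model and the loss would be minimized by gradient descent; the toy version only shows how the two terms trade off against each other.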

Load-bearing premise

Performance on WMDP questions reliably indicates real-world ability to assist in developing weapons of mass destruction.

What would settle it

A model that scores near random on WMDP but still supplies accurate step-by-step guidance for producing a biological weapon would show the benchmark fails to capture actual hazardous capability.
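"Near random" can be made operational with a simple test against the guessing baseline. The sketch below uses a normal approximation to the binomial. It assumes four answer options per question (so chance is 0.25), which this page does not state, and the function names and the 1.96 cutoff are illustrative choices.

```python
import math


def chance_z_score(correct: int, total: int, chance: float = 0.25) -> float:
    """z-score of observed accuracy against random guessing (normal approximation)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    if not 0.0 < chance < 1.0:
        raise ValueError("chance must be strictly between 0 and 1")
    observed = correct / total
    standard_error = math.sqrt(chance * (1.0 - chance) / total)
    return (observed - chance) / standard_error


def near_random(correct: int, total: int, chance: float = 0.25, z_crit: float = 1.96) -> bool:
    """True if the observed accuracy is statistically indistinguishable from guessing."""
    return abs(chance_z_score(correct, total, chance)) <= z_crit
```

Under the four-option assumption, the standard error on 3,668 questions is roughly 0.7 percentage points, so accuracies between about 23.6% and 26.4% would count as near random at this cutoff. Passing the check says nothing about open-ended capability, which is exactly the gap the falsification scenario above targets.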

Original abstract

The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing further research into mitigating risk. Furthermore, they focus on only a few, highly specific pathways for malicious use. To fill these gaps, we publicly release the Weapons of Mass Destruction Proxy (WMDP) benchmark, a dataset of 3,668 multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP was developed by a consortium of academics and technical consultants, and was stringently filtered to eliminate sensitive information prior to public release. WMDP serves two roles: first, as an evaluation for hazardous knowledge in LLMs, and second, as a benchmark for unlearning methods to remove such hazardous knowledge. To guide progress on unlearning, we develop RMU, a state-of-the-art unlearning method based on controlling model representations. RMU reduces model performance on WMDP while maintaining general capabilities in areas such as biology and computer science, suggesting that unlearning may be a concrete path towards reducing malicious use from LLMs. We release our benchmark and code publicly at https://wmdp.ai

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

WMDP gives the field a public benchmark for hazardous knowledge, but the MCQ format and filtering make it unclear how well it tracks actual malicious-use risk.

The paper's main contribution is releasing WMDP, a set of 3,668 filtered multiple-choice questions on biosecurity, cybersecurity, and chemical security, plus the RMU unlearning method that tries to suppress performance on those questions. Making this public is the useful part, since prior evaluations stayed private and blocked follow-on work. The authors also show RMU can lower WMDP scores while keeping general biology and computer science performance roughly intact, which at least gives people a concrete starting point for testing unlearning ideas. That combination of benchmark and method is new enough to matter for anyone working on capability measurement or mitigation.

The soft spot is the proxy question itself. The questions went through heavy filtering to remove sensitive details, and the format is multiple choice rather than open-ended synthesis or protocol design. If a model can still reason about the underlying material but just fails the test items, then lower WMDP numbers do not necessarily mean lower real-world risk. The abstract does not give the quantitative tables or error bars, so the strength of the preservation claim is hard to judge without the full results.

For readers focused on AI safety benchmarks or unlearning techniques, the paper is worth reading and citing for the dataset alone. It is coherent on its own terms and engages the right literature, so it deserves a serious referee rather than a desk reject. I would bring it to a reading group to discuss the proxy validity issue.

Referee Report

2 major / 1 minor

Summary. The paper introduces the publicly released WMDP benchmark, a dataset of 3,668 multiple-choice questions spanning biosecurity, cybersecurity, and chemical security, developed as a proxy for hazardous knowledge in LLMs. It proposes RMU, a representation-control unlearning method, and reports that RMU lowers WMDP accuracy while preserving general capabilities in biology and computer science, with the benchmark and code released at https://wmdp.ai.

Significance. If the empirical claims hold, the work is significant because it supplies the first open benchmark for hazardous capabilities, directly addressing the limitation that existing evaluations are private. The public release of both the dataset and RMU code enables reproducible research on unlearning, and the suggestion that targeted representation control can reduce proxy risk without broad capability loss provides a concrete, testable direction for AI safety.

major comments (2)

§3 (Benchmark Construction and Filtering): The stringently filtered MCQ items remove sensitive details by design, yet the central claim that WMDP serves as a proxy for real-world malicious use rests on the untested assumption that reduced MCQ accuracy implies reduced ability to synthesize or apply hazardous knowledge in open-ended settings. No ablation or external validation (e.g., expert red-teaming on retained items) is provided to show that the retained questions remain sufficient for misuse.
§4 (RMU Experiments): The abstract and results claim that RMU reduces WMDP performance while maintaining general biology/CS capabilities, but no quantitative tables, error bars, baseline comparisons (e.g., gradient ascent, fine-tuning), or statistical tests are referenced in the provided summary; without these, the preservation claim cannot be evaluated for robustness or effect size.

minor comments (1)

Add explicit section numbers and equation references throughout for easier navigation; several cross-references in the methods appear to rely on implicit numbering.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. We address each major comment point by point below, indicating where revisions will be incorporated.

Referee: §3 (Benchmark Construction and Filtering): The stringently filtered MCQ items remove sensitive details by design, yet the central claim that WMDP serves as a proxy for real-world malicious use rests on the untested assumption that reduced MCQ accuracy implies reduced ability to synthesize or apply hazardous knowledge in open-ended settings. No ablation or external validation (e.g., expert red-teaming on retained items) is provided to show that the retained questions remain sufficient for misuse.

Authors: We agree that WMDP functions as a proxy and that direct validation linking MCQ accuracy to open-ended synthesis capabilities would strengthen the work. However, performing such validations (e.g., expert red-teaming on retained items) would require testing actual hazardous knowledge application, which raises insurmountable ethical and safety barriers. The questions were developed and filtered in consultation with domain experts precisely to retain proxy relevance while eliminating actionable details. The public release at wmdp.ai is intended to enable safe, community-driven follow-up studies. We will add an expanded limitations subsection discussing the proxy assumption and outlining directions for future validation.
Revision: partial

Referee: §4 (RMU Experiments): The abstract and results claim that RMU reduces WMDP performance while maintaining general biology/CS capabilities, but no quantitative tables, error bars, baseline comparisons (e.g., gradient ascent, fine-tuning), or statistical tests are referenced in the provided summary; without these, the preservation claim cannot be evaluated for robustness or effect size.

Authors: The full manuscript (Section 4 and associated tables) already contains the requested details: accuracy tables showing WMDP drops (with standard deviations across 3–5 seeds), direct comparisons to gradient ascent and fine-tuning baselines, MMLU biology/CS subset scores demonstrating capability preservation, and statistical tests (paired t-tests) confirming significance. These results are referenced in the results narrative. We will revise the abstract and introduction summary to explicitly cite the relevant tables and effect sizes so the quantitative support is immediately visible.
Revision: yes
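The rebuttal mentions paired t-tests over a handful of seeds. As a reminder of what that statistic measures, here is a minimal standard-library sketch. The function name is hypothetical and any scores passed to it are the caller's own; no numbers from the paper are reproduced here.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(before: Sequence[float], after: Sequence[float]) -> Tuple[float, int]:
    """Return (t statistic, degrees of freedom) for paired per-seed scores."""
    if len(before) != len(after):
        raise ValueError("scores must be paired seed by seed")
    if len(before) < 2:
        raise ValueError("need at least two seeds")
    # Per-seed drop in accuracy: score before unlearning minus score after.
    diffs = [b - a for b, a in zip(before, after)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd_diff == 0.0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

With only 3 to 5 seeds the test has 2 to 4 degrees of freedom, so the critical values are large and a reported p-value should be read together with the effect size. Turning the statistic into a p-value requires the t distribution, for example via `scipy.stats.ttest_rel`.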

Circularity Check

0 steps flagged

No significant circularity: empirical benchmark release and experimental results

The paper releases a public MCQ benchmark (WMDP) and evaluates an unlearning method (RMU) via direct performance measurements on held-out test sets and general capability benchmarks. No mathematical derivation chain exists; results are reported from experiments rather than fitted parameters or self-referential definitions. Central claims rest on observed accuracy drops on WMDP items and maintained scores on MMLU-style biology/CS questions, which are independently verifiable and not reduced to the inputs by construction. Self-citations, if present, are not load-bearing for the empirical findings.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim rests on the domain assumption that WMDP accurately proxies real hazardous capabilities after filtering and that representation control in RMU produces durable unlearning without side effects.

axioms (1)

Domain assumption: WMDP questions serve as a valid proxy for hazardous knowledge and, after stringent filtering, do not leak sensitive information.
Basis: the abstract states that WMDP was developed by a consortium of academics and technical consultants and stringently filtered prior to public release.

pith-pipeline@v0.9.0 · 5807 in / 1195 out tokens · 34589 ms · 2026-05-15T19:51:44.059609+00:00 · methodology

Forward citations

Cited by 47 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RefusalBench: Why Refusal Rate Misranks Frontier LLMs on Biological Research Prompts
cs.SE 2026-05 conditional novelty 8.0

RefusalBench shows strict refusal rates fail to rank frontier LLMs correctly on biological safety, with provider effects and partial-compliance patterns that binary metrics miss.
Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models
cs.CR 2026-05 conditional novelty 8.0

Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.
Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning
cs.LG 2024-04 conditional novelty 8.0

NPO enables stable unlearning of 50%+ training data in LLMs on TOFU by making collapse exponentially slower than gradient ascent, preserving sensible outputs where prior methods fail.
TimeROME-DLM: Temporal Causal Tracing and Low-Rank Inference-Time Knowledge Editing for Masked Diffusion Language Models
cs.LG 2026-06 unverdicted novelty 7.0

TimeROME-DLM enables training-free knowledge editing in masked diffusion language models via temporal causal tracing and low-rank residual edit memory applied at inference time.
TRACER: Token ReAssignment for Concept ERasure in Generative Recommendation
cs.IR 2026-06 unverdicted novelty 7.0

TRACER uses token reassignment for concept-related items plus a coherence regularizer to unlearn specific concepts in generative recommendation while preserving utility better than baselines.
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety
cs.CL 2026-05 unverdicted novelty 7.0

Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety
cs.CL 2026-05 unverdicted novelty 7.0

Boiling the Frog is a new stateful multi-turn benchmark for agentic safety that reports an aggregate strict attack success rate of 44.4% across nine models, with rates ranging from 20.5% to 92.9% depending on the mode...
Inducing Artificial Uncertainty in Language Models
cs.CL 2026-05 unverdicted novelty 7.0

Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.
Erase Persona, Forget Lore: Benchmarking Multimodal Copyright Unlearning in Large Vision Language Models
cs.CV 2026-05 unverdicted novelty 7.0

CoVUBench is the first benchmark framework for evaluating multimodal copyright unlearning in LVLMs via synthetic data, systematic variations, and a dual protocol for forgetting efficacy and utility preservation.
Jailbroken Frontier Models Retain Their Capabilities
cs.LG 2026-04 unverdicted novelty 7.0

Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.
Is your algorithm unlearning or untraining?
cs.LG 2026-04 conditional novelty 7.0

Machine unlearning conflates reversing the influence of specific training examples (untraining) with removing the full underlying distribution or behavior (unlearning).
Towards Knowledge Alignment in Code LLMs: Contrastive Unlearning for Evolving APIs
cs.SE 2026-06 unverdicted novelty 6.0

CURE applies contrastive unlearning to reduce deprecated API usage in code LLMs and improve correct replacements on a benchmark dataset while preserving general performance.
RepSelect: Robust LLM Unlearning via Representation Selectivity
cs.CL 2026-06 unverdicted novelty 6.0

RepSelect isolates forget-set-specific representations via gradient PCA collapse to achieve 4-50x better post-relearning robustness than baselines across multiple models and forget categories.
"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms
cs.AI 2026-06 unverdicted novelty 6.0

Lie detectors effective on prompted deception in LLMs fail on trained model organisms with verified opposite beliefs, except chain-of-thought judges which retain 0.82 balanced accuracy partly due to verification artifacts.
CogManip: Benchmarking Manipulative Behavior in Multi-Turn Interactions with Large Language Model
cs.AI 2026-06 unverdicted novelty 6.0

CogManip is a benchmark that tests 13 LLMs on 15 manipulation risks in 1,000 multi-turn dialogues, finding heterogeneous risks and prompt sensitivity in models like DeepSeek-V3.2.
Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs
cs.AI 2026-05 conditional novelty 6.0

Introduces MOOD benchmark for OOD LLM alignment failures and shows guard models plus Mahalanobis and perplexity OOD detectors improve recall from 39% to 45% with positive scaling.
Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs
cs.AI 2026-05 unverdicted novelty 6.0

MOOD benchmark shows guard models fail to generalize to OOD alignment failures in LLMs, but combining them with Mahalanobis and perplexity OOD detectors improves recall from 39% to 45% with better scaling than larger ...
State Contamination in Memory-Augmented LLM Agents
cs.AI 2026-05 unverdicted novelty 6.0

Toxic context can be laundered into memory summaries that stay below toxicity thresholds while still driving higher downstream toxicity in LLM agents compared to neutral baselines.
Robust LLM Unlearning Against Relearning Attacks: The Minor Components in Representations Matter
cs.CL 2026-05 unverdicted novelty 6.0

Targeting minor components in LLM representations during unlearning yields substantially better resistance to relearning attacks than prior methods.
Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies
cs.AI 2026-04 unverdicted novelty 6.0

A separable expert architecture uses base models, LoRA adapters, and deletable per-user proxies to enable privacy-preserving personalization and deterministic unlearning in LLMs.
CAP: Controllable Alignment Prompting for Unlearning in LLMs
cs.LG 2026-04 unverdicted novelty 6.0

CAP enables reversible unlearning of targeted knowledge in LLMs through optimized prompts generated via reinforcement learning, without any parameter updates.
CAP: Controllable Alignment Prompting for Unlearning in LLMs
cs.LG 2026-04 unverdicted novelty 6.0

CAP is a reinforcement-learning-driven prompt optimization framework that suppresses target knowledge in LLMs while preserving general capabilities, enabling reversible unlearning without any parameter updates.
CAP: Controllable Alignment Prompting for Unlearning in LLMs
cs.LG 2026-04 unverdicted novelty 6.0

CAP optimizes prompts via reinforcement learning to selectively unlearn target knowledge in LLMs while preserving general capabilities, without any parameter updates and with reversible revocation.
Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks
cs.CR 2026-04 unverdicted novelty 6.0

Different LLM jailbreak techniques achieve similar harmful compliance but lead to distinct behavioral side effects and mechanistic changes.
WIN-U: Woodbury-Informed Newton-Unlearning as a retain-free Machine Unlearning Framework
cs.LG 2026-04 unverdicted novelty 6.0

WIN-U delivers a retain-free unlearning update that approximates the gold-standard retrained model via a Woodbury-informed Newton step using only forget-set curvature information.
Latent Instruction Representation Alignment: defending against jailbreaks, backdoors and undesired knowledge in LLMs
cs.LG 2026-04 unverdicted novelty 6.0

LIRA aligns latent instruction representations in LLMs to defend against jailbreaks, backdoors, and undesired knowledge, blocking over 99% of PEZ attacks and achieving optimal WMDP forgetting.
Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts
cs.CR 2026-04 unverdicted novelty 6.0

Swiss-Bench 003 extends an existing Swiss LLM assessment with two new dimensions and evaluates ten models on 808 items, finding high self-graded reliability scores but low adversarial security scores.
Efficient machine unlearning with minimax optimality
stat.ML 2026-04 unverdicted novelty 6.0

ULS provides minimax-optimal estimation of remaining-data parameters in machine unlearning with limited access and decomposes error into oracle plus unlearning cost terms.
Chain-of-Authorization: Embedding authorization into large language models
cs.AI 2026-03 unverdicted novelty 6.0

LLMs fine-tuned to output authorization trajectories as a prerequisite for responses achieve high rejection rates for unauthorized prompts while preserving utility in allowed scenarios.
SafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyond
cs.LG 2026-03 conditional novelty 6.0

SafeSci creates a large objective benchmark and training resource that reveals safety weaknesses in current LLMs for science and demonstrates measurable improvement through targeted fine-tuning.
The Impact of Off-Policy Training Data on Probe Generalisation
cs.AI 2025-11 unverdicted novelty 6.0

Off-policy training data for LLM behavior probes causes significant generalization failures especially for intent-based behaviors like deception, and performance on coerced incentivised data correlates with real on-po...
Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in LLM Unlearning
cs.LG 2025-10 conditional novelty 6.0

Downgrading optimizers to lower-information variants during LLM unlearning yields more robust forgetting on MUSE and WMDP benchmarks by converging to harder-to-perturb loss basins.
OFMU: Optimization-Driven Framework for Machine Unlearning
cs.LG 2025-09 unverdicted novelty 6.0

A penalty-based bi-level optimization framework for machine unlearning that decorrelates forget and retention gradients via inner maximization and restores utility via outer minimization, with convergence guarantees a...
Benchmarking Misuse Mitigation Against Covert Adversaries
cs.CR 2025-06 unverdicted novelty 6.0

Develops the BSD data generation pipeline and two new datasets to evaluate decomposition attacks as effective misuse enablers and stateful defenses as a countermeasure in language model safety.
Mechanism-Guided Selective Unlearning for RLVR-Induced Reasoning
cs.LG 2026-06 unverdicted novelty 5.0

MAST ranks attention-projection tensors by off-principal energy, update magnitude, and forget-gradient coupling to selectively unlearn RLVR-induced reasoning, achieving significant forgetting on MATH while preserving ...
Safe-RULE: Safe Reinforcement UnLEarning
cs.LG 2026-06 unverdicted novelty 5.0

Safe-RULE introduces a reinforcement unlearning defense for offline safe RL that counters data poisoning by removing malicious data influence while preserving task performance and safety.
Palette: A Modular, Controllable, and Efficient Framework for On-demand Authorized Safety Alignment Relaxation in LLMs
cs.AI 2026-05 unverdicted novelty 5.0

Palette identifies refusal directions via multi-objective search, internalizes them through lightweight adaptation, and supports on-demand multi-domain authorization via independent learning and parameter merging.
Calibration vs Decision Making: Revisiting the Reliability Paradox in Unlearned Language Models
cs.CL 2026-05 unverdicted novelty 5.0

Unlearned language models retain low calibration error but show increased shortcut reliance on the TOFU benchmark, extending the reliability paradox to machine unlearning.
Do Linear Probes Generalize Better in Persona Coordinates?
cs.AI 2026-05 unverdicted novelty 5.0

Persona axes derived from contrastive prompts and PCA yield linear probes that generalize better than raw-activation probes across 10 datasets for deception and sycophancy.
Do Linear Probes Generalize Better in Persona Coordinates?
cs.AI 2026-05 unverdicted novelty 5.0

Probes on persona principal components from contrastive prompts generalize better than raw activation probes for harmful behaviors across 10 datasets.
Revisiting the Past: Data Unlearning with Model State History
cs.LG 2025-06 unverdicted novelty 5.0

MSA performs data unlearning in LLMs by arithmetic operations on prior model checkpoints to remove targeted datapoint influence, with experiments showing competitive or better results than existing unlearning methods.
Humanity's Last Exam
cs.LG 2025-01 unverdicted novelty 5.0

Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
Risk Reporting for Developers' Internal AI Model Use
cs.CY 2026-04 unverdicted novelty 4.0

A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
cs.CL 2025-07 unverdicted novelty 4.0

Gemini 2.5 Pro and Flash models are presented as achieving frontier performance in reasoning, coding, and long-context multimodal tasks while spanning a cost-capability Pareto curve.
Prioritizing High-Consequence Biological Capabilities in Evaluations of Artificial Intelligence Models
cs.CY 2024-05 unverdicted novelty 4.0

AI model evaluations for biological capabilities should prioritize high-consequence risks like pandemics, informed by life sciences dual-use experience, and occur prior to deployment to enable biosafety measures.
The Agentic Web Requires New Normative Infrastructure
cs.CY 2026-06 unverdicted novelty 3.0

The agentic web requires new normative infrastructure of laws, norms, and practices to allow user-delegated AI agents to access online properties without being blocked as malicious bots.
Muse Spark Safety & Preparedness Report
cs.CY 2026-05 unverdicted novelty 2.0

Meta's safety report states that Muse Spark meets acceptable risk thresholds for release after mitigations reduced elevated pre-mitigation risks in chemical and biological domains.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · cited by 42 Pith papers

[1]

and Lucas, Caleb and Guest, Ella , institution =

doi: 10 .48550/arXiv.2402.10058. Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, and Dylan Hadfield-Menell. Eight methods to evaluate robust unlearning in llms, 2024. Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C. Lipton, and J. Zico Kolter. Tofu: A task of fictitious unlearning for llms, 2024. Mantas Mazeika, Long Phan, Xuwang Yin, And...

work page doi:10.7249/rra2977-2 2024
[2]

Overview. How is this work intended to reduce existential risks from advanced AI systems? Answer: This work aims to mitigate existential risks posed by the malicious use of LLMs in developing bioweapons and cyber weapons. WMDP serves both as a metric for evaluating the presence of hazardous knowledge, and as a benchmark for testing unlearning methods. We ...

work page 2023
[3]

Direct Effects. If this work directly reduces existential risks, what are the main hazards, vulnerabilities, or failure modes that it directly affects? 29 Answer: WMDP increases the barrier of entry for malicious actors to cause catastrophic harm. It decreases access to models with hazardous biological or cyber capabilities, reducing the number of malicio...

work page
[4]

Diffuse Effects. If this work reduces existential risks indirectly or diffusely, what are the main contributing factors that it affects? Answer: Unlearning on WMDP reduces the risks of language model aided cyberattacks, particularly from low-skilled malicious actors. Cyberattacks, particularly on critical infras- tructure, could be catastrophic. They are ...

work page 2024
[5]

What’s at Stake?What is a future scenario in which this research direction could prevent the sudden, large-scale loss of life? If not applicable, what is a future scenario in which this research direction be highly beneficial? Answer: This directly reduces x-risks associated with the malicious use of language models in developing weapons of mass destructi...

work page 2022
[6]

Do the findings rest on strong theoretical assumptions; are they not demonstrated using leading-edge tasks or models; or are the findings highly sensitive to hyperparameters? □

Result Fragility. Do the findings rest on strong theoretical assumptions; are they not demonstrated using leading-edge tasks or models; or are the findings highly sensitive to hyperparameters? □

work page
[7]

Is it implausible that any practical system could ever markedly outper- form humans at this task? ⊠

Problem Difficulty. Is it implausible that any practical system could ever markedly outper- form humans at this task? ⊠

work page
[8]

Does this approach strongly depend on handcrafted features, expert supervision, or human reliability? □

Human Unreliability. Does this approach strongly depend on handcrafted features, expert supervision, or human reliability? □

work page
[9]

Competitive Pressures. Does work towards this approach strongly trade off against raw intelligence, other general capabilities, or economic utility? □ E.2 Safety-Capabilities Balance In this section, please analyze how this work relates to general capabilities and how it affects the balance between safety and hazards from general capabilities

work page
[10]

Overview. How does this improve safety more than it improves general capabilities? Answer: Unlearning does not improve general capabilities; rather, it removes specific model capabilities while improving inherent model safety

work page
[11]

Red Teaming. What is a way in which this hastens general capabilities or the onset of x-risks? Answer: Although WMDP is constructed as a benchmark for measuring and reducing inherent model hazards, it may inadvertently serve as a roadmap for malicious use, hastening the onset of x-risks by lowering the barrier for causing catastrophe. To reduce these risk...

work page
[12]

Does this work advance progress on tasks that have been previously considered the subject of usual capabilities research? □

General Tasks. Does this work advance progress on tasks that have been previously considered the subject of usual capabilities research? □

work page
[13]

General Goals. Does this improve or facilitate research towards general prediction, clas- sification, state estimation, efficiency, scalability, generation, data compression, executing clear instructions, helpfulness, informativeness, reasoning, planning, researching, optimiza- tion, (self-)supervised learning, sequential decision making, recursive self-i...

work page
[14]

Correlation with General Aptitude.Is the analyzed capability known to be highly predicted by general cognitive ability or educational attainment? □

Safety via Capabilities. Does this advance safety along with, or as a consequence of, advancing other capabilities or the study of AI? □

E.3 Elaborations and Other Considerations

Other. What clarifications or uncertainties about this work and x-risk are worth mentioning? Answer: While unlearning is an important intervention for reducing model hazards, unlearning may reduce the defensive, or beneficial, applications in those areas. Unlearning should be complemented with other interventions that reduce risk (Appendix D).

Overview. How is this work intended to reduce existential risks from advanced AI systems? Answer: This work aims to mitigate existential risks posed by the malicious use of LLMs in developing bioweapons and cyber weapons. WMDP serves both as a metric for evaluating the presence of hazardous knowledge, and as a benchmark for testing unlearning methods. We ...

Direct Effects. If this work directly reduces existential risks, what are the main hazards, vulnerabilities, or failure modes that it directly affects? Answer: WMDP increases the barrier of entry for malicious actors to cause catastrophic harm. It decreases access to models with hazardous biological or cyber capabilities, reducing the number of malicio...

Diffuse Effects. If this work reduces existential risks indirectly or diffusely, what are the main contributing factors that it affects? Answer: Unlearning on WMDP reduces the risks of language-model-aided cyberattacks, particularly from low-skilled malicious actors. Cyberattacks, particularly on critical infrastructure, could be catastrophic. They are ...

What’s at Stake? What is a future scenario in which this research direction could prevent the sudden, large-scale loss of life? If not applicable, what is a future scenario in which this research direction could be highly beneficial? Answer: This directly reduces x-risks associated with the malicious use of language models in developing weapons of mass destructi...

Result Fragility. Do the findings rest on strong theoretical assumptions; are they not demonstrated using leading-edge tasks or models; or are the findings highly sensitive to hyperparameters? □

Problem Difficulty. Is it implausible that any practical system could ever markedly outperform humans at this task? ⊠
