MUSE: Machine Unlearning Six-Way Evaluation for Language Models

Ari Holtzman; Chiyuan Zhang; Daogao Liu; Jaechan Lee; Jieyu Zhao; Luke Zettlemoyer; Noah A. Smith; Sadhika Malladi; Weijia Shi; Yangsibo Huang

arxiv: 2407.06460 · v2 · pith:NFZUJRGFnew · submitted 2024-07-08 · 💻 cs.CL · cs.AI

Weijia Shi, Jaechan Lee, Yangsibo Huang, Sadhika Malladi, Jieyu Zhao, Ari Holtzman, Daogao Liu, Luke Zettlemoyer, Noah A. Smith, Chiyuan Zhang



keywords: unlearning, algorithms, data, models, memorization, removal, benchmark, evaluation


Language models (LMs) are trained on vast amounts of text data, which may include private and copyrighted content. Data owners may request the removal of their data from a trained model due to privacy or copyright concerns. However, exactly unlearning only these datapoints (i.e., retraining with the data removed) is intractable in modern-day models. This has led to the development of many approximate unlearning algorithms. The evaluation of the efficacy of these algorithms has traditionally been narrow in scope, failing to precisely quantify the success and practicality of the algorithm from the perspectives of both the model deployers and the data owners. We address this issue by proposing MUSE, a comprehensive machine unlearning evaluation benchmark that enumerates six diverse desirable properties for unlearned models: (1) no verbatim memorization, (2) no knowledge memorization, (3) no privacy leakage, (4) utility preservation on data not intended for removal, (5) scalability with respect to the size of removal requests, and (6) sustainability over sequential unlearning requests. Using these criteria, we benchmark how effectively eight popular unlearning algorithms on 7B-parameter LMs can unlearn Harry Potter books and news articles. Our results demonstrate that most algorithms can prevent verbatim memorization and knowledge memorization to varying degrees, but only one algorithm does not lead to severe privacy leakage. Furthermore, existing algorithms fail to meet deployers' expectations because they often degrade general model utility and also cannot sustainably accommodate successive unlearning requests or large-scale content removal. Our findings identify key issues with the practicality of existing unlearning algorithms on language models, and we release our benchmark to facilitate further evaluations: muse-bench.github.io
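
The abstract names "no verbatim memorization" as the first criterion but does not say how it is scored. As a rough illustration only, the sketch below assumes one possible proxy: an F1 score over the longest common subsequence (LCS) between a model's continuation and the original text it was asked to forget. The function names `lcs_length` and `overlap_f1` are hypothetical and are not taken from the benchmark.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbour.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def overlap_f1(generated, reference):
    """LCS-based F1 between whitespace-tokenized strings (assumed proxy metric)."""
    gen, ref = generated.split(), reference.split()
    if not gen or not ref:
        return 0.0
    lcs = lcs_length(gen, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(gen)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Under this assumed proxy, a score near 1 would suggest the model still reproduces the forget text almost word for word, and a score near 0 would suggest little surface overlap. It says nothing about the other five criteria, which need separate measurements.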


Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning
cs.CV 2026-04 conditional novelty 8.0

VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.
TimeROME-DLM: Temporal Causal Tracing and Low-Rank Inference-Time Knowledge Editing for Masked Diffusion Language Models
cs.LG 2026-06 unverdicted novelty 7.0

TimeROME-DLM enables training-free knowledge editing in masked diffusion language models via temporal causal tracing and low-rank residual edit memory applied at inference time.
REMEDI: A Benchmark for Retention and Unlearning Evaluation in Multi-label Clinical Disease Inference
cs.LG 2026-06 unverdicted novelty 7.0

REMEDI is a new benchmark for evaluating machine unlearning in multi-label clinical disease inference on MIMIC-III data that reveals trade-offs in existing methods.
Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation
cs.CL 2026-05 unverdicted novelty 7.0

New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.
ICU-Bench: Benchmarking Continual Unlearning in Multimodal Large Language Models
cs.AI 2026-05 unverdicted novelty 7.0

ICU-Bench is a new continual unlearning benchmark for MLLMs using 1000 privacy profiles, 9500 images, and 100 forget tasks, showing existing methods fail to balance forgetting, utility, and scalability.
Metric Unreliability in Multimodal Machine Unlearning: A Systematic Analysis and Principled Unified Score
cs.CV 2026-05 unverdicted novelty 7.0

Standard metrics for multimodal machine unlearning conflict in rankings, addressed by a new oracle-correlated composite score that yields stable results.
Less is More: Geometric Unlearning for LLMs with Minimal Data Disclosure
cs.CL 2026-05 unverdicted novelty 7.0

Geometric Unlearning suppresses specific knowledge in LLMs by projecting hidden planning states onto a low-rank safe geometry derived from minimal reference prompts.
Revisiting Privacy Leakage in Machine Unlearning: Membership Inference Beyond the Forgotten Set
cs.CR 2026-05 unverdicted novelty 7.0

Unlearning increases privacy leakage for the retain set, and a new tri-class membership inference attack distinguishes forget, retain, and unseen data using pre- and post-unlearning model outputs.
Revisiting Privacy Leakage in Machine Unlearning: Membership Inference Beyond the Forgotten Set
cs.CR 2026-05 unverdicted novelty 7.0

TC-UMIA is a population-level attack using pre- and post-unlearning predictions to infer membership across forget, retain, and unseen sets, revealing added privacy leakage to retained data.
Is your algorithm unlearning or untraining?
cs.LG 2026-04 conditional novelty 7.0

Machine unlearning conflates reversing the influence of specific training examples (untraining) with removing the full underlying distribution or behavior (unlearning).
Heuresis: Search Strategies for Autonomous AI Research Agents Across Quality, Diversity and Novelty
cs.AI 2026-06 unverdicted novelty 6.0

Heuresis evaluates six search strategies for autonomous ML research agents and finds that novel ideas are rare, none rated original, and only one reaches top-10 quality while strategies steer axes but do not expand th...
Heuresis: Search Strategies for Autonomous AI Research Agents Across Quality, Diversity and Novelty
cs.AI 2026-06 accept novelty 6.0

Heuresis evaluates six search strategies for LLM research agents and shows they steer ideas along quality-diversity-novelty axes but fail to generate novel ideas that match or exceed known high-performing recipes.
RepSelect: Robust LLM Unlearning via Representation Selectivity
cs.CL 2026-06 unverdicted novelty 6.0

RepSelect isolates forget-set-specific representations via gradient PCA collapse to achieve 4-50x better post-relearning robustness than baselines across multiple models and forget categories.
Validity Threats for Foundation Model Research
cs.LG 2026-06 accept novelty 6.0

Maps common low-compute research strategies for foundation models onto statistical, internal, external, and construct validity threats via a causal-inference lens.
Fast Unlearning at Scale via Margin Self-Correction
cs.LG 2026-06 unverdicted novelty 6.0

MASC achieves competitive forget-retain trade-offs in language model unlearning at lower computational cost via margin self-correction and an online stopping criterion on TOFU, MUSE News, and MUSE Books.
De-attribute to Forget for LLM Unlearning
cs.LG 2026-05 unverdicted novelty 6.0

DareU reframes LLM unlearning as zeroing data attribution via RL rewards from an LLM classifier approximation, claiming better balance of forget quality and model utility than loss-based baselines.
Robust LLM Unlearning Against Relearning Attacks: The Minor Components in Representations Matter
cs.CL 2026-05 unverdicted novelty 6.0

Targeting minor components in LLM representations during unlearning yields substantially better resistance to relearning attacks than prior methods.
Less is More: Geometric Unlearning for LLMs with Minimal Data Disclosure
cs.CL 2026-05 unverdicted novelty 6.0

Geometric Unlearning distills a low-rank safe subspace from reference prompts and applies projection-based alignment on synthetic anchors to suppress target content while preserving non-target utility.
WIN-U: Woodbury-Informed Newton-Unlearning as a retain-free Machine Unlearning Framework
cs.LG 2026-04 unverdicted novelty 6.0

WIN-U delivers a retain-free unlearning update that approximates the gold-standard retrained model via a Woodbury-informed Newton step using only forget-set curvature information.
Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in LLM Unlearning
cs.LG 2025-10 conditional novelty 6.0

Downgrading optimizers to lower-information variants during LLM unlearning yields more robust forgetting on MUSE and WMDP benchmarks by converging to harder-to-perturb loss basins.
MAAT: Multi-phase Adapter-Aware Targeted Unlearning
cs.LG 2026-05 unverdicted novelty 5.0

Introduces 5WBENCH balanced benchmark across 5W categories and MAAT three-phase adapter unlearning method that targets causal Why-type knowledge.
Calibration vs Decision Making: Revisiting the Reliability Paradox in Unlearned Language Models
cs.CL 2026-05 unverdicted novelty 5.0

Unlearned language models retain low calibration error but show increased shortcut reliance on the TOFU benchmark, extending the reliability paradox to machine unlearning.
Metric Unreliability in Multimodal Machine Unlearning: A Systematic Analysis and Principled Unified Score
cs.CV 2026-05 unverdicted novelty 5.0

Standard unlearning metrics disagree in multimodal settings, but a correlation-weighted Unified Quality Score delivers consistent method rankings across benchmarks.
Revisiting the Past: Data Unlearning with Model State History
cs.LG 2025-06 unverdicted novelty 5.0

MSA performs data unlearning in LLMs by arithmetic operations on prior model checkpoints to remove targeted datapoint influence, with experiments showing competitive or better results than existing unlearning methods.
PreUnlearn: Auditing Collateral Knowledge Damage Before Large Language Model Unlearning
cs.CL 2026-06 unverdicted novelty 4.0

Collateral damage from LLM unlearning decays with semantic distance but persists across domains and can be predicted pre-unlearning from forget-evaluation set interaction features.