TOFU: A Task of Fictitious Unlearning for LLMs
31 Pith papers cite this work. Polarity classification is still indexing.
abstract
Large language models trained on massive corpora of data from the web can memorize and reproduce sensitive or private data, raising both legal and ethical concerns. Unlearning, or tuning models to forget information present in their training data, provides us with a way to protect private data after training. Although several methods exist for such unlearning, it is unclear to what extent they result in models equivalent to those where the data to be forgotten was never learned in the first place. To address this challenge, we present TOFU, a Task of Fictitious Unlearning, as a benchmark aimed at helping deepen our understanding of unlearning. We offer a dataset of 200 diverse synthetic author profiles, each consisting of 20 question-answer pairs, and a subset of these profiles called the forget set that serves as the target for unlearning. We compile a suite of metrics that work together to provide a holistic picture of unlearning efficacy. Finally, we provide a set of baseline results from existing unlearning algorithms. Importantly, none of the baselines we consider show effective unlearning, motivating continued efforts to develop approaches for unlearning that effectively tune models so that they truly behave as if they were never trained on the forget data at all.
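The dataset layout described in the abstract can be sketched in a few lines of Python. This is an illustrative sketch only, not the benchmark's actual loader: the profile contents are placeholders, the helper names `build_profiles` and `split_forget_retain` are hypothetical, and the forget fraction of 5% is an assumption, since the abstract does not state the size of the forget set.

```python
import random

NUM_AUTHORS = 200        # from the abstract
QA_PER_AUTHOR = 20       # from the abstract
FORGET_FRACTION = 0.05   # assumption: the abstract does not give the forget-set size


def build_profiles(num_authors=NUM_AUTHORS, qa_per_author=QA_PER_AUTHOR):
    """Create placeholder profiles: author id -> list of (question, answer) pairs."""
    return {
        author_id: [
            (f"question {i} about author {author_id}", f"answer {i}")
            for i in range(qa_per_author)
        ]
        for author_id in range(num_authors)
    }


def split_forget_retain(profiles, forget_fraction=FORGET_FRACTION, seed=0):
    """Split whole profiles (not individual QA pairs) into forget and retain sets."""
    author_ids = sorted(profiles)
    rng = random.Random(seed)
    rng.shuffle(author_ids)
    num_forget = round(len(author_ids) * forget_fraction)
    forget_ids = set(author_ids[:num_forget])
    forget = {a: profiles[a] for a in author_ids if a in forget_ids}
    retain = {a: profiles[a] for a in author_ids if a not in forget_ids}
    return forget, retain


profiles = build_profiles()
forget_set, retain_set = split_forget_retain(profiles)
total_pairs = sum(len(qa) for qa in profiles.values())
print(total_pairs, len(forget_set), len(retain_set))
```

Splitting at the profile level rather than the question level follows the abstract's description of the forget set as "a subset of these profiles"; the remaining profiles act as the data the model is expected to keep.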
citation roles: background 3
citation polarities: background 3
citing papers explorer
-
Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models
Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.
-
Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning
VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.
-
Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning
NPO enables stable unlearning of 50%+ training data in LLMs on TOFU by making collapse exponentially slower than gradient ascent, preserving sensible outputs where prior methods fail.
-
Machine Unlearning for Masked Diffusion Language Models
MDU minimizes forward KL divergence from prompt-conditional to prompt-masked unconditional predictions at masked positions to unlearn knowledge in MDLMs while trading off privacy and utility via temperature scaling.
-
Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation
New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.
-
PPU-Bench: Real World Benchmark for Personalized Partial Unlearning in Vision Language Models
PPU-Bench is a real-world benchmark exposing forget-retain trade-offs in MLLM unlearning and motivating Boundary-Aware Optimization to enforce intra-subject factual boundaries.
-
ICU-Bench: Benchmarking Continual Unlearning in Multimodal Large Language Models
ICU-Bench is a new continual unlearning benchmark for MLLMs using 1000 privacy profiles, 9500 images, and 100 forget tasks, showing existing methods fail to balance forgetting, utility, and scalability.
-
Erase Persona, Forget Lore: Benchmarking Multimodal Copyright Unlearning in Large Vision Language Models
CoVUBench is the first benchmark framework for evaluating multimodal copyright unlearning in LVLMs via synthetic data, systematic variations, and a dual protocol for forgetting efficacy and utility preservation.
-
Is your algorithm unlearning or untraining?
Machine unlearning conflates reversing the influence of specific training examples (untraining) with removing the full underlying distribution or behavior (unlearning).
-
Copyright Laundering Through the AI Ouroboros: Adapting the 'Fruit of the Poisonous Tree' Doctrine to Recursive AI Training
The paper introduces an AI-FOPT standard that presumes copyright infringement taint in models derived from an infringing foundational model unless developers prove independent lawful sourcing.
-
FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics
FML-Bench shows that a simple greedy hill-climber performs nearly as well as complex tree-search agents on ML research tasks, with an adaptive strategy that switches exploration modes outperforming all tested agents.
-
Auditing Reasoning-Trace Memorization Claims after Unlearning with Head-Conditioned Canaries
Swapping the reasoning trace prefill on unlearned weights can replicate or reverse the parser-split bypass gap, showing that the gap alone does not identify or rule out weight-level memorization.
-
State Contamination in Memory-Augmented LLM Agents
Toxic context can be laundered into memory summaries that stay below toxicity thresholds while still driving higher downstream toxicity in LLM agents compared to neutral baselines.
-
ASRU: Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models
ASRU combines activation redirection and reward-optimized fine-tuning to unlearn cross-modal sensitive knowledge in MLLMs, reporting +24.6% better unlearning effectiveness and 5.8x higher generation quality on Qwen3-VL while preserving utility with limited retained data.
-
Inference-Time Machine Unlearning via Gated Activation Redirection
GUARD-IT performs machine unlearning in LLMs via input-dependent activation steering at inference time, matching or exceeding gradient-based baselines on TOFU and MUSE while preserving utility and working under quantization.
-
Early Data Exposure Improves Robustness to Subsequent Fine-Tuning
Early mixing of post-training data into pretraining improves retention of acquired capabilities after subsequent fine-tuning in language models.
-
Null Space Constrained Contrastive Visual Forgetting for MLLM Unlearning
A contrastive visual forgetting technique constrained to the null space of retained knowledge enables targeted unlearning of visual concepts in MLLMs while preserving non-target visual and all textual knowledge.
-
CAP: Controllable Alignment Prompting for Unlearning in LLMs
CAP is a reinforcement-learning-driven prompt optimization framework that suppresses target knowledge in LLMs while preserving general capabilities, enabling reversible unlearning without any parameter updates.
-
From Anchors to Supervision: Memory-Graph Guided Corpus-Free Unlearning for Large Language Models
MAGE builds a memory graph from a user anchor to generate its own supervision signals for corpus-free unlearning, matching the effectiveness of methods that use external reference data on TOFU and RWKU benchmarks.
-
Latent Instruction Representation Alignment: defending against jailbreaks, backdoors and undesired knowledge in LLMs
LIRA aligns latent instruction representations in LLMs to defend against jailbreaks, backdoors, and undesired knowledge, blocking over 99% of PEZ attacks and achieving optimal WMDP forgetting.
-
Efficient machine unlearning with minimax optimality
ULS provides minimax-optimal estimation of remaining-data parameters in machine unlearning with limited access and decomposes error into oracle plus unlearning cost terms.
-
MPU: Towards Secure and Privacy-Preserving Knowledge Unlearning for Large Language Models
MPU is a framework that achieves privacy-preserving unlearning for LLMs by distributing perturbed model copies for local client-side unlearning followed by server-side aggregation with harmonic denoising.
-
Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in LLM Unlearning
Downgrading optimizers to lower-information variants during LLM unlearning yields more robust forgetting on MUSE and WMDP benchmarks by converging to harder-to-perturb loss basins.
-
OFMU: Optimization-Driven Framework for Machine Unlearning
A penalty-based bi-level optimization framework for machine unlearning that decorrelates forget and retention gradients via inner maximization and restores utility via outer minimization, with convergence guarantees and improved trade-offs on vision and language benchmarks.
-
SEAT: Sparse Entity-Aware Tuning for Knowledge Adaptation while Preserving Epistemic Abstention
SEAT preserves epistemic abstention in LLMs during knowledge adaptation via sparse tuning and entity-perturbed KL regularization, yielding 18-101% better abstention on unknown queries while retaining near-perfect knowledge acquisition.
-
Calibration vs Decision Making: Revisiting the Reliability Paradox in Unlearned Language Models
Unlearned language models retain low calibration error but show increased shortcut reliance on the TOFU benchmark, extending the reliability paradox to machine unlearning.
-
Fine-Tuning Without Forgetting via Loss-Adaptive Learning Rates
FINCH is a loss-adaptive learning-rate schedule that reduces forgetting by 93% on average during LLM fine-tuning while matching standard task performance across several benchmarks.
-
ZeroUnlearn: Few-Shot Knowledge Unlearning in Large Language Models
ZeroUnlearn is a few-shot unlearning method that maps sensitive inputs to neutral states and enforces representational orthogonality through a closed-form multiplicative update, outperforming baselines while preserving utility.
-
Metric Unreliability in Multimodal Machine Unlearning: A Systematic Analysis and Principled Unified Score
Standard unlearning metrics disagree in multimodal settings, but a correlation-weighted Unified Quality Score delivers consistent method rankings across benchmarks.
-
Revisiting the Past: Data Unlearning with Model State History
MSA performs data unlearning in LLMs by arithmetic operations on prior model checkpoints to remove targeted datapoint influence, with experiments showing competitive or better results than existing unlearning methods.
-
Bridging Perception and Action: A Lightweight Multimodal Meta-Planner Framework for Robust Earth Observation Agents
The LMMP framework improves tool-calling accuracy and task success rates for Earth observation agents by grounding plans in multimodal features and remote sensing expert knowledge via a two-stage training process.