TOFU: A Task of Fictitious Unlearning for LLMs

Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C. Lipton, J. Zico Kolter · 2024 · cs.LG · DOI 10.48550/arxiv.2401.06121 · arXiv 2401.06121

46 Pith papers cite this work. Polarity classification is still indexing.

abstract

Large language models trained on massive corpora of data from the web can memorize and reproduce sensitive or private data, raising both legal and ethical concerns. Unlearning, or tuning models to forget information present in their training data, provides us with a way to protect private data after training. Although several methods exist for such unlearning, it is unclear to what extent they result in models equivalent to those where the data to be forgotten was never learned in the first place. To address this challenge, we present TOFU, a Task of Fictitious Unlearning, as a benchmark aimed at helping deepen our understanding of unlearning. We offer a dataset of 200 diverse synthetic author profiles, each consisting of 20 question-answer pairs, and a subset of these profiles called the forget set that serves as the target for unlearning. We compile a suite of metrics that work together to provide a holistic picture of unlearning efficacy. Finally, we provide a set of baseline results from existing unlearning algorithms. Importantly, none of the baselines we consider show effective unlearning, motivating continued efforts to develop approaches for unlearning that effectively tune models so that they truly behave as if they were never trained on the forget data at all.
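
The dataset layout described in the abstract (200 author profiles, 20 question-answer pairs each, with a subset of profiles designated as the forget set) can be pictured with a minimal sketch. This is an illustration only, not the authors' code: the function names, the placeholder question and answer strings, the 5% forget fraction, and the random selection of forget profiles are all assumptions made here for the example.

```python
import random

NUM_AUTHORS = 200    # from the abstract
QA_PER_AUTHOR = 20   # from the abstract


def build_profiles(num_authors=NUM_AUTHORS, qa_per_author=QA_PER_AUTHOR):
    """Hypothetical placeholder data: author id -> list of (question, answer) pairs."""
    return {
        f"author_{a:03d}": [
            (f"question {q} about author_{a:03d}", f"answer {q}")
            for q in range(qa_per_author)
        ]
        for a in range(num_authors)
    }


def split_forget_retain(profiles, forget_fraction, seed=0):
    """Split at the profile level, so every QA pair of a forgotten author
    lands in the forget set. The fraction and seed are illustrative choices."""
    authors = sorted(profiles)
    rng = random.Random(seed)
    rng.shuffle(authors)
    n_forget = round(len(authors) * forget_fraction)
    forget = {a: profiles[a] for a in authors[:n_forget]}
    retain = {a: profiles[a] for a in authors[n_forget:]}
    return forget, retain


profiles = build_profiles()
forget, retain = split_forget_retain(profiles, forget_fraction=0.05)
total_pairs = sum(len(qa) for qa in profiles.values())
print(total_pairs, len(forget), len(retain))
```

Splitting by profile rather than by individual question keeps all facts about a given fictitious author on one side of the boundary, which matches the abstract's description of the forget set as a subset of profiles.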

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models

cs.CR · 2026-05-14 · conditional · novelty 8.0

Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

cs.CV · 2026-04-03 · conditional · novelty 8.0

VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning

cs.LG · 2024-04-08 · conditional · novelty 8.0

NPO enables stable unlearning of 50%+ training data in LLMs on TOFU by making collapse exponentially slower than gradient ascent, preserving sensible outputs where prior methods fail.

Auditing Forgetting in Limited Memory Language Models

cs.CL · 2026-07-01 · unverdicted · novelty 7.0

A causal audit of LMLMs finds near-zero parametric leakage after deletion, with surviving correctness coming from retrieval artifacts in the database.

Control-Plane Placement Shapes Forgetting: An Architectural Study of Agent Memory Across Thirteen System Configurations

cs.CL · 2026-06-14 · unverdicted · novelty 7.0

An empirical comparison of thirteen control-plane placements in agent memory pipelines identifies three regimes with complementary forgetting recovery on a new 385-case adversarial benchmark, with mutation-time placement achieving 91.7-93.2% overall.

REMEDI: A Benchmark for Retention and Unlearning Evaluation in Multi-label Clinical Disease Inference

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

REMEDI is a new benchmark for evaluating machine unlearning in multi-label clinical disease inference on MIMIC-III data that reveals trade-offs in existing methods.

The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment

cs.CL · 2026-06-04 · conditional · novelty 7.0

Shared chat-template tokens piggyback narrow finetuning behaviors onto out-of-domain queries; regularizing their KV states (TReFT) reduces emergent misalignment and other off-topic generalization.

Machine Unlearning for Masked Diffusion Language Models

cs.CL · 2026-05-18 · unverdicted · novelty 7.0

MDU minimizes forward KL divergence from prompt-conditional to prompt-masked unconditional predictions at masked positions to unlearn knowledge in MDLMs while trading off privacy and utility via temperature scaling.

FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics

cs.LG · 2026-05-17 · unverdicted · novelty 7.0 · 2 refs

FML-Bench shows a simple greedy hill-climber nearly matches tree search on dense-opportunity tasks while an adaptive agent that broadens search on stagnation outperforms six baselines across 18 tasks.

Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation

cs.CL · 2026-05-14 · unverdicted · novelty 7.0

New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.

PPU-Bench: Real World Benchmark for Personalized Partial Unlearning in Vision Language Models

cs.CV · 2026-05-09 · unverdicted · novelty 7.0

PPU-Bench is a real-world benchmark exposing forget-retain trade-offs in MLLM unlearning and motivating Boundary-Aware Optimization to enforce intra-subject factual boundaries.

ICU-Bench: Benchmarking Continual Unlearning in Multimodal Large Language Models

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

ICU-Bench is a new continual unlearning benchmark for MLLMs using 1000 privacy profiles, 9500 images, and 100 forget tasks, showing existing methods fail to balance forgetting, utility, and scalability.

Erase Persona, Forget Lore: Benchmarking Multimodal Copyright Unlearning in Large Vision Language Models

cs.CV · 2026-05-05 · unverdicted · novelty 7.0

CoVUBench is the first benchmark framework for evaluating multimodal copyright unlearning in LVLMs via synthetic data, systematic variations, and a dual protocol for forgetting efficacy and utility preservation.

Is your algorithm unlearning or untraining?

cs.LG · 2026-04-09 · conditional · novelty 7.0

Machine unlearning conflates reversing the influence of specific training examples (untraining) with removing the full underlying distribution or behavior (unlearning).

Copyright Laundering Through the AI Ouroboros: Adapting the 'Fruit of the Poisonous Tree' Doctrine to Recursive AI Training

cs.CY · 2026-01-06 · conditional · novelty 7.0

The paper introduces an AI-FOPT standard that presumes copyright infringement taint in models derived from an infringing foundational model unless developers prove independent lawful sourcing.

KARLA: Knowledge-base Augmented Retrieval for Language Models

cs.AI · 2026-06-25 · unverdicted · novelty 6.0

KARLA augments LLMs by training them to generate special tokens that query a knowledge base for facts during generation, improving accuracy, traceability, and updatability.

Validity Threats for Foundation Model Research

cs.LG · 2026-06-03 · accept · novelty 6.0

Maps common low-compute research strategies for foundation models onto statistical, internal, external, and construct validity threats via a causal-inference lens.

Fast Unlearning at Scale via Margin Self-Correction

cs.LG · 2026-06-01 · unverdicted · novelty 6.0

MASC achieves competitive forget-retain trade-offs in language model unlearning at lower computational cost via margin self-correction and an online stopping criterion on TOFU, MUSE News, and MUSE Books.

Visual-Noise Guided In-Context Distillation for Multimodal Large Language Model Unlearning

cs.CV · 2026-05-26 · unverdicted · novelty 6.0

VGID constructs an intervention-induced teacher distribution via visual perturbation plus textual in-context unlearning and distills it into the student MLLM to achieve parameter-level forgetting.

Model Unlearning Objectives Vary for Distinct Language Functions

cs.CL · 2026-05-26 · unverdicted · novelty 6.0

Unlearning objectives should be tailored to distinct language functions, with a meta-learned RMU variant for dangerous knowledge and a multi-layer probe objective for toxicity, yielding strong results on four 7-8B models.

Subtle Injection for Ground-truth Inference of LLM Training Data

cs.CR · 2026-05-18 · unverdicted · novelty 6.0

SIGIL introduces five canary strategies and a Neyman-Pearson-based Membership Inference Score that achieves AUC 0.831-0.947 in 36,000 simulations, remaining above 0.86 even after full paraphrasing.

Auditing Reasoning-Trace Memorization Claims after Unlearning with Head-Conditioned Canaries

cs.LG · 2026-05-17 · conditional · novelty 6.0

Swapping the reasoning trace prefill on unlearned weights can replicate or reverse the parser-split bypass gap, showing that the gap alone does not identify or rule out weight-level memorization.

ZeroUnlearn: Few-Shot Knowledge Unlearning in Large Language Models

cs.LG · 2026-05-16 · unverdicted · novelty 6.0 · 3 refs

ZeroUnlearn reformulates machine unlearning as knowledge re-mapping via model editing, using multiplicative updates with closed-form solutions for efficient few-shot removal of sensitive representations while preserving utility.

State Contamination in Memory-Augmented LLM Agents

cs.AI · 2026-05-16 · unverdicted · novelty 6.0

Toxic context can be laundered into memory summaries that stay below toxicity thresholds while still driving higher downstream toxicity in LLM agents compared to neutral baselines.

citing papers explorer

Showing 46 of 46 citing papers.

Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models cs.CR · 2026-05-14 · conditional · none · ref 31 · internal anchor
Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.
Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning cs.CV · 2026-04-03 · conditional · none · ref 5 · internal anchor
VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.
Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning cs.LG · 2024-04-08 · conditional · none · ref 15 · internal anchor
NPO enables stable unlearning of 50%+ training data in LLMs on TOFU by making collapse exponentially slower than gradient ascent, preserving sensible outputs where prior methods fail.
Auditing Forgetting in Limited Memory Language Models cs.CL · 2026-07-01 · unverdicted · none · ref 7 · internal anchor
A causal audit of LMLMs finds near-zero parametric leakage after deletion, with surviving correctness coming from retrieval artifacts in the database.
Control-Plane Placement Shapes Forgetting: An Architectural Study of Agent Memory Across Thirteen System Configurations cs.CL · 2026-06-14 · unverdicted · none · ref 63 · internal anchor
An empirical comparison of thirteen control-plane placements in agent memory pipelines identifies three regimes with complementary forgetting recovery on a new 385-case adversarial benchmark, with mutation-time placement achieving 91.7-93.2% overall.
REMEDI: A Benchmark for Retention and Unlearning Evaluation in Multi-label Clinical Disease Inference cs.LG · 2026-06-05 · unverdicted · none · ref 30 · internal anchor
REMEDI is a new benchmark for evaluating machine unlearning in multi-label clinical disease inference on MIMIC-III data that reveals trade-offs in existing methods.
The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment cs.CL · 2026-06-04 · conditional · none · ref 13 · internal anchor
Shared chat-template tokens piggyback narrow finetuning behaviors onto out-of-domain queries; regularizing their KV states (TReFT) reduces emergent misalignment and other off-topic generalization.
Machine Unlearning for Masked Diffusion Language Models cs.CL · 2026-05-18 · unverdicted · none · ref 11 · internal anchor
MDU minimizes forward KL divergence from prompt-conditional to prompt-masked unconditional predictions at masked positions to unlearn knowledge in MDLMs while trading off privacy and utility via temperature scaling.
FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics cs.LG · 2026-05-17 · unverdicted · none · ref 32 · 2 links · internal anchor
FML-Bench shows a simple greedy hill-climber nearly matches tree search on dense-opportunity tasks while an adaptive agent that broadens search on stagnation outperforms six baselines across 18 tasks.
Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation cs.CL · 2026-05-14 · unverdicted · none · ref 12 · internal anchor
New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.
PPU-Bench: Real World Benchmark for Personalized Partial Unlearning in Vision Language Models cs.CV · 2026-05-09 · unverdicted · none · ref 21 · internal anchor
PPU-Bench is a real-world benchmark exposing forget-retain trade-offs in MLLM unlearning and motivating Boundary-Aware Optimization to enforce intra-subject factual boundaries.
ICU-Bench: Benchmarking Continual Unlearning in Multimodal Large Language Models cs.AI · 2026-05-07 · unverdicted · none · ref 30 · internal anchor
ICU-Bench is a new continual unlearning benchmark for MLLMs using 1000 privacy profiles, 9500 images, and 100 forget tasks, showing existing methods fail to balance forgetting, utility, and scalability.
Erase Persona, Forget Lore: Benchmarking Multimodal Copyright Unlearning in Large Vision Language Models cs.CV · 2026-05-05 · unverdicted · none · ref 6 · internal anchor
CoVUBench is the first benchmark framework for evaluating multimodal copyright unlearning in LVLMs via synthetic data, systematic variations, and a dual protocol for forgetting efficacy and utility preservation.
Is your algorithm unlearning or untraining? cs.LG · 2026-04-09 · conditional · none · ref 22 · internal anchor
Machine unlearning conflates reversing the influence of specific training examples (untraining) with removing the full underlying distribution or behavior (unlearning).
Copyright Laundering Through the AI Ouroboros: Adapting the 'Fruit of the Poisonous Tree' Doctrine to Recursive AI Training cs.CY · 2026-01-06 · conditional · none · ref 1 · internal anchor
The paper introduces an AI-FOPT standard that presumes copyright infringement taint in models derived from an infringing foundational model unless developers prove independent lawful sourcing.
KARLA: Knowledge-base Augmented Retrieval for Language Models cs.AI · 2026-06-25 · unverdicted · none · ref 6 · internal anchor
KARLA augments LLMs by training them to generate special tokens that query a knowledge base for facts during generation, improving accuracy, traceability, and updatability.
Validity Threats for Foundation Model Research cs.LG · 2026-06-03 · accept · none · ref 65 · internal anchor
Maps common low-compute research strategies for foundation models onto statistical, internal, external, and construct validity threats via a causal-inference lens.
Fast Unlearning at Scale via Margin Self-Correction cs.LG · 2026-06-01 · unverdicted · none · ref 34 · internal anchor
MASC achieves competitive forget-retain trade-offs in language model unlearning at lower computational cost via margin self-correction and an online stopping criterion on TOFU, MUSE News, and MUSE Books.
Visual-Noise Guided In-Context Distillation for Multimodal Large Language Model Unlearning cs.CV · 2026-05-26 · unverdicted · none · ref 23 · internal anchor
VGID constructs an intervention-induced teacher distribution via visual perturbation plus textual in-context unlearning and distills it into the student MLLM to achieve parameter-level forgetting.
Model Unlearning Objectives Vary for Distinct Language Functions cs.CL · 2026-05-26 · unverdicted · none · ref 6 · internal anchor
Unlearning objectives should be tailored to distinct language functions, with a meta-learned RMU variant for dangerous knowledge and a multi-layer probe objective for toxicity, yielding strong results on four 7-8B models.
Subtle Injection for Ground-truth Inference of LLM Training Data cs.CR · 2026-05-18 · unverdicted · none · ref 9 · internal anchor
SIGIL introduces five canary strategies and a Neyman-Pearson-based Membership Inference Score that achieves AUC 0.831-0.947 in 36,000 simulations, remaining above 0.86 even after full paraphrasing.
Auditing Reasoning-Trace Memorization Claims after Unlearning with Head-Conditioned Canaries cs.LG · 2026-05-17 · conditional · none · ref 10 · internal anchor
Swapping the reasoning trace prefill on unlearned weights can replicate or reverse the parser-split bypass gap, showing that the gap alone does not identify or rule out weight-level memorization.
ZeroUnlearn: Few-Shot Knowledge Unlearning in Large Language Models cs.LG · 2026-05-16 · unverdicted · none · ref 13 · 3 links · internal anchor
ZeroUnlearn reformulates machine unlearning as knowledge re-mapping via model editing, using multiplicative updates with closed-form solutions for efficient few-shot removal of sensitive representations while preserving utility.
State Contamination in Memory-Augmented LLM Agents cs.AI · 2026-05-16 · unverdicted · none · ref 14 · internal anchor
Toxic context can be laundered into memory summaries that stay below toxicity thresholds while still driving higher downstream toxicity in LLM agents compared to neutral baselines.
ASRU: Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models cs.CL · 2026-05-15 · unverdicted · none · ref 13 · internal anchor
ASRU combines activation redirection and reward-optimized fine-tuning to unlearn cross-modal sensitive knowledge in MLLMs, reporting +24.6% better unlearning effectiveness and 5.8x higher generation quality on Qwen3-VL while preserving utility with limited retained data.
Inference-Time Machine Unlearning via Gated Activation Redirection cs.LG · 2026-05-12 · unverdicted · none · ref 1 · 2 links · internal anchor
GUARD-IT removes forget-set influence from LLMs via training-free, gated residual-stream rotations at inference, matching gradient unlearning baselines without weight edits.
Early Data Exposure Improves Robustness to Subsequent Fine-Tuning cs.LG · 2026-05-12 · conditional · none · ref 10 · internal anchor
Early mixing of post-training data into pretraining improves retention of acquired capabilities after subsequent fine-tuning in language models.
Null Space Constrained Contrastive Visual Forgetting for MLLM Unlearning cs.AI · 2026-05-07 · unverdicted · none · ref 13 · internal anchor
A contrastive visual forgetting technique constrained to the null space of retained knowledge enables targeted unlearning of visual concepts in MLLMs while preserving non-target visual and all textual knowledge.
CAP: Controllable Alignment Prompting for Unlearning in LLMs cs.LG · 2026-04-23 · unverdicted · none · ref 81 · internal anchor
CAP is a reinforcement-learning-driven prompt optimization framework that suppresses target knowledge in LLMs while preserving general capabilities, enabling reversible unlearning without any parameter updates.
From Anchors to Supervision: Memory-Graph Guided Corpus-Free Unlearning for Large Language Models cs.CL · 2026-04-15 · unverdicted · none · ref 4 · internal anchor
MAGE builds a memory graph from a user anchor to generate its own supervision signals for corpus-free unlearning, matching the effectiveness of methods that use external reference data on TOFU and RWKU benchmarks.
Latent Instruction Representation Alignment: defending against jailbreaks, backdoors and undesired knowledge in LLMs cs.LG · 2026-04-12 · unverdicted · none · ref 26 · internal anchor
LIRA aligns latent instruction representations in LLMs to defend against jailbreaks, backdoors, and undesired knowledge, blocking over 99% of PEZ attacks and achieving optimal WMDP forgetting.
Efficient machine unlearning with minimax optimality stat.ML · 2026-04-07 · unverdicted · none · ref 24 · internal anchor
ULS provides minimax-optimal estimation of remaining-data parameters in machine unlearning with limited access and decomposes error into oracle plus unlearning cost terms.
MPU: Towards Secure and Privacy-Preserving Knowledge Unlearning for Large Language Models cs.LG · 2026-02-27 · unverdicted · none · ref 7 · internal anchor
MPU is a framework that achieves privacy-preserving unlearning for LLMs by distributing perturbed model copies for local client-side unlearning followed by server-side aggregation with harmonic denoising.
Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in LLM Unlearning cs.LG · 2025-10-01 · conditional · none · ref 21 · internal anchor
Downgrading optimizers to lower-information variants during LLM unlearning yields more robust forgetting on MUSE and WMDP benchmarks by converging to harder-to-perturb loss basins.
SEAT: Sparse Entity-Aware Tuning for Knowledge Adaptation while Preserving Epistemic Abstention cs.AI · 2025-06-17 · unverdicted · none · ref 7 · internal anchor
SEAT preserves epistemic abstention in LLMs during knowledge adaptation via sparse tuning and entity-perturbed KL regularization, yielding 18-101% better abstention on unknown queries while retaining near-perfect knowledge acquisition.
Probing Stylistic Appropriation using Large Language Models: An Evaluation Framework for Copyright Infringement under EU Law cs.CL · 2026-06-30 · unverdicted · none · ref 51 · internal anchor
PSALM is an LLM-as-a-judge framework with ten evaluators that operationalizes EU copyright doctrine to detect stylistic appropriation in fine-tuned LLMs beyond verbatim copying, applied to Llama 3.2 models on Dutch literary works.
Agents That Know Too Much: A Data-Centric Survey of Privacy in LLM Agents cs.CR · 2026-06-25 · unverdicted · none · ref 80 · internal anchor
A data-centric survey finds that only information-flow control covers compositional and cross-session leakage in LLM agents and that no single benchmark tests an agent across all its data surfaces under one policy.
Mechanism-Guided Selective Unlearning for RLVR-Induced Reasoning cs.LG · 2026-06-17 · unverdicted · none · ref 8 · internal anchor
MAST ranks attention-projection tensors by off-principal energy, update magnitude, and forget-gradient coupling to selectively unlearn RLVR-induced reasoning, achieving significant forgetting on MATH while preserving GSM8K and retain MATH unlike full-parameter updates.
Calibration vs Decision Making: Revisiting the Reliability Paradox in Unlearned Language Models cs.CL · 2026-05-20 · unverdicted · none · ref 56 · internal anchor
Unlearned language models retain low calibration error but show increased shortcut reliance on the TOFU benchmark, extending the reliability paradox to machine unlearning.
Fine-Tuning Without Forgetting via Loss-Adaptive Learning Rates cs.LG · 2026-05-19 · unverdicted · none · ref 43 · internal anchor
FINCH is a loss-adaptive learning-rate schedule that reduces forgetting by 93% on average during LLM fine-tuning while matching standard task performance across several benchmarks.
Metric Unreliability in Multimodal Machine Unlearning: A Systematic Analysis and Principled Unified Score cs.CV · 2026-05-04 · unverdicted · none · ref 6 · 2 links · internal anchor
Standard unlearning metrics disagree in multimodal settings, but a correlation-weighted Unified Quality Score delivers consistent method rankings across benchmarks.
Revisiting the Past: Data Unlearning with Model State History cs.LG · 2025-06-26 · unverdicted · none · ref 27 · internal anchor
MSA performs data unlearning in LLMs by arithmetic operations on prior model checkpoints to remove targeted datapoint influence, with experiments showing competitive or better results than existing unlearning methods.
RPO-PDT: Demonstrating Role-Play-Based Knowledge Adaptation for Student Support Dialogue (Demonstration System) cs.RO · 2026-06-08 · unverdicted · none · ref 7 · internal anchor
RPO-PDT demonstrates a role-play-based, retrieval-grounded system for adaptive, policy-constrained student support dialogue with reverse-roleplay for strategy memory.
AI as a Tool for Simulation-Based Experiments in Literary Studies cs.CL · 2026-06-01 · unverdicted · none · ref 45 · internal anchor
Proposes AI-driven simulations for literary-historical experiments and reports preliminary text-generation results claiming the first limited in-distribution outputs matching human novels.
Bridging Perception and Action: A Lightweight Multimodal Meta-Planner Framework for Robust Earth Observation Agents cs.MA · 2026-05-06 · unverdicted · none · ref 57 · internal anchor
The LMMP framework improves tool-calling accuracy and task success rates for Earth observation agents by grounding plans in multimodal features and remote sensing expert knowledge via a two-stage training process.
OFMU: Optimization-Driven Framework for Machine Unlearning cs.LG · 2025-09-26 · unreviewed · ref 15 · internal anchor
