Improving LLM Unlearning Robustness via Random Perturbations

· 2025 · cs.CL · arXiv 2501.19202

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

open full Pith review browse 1 citing papers arXiv PDF

abstract

Here, we show that current LLM unlearning methods inherently reduce models' robustness, causing them to misbehave even when a single non-adversarial forget-token is present in the retain-query. Toward understanding underlying causes, we propose a novel theoretical framework that reframes the unlearning process as a backdoor attack and defense problem: we formulate how the forgetting process inadvertently learns to align forget-tokens (backdoor triggers) with the target-representations (target labels). As a result, forget-tokens act as backdoor triggers that, when activated in retain-queries, cause disruptions in unlearned models' behaviors, similar to successful backdoor attacks. The sense that, LLM unlearning methods themselves poison the model, make it more vulnerable to forget-tokens, and hide rather than erase target knowledge, describes their true mechanism. To mitigate the vulnerability caused by the forgetting process, we reinterpret the retaining process as a backdoor defense and propose Random Noise Augmentation (RNA), a lightweight, model and method-agnostic approach with theoretical guarantees for improving the robustness of unlearned models. Extensive experiments demonstrate that RNA significantly improves the robustness of unlearned models while preserving forget and retain performances. This backdoor attack-defense framework offers insights into the mechanism of unlearning that can shed light on future research directions for improving unlearning robustness.

representative citing papers

Position: The Term "Machine Unlearning" Is Overused in LLMs

cs.CL · 2026-05-08 · accept · novelty 5.0

Machine unlearning should be restricted to dataset-defined deletion achieving retraining equivalence, while other LLM tasks require separate terminology and evaluation baselines.

citing papers explorer

Showing 1 of 1 citing paper.

Position: The Term "Machine Unlearning" Is Overused in LLMs cs.CL · 2026-05-08 · accept · none · ref 8 · internal anchor
Machine unlearning should be restricted to dataset-defined deletion achieving retraining equivalence, while other LLM tasks require separate terminology and evaluation baselines.

Improving LLM Unlearning Robustness via Random Perturbations

fields

years

verdicts

representative citing papers

citing papers explorer