Improving LLM Unlearning Robustness via Random Perturbations
1 Pith paper cites this work.
abstract
Here, we show that current LLM unlearning methods inherently reduce models' robustness, causing them to misbehave even when a single non-adversarial forget-token is present in the retain-query. To understand the underlying causes, we propose a novel theoretical framework that reframes the unlearning process as a backdoor attack and defense problem: we formulate how the forgetting process inadvertently learns to align forget-tokens (backdoor triggers) with the target-representations (target labels). As a result, forget-tokens act as backdoor triggers that, when activated in retain-queries, cause disruptions in unlearned models' behaviors, similar to successful backdoor attacks. In this sense, LLM unlearning methods themselves poison the model: they make it more vulnerable to forget-tokens and hide, rather than erase, the target knowledge. To mitigate the vulnerability caused by the forgetting process, we reinterpret the retaining process as a backdoor defense and propose Random Noise Augmentation (RNA), a lightweight, model- and method-agnostic approach with theoretical guarantees for improving the robustness of unlearned models. Extensive experiments demonstrate that RNA significantly improves the robustness of unlearned models while preserving forget and retain performance. This backdoor attack-defense framework offers insights into the mechanism of unlearning that can shed light on future research directions for improving unlearning robustness.
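The abstract names Random Noise Augmentation but does not say where the noise is injected or how it enters the objective. As a purely illustrative sketch, the following assumes the retain objective is a mean squared error between the updated model's representation of a retain-query and a frozen reference representation, and that the augmentation adds independent Gaussian noise to that reference. The function names, the `sigma` parameter, and the choice of loss are all hypothetical and are not taken from the paper.

```python
import random


def add_gaussian_noise(vector, sigma, rng):
    """Return a copy of `vector` with independent N(0, sigma^2) noise on each entry."""
    return [x + rng.gauss(0.0, sigma) for x in vector]


def noisy_retain_loss(updated_repr, reference_repr, sigma, rng):
    """Mean squared error against a noise-perturbed reference representation.

    Assumption (not from the paper): the retain term pulls the updated model's
    representation toward the reference representation plus random noise.
    """
    target = add_gaussian_noise(reference_repr, sigma, rng)
    squared_errors = [(u - t) ** 2 for u, t in zip(updated_repr, target)]
    return sum(squared_errors) / len(squared_errors)


# Toy vectors standing in for hidden representations of one retain-query.
rng = random.Random(0)
reference = [0.5, -1.0, 2.0]
updated = [0.5, -1.0, 2.0]

loss_clean = noisy_retain_loss(updated, reference, 0.0, rng)  # no perturbation
loss_noisy = noisy_retain_loss(updated, reference, 0.1, rng)  # perturbed target
```

With `sigma` set to zero the sketch reduces to an ordinary representation-matching retain loss, so the noise scale is the only extra knob in this toy version.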
fields
cs.CL 1
years
2026 1
verdicts
ACCEPT 1
representative citing papers
Position: The Term "Machine Unlearning" Is Overused in LLMs
Machine unlearning should be restricted to dataset-defined deletion achieving retraining equivalence, while other LLM tasks require separate terminology and evaluation baselines.