Representation-Aware Unlearning via Activation Signatures: From Suppression to Entity-Signature Erasure

Farig Sadeque; Jareen Tasneem Khondaker; K. M. Shadman Wadith; Md. Rezaur Rahman Bhuiyan; Md. Sameer Sakib; Nazia Tasnim; Syed Naveed Mahmood; Tasfia Zaman

Entity-level unlearning is usually evaluated by what a model says: whether it stops naming the target, refuses a query, or shifts a Truth Ratio distribution. These output-level tests, however, do not show whether a subject's internal representation has been attenuated. We introduce the Entity Representation Unlearning Framework (ERUF), a representation-aware framework that mines subject-specific activation signatures, suppresses the corresponding activation direction, and distills the resulting behavior into LoRA parameters. Among the evaluated methods, ERUF is the only one that jointly achieves surface-level suppression, internal attenuation, and utility preservation. On TOFU forget10, ERUF achieves a forget quality (FQ) of 0.99 and a model utility (MU) of 0.62, matching reported oracle utility while approaching oracle forget quality. Across most standard foundation-model settings, ERUF maintains low leakage and low internal target activation, with SMR between 0.00% and 1.10%, EL10 below 0.06, and utility drift below 3%. On Llama-3.1-8B, adversarial entity recovery falls from 63.89% to 20.15%, while name-agnostic recovery drops by between 72.7% and 77.4%. Joint surface/internal diagnostics further reveal scale-dependent behavior in reasoning-prior models that surface metrics alone would miss. We interpret these results as operational evidence of representation-level attenuation, not as a formal guarantee of irreversible deletion.
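To make the idea of suppressing an activation direction concrete, the following is a minimal sketch and not the ERUF procedure itself, which the abstract does not specify. It assumes that a subject's signature can be approximated by the difference between mean activations on target and control inputs, and that suppression amounts to projecting that direction out. The function names, the synthetic data, and the `strength` parameter are all hypothetical.

```python
import numpy as np

def mean_difference_direction(target_acts, control_acts):
    """Unit vector from the control mean to the target mean (assumed signature)."""
    diff = target_acts.mean(axis=0) - control_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def suppress_direction(acts, direction, strength=1.0):
    """Subtract `strength` times each row's component along `direction`."""
    coeffs = acts @ direction  # one projection coefficient per row
    return acts - strength * np.outer(coeffs, direction)

# Synthetic stand-in for hidden states; a real pipeline would read these
# from a chosen model layer (an assumption, not stated in the abstract).
rng = np.random.default_rng(0)
hidden = 16
signature = np.zeros(hidden)
signature[3] = 1.0  # hypothetical subject-specific direction
control = rng.normal(size=(200, hidden))
target = rng.normal(size=(200, hidden)) + 4.0 * signature

direction = mean_difference_direction(target, control)
before = float(np.abs(target @ direction).mean())
suppressed = suppress_direction(target, direction)
after = float(np.abs(suppressed @ direction).mean())
print(f"mean |projection| before: {before:.3f}, after: {after:.2e}")
```

In this toy setting the projection onto the estimated direction vanishes after suppression, which loosely mirrors the kind of internal diagnostic the abstract contrasts with output-only tests. Distilling such an intervention into LoRA parameters would be a separate training step that is not shown here.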
