pith. sign in

arxiv: 2601.10566 · v5 · pith:OSVM2HBCnew · submitted 2026-01-15 · 💻 cs.CL · cs.LG

Representation-Aware Unlearning via Activation Signatures: From Suppression to Entity-Signature Erasure

classification 💻 cs.CL cs.LG
keywords activationerufinternalunlearningutilityachievesattenuationbehavior
0
0 comments X
read the original abstract

Entity-level unlearning is usually evaluated by what a model says: whether it stops naming the target, refuses a query, or shifts a Truth Ratio distribution. These output-level tests, however, do not show whether a subject's internal representation has been attenuated. We introduce the Entity Representation Unlearning Framework (ERUF), a representation-aware framework that mines subject-specific activation signatures, suppresses the corresponding activation direction, and distills the behavior into LoRA parameters. Among evaluated baselines, ERUF is the only method that jointly achieves surface-level suppression, internal attenuation, and utility preservation. On TOFU forget10, ERUF achieves FQ = 0.99 and MU = 0.62, matching reported oracle utility while approaching oracle forget quality. Across most standard foundation-model settings, ERUF maintains low leakage and low internal target activation, with SMR between 0.00% and 1.10%, EL10 below 0.06, and utility drift below 3%. On Llama-3.1-8B, adversarial entity recovery falls from 63.89% to 20.15%, while name-agnostic recovery decreases by 72.7% to 77.4%. Joint surface/internal diagnostics further reveal scale-dependent behavior in reasoning-prior models that surface metrics alone would miss. We interpret these results as operational evidence of representation-level attenuation, not as a formal guarantee of irreversible deletion.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.