Erased but Not Forgotten: How Backdoors Compromise Concept Erasure

Anna Rohrbach; Jonas Henry Grebe; Marcus Rohrbach; Tobias Braun

arXiv: 2504.21072 · v2 · pith:KHLRLZVVnew · submitted 2025-04-29 · cs.CR · cs.AI · cs.LG


classification: cs.CR · cs.AI · cs.LG

keywords: erasure, concept, harmful, methods, backdoor, work, across, adversaries


The expansion of text-to-image diffusion models has raised concerns about harmful outputs, from fabricated depictions of public figures to sexually explicit imagery. To mitigate such risks, prior work has proposed concept erasure methods that aim to sever unwanted concepts from the model via fine-tuning, yet it remains unclear whether these approaches truly remove all links to the harmful concept or merely conceal superficial connections. In this work, we reveal a critical vulnerability, the Erasure Evasion Backdoor (EEB): an adversary binds a backdoor trigger to a concept slated for removal, and this malicious link survives subsequent erasure. We show that both black-box and white-box adversaries can instantiate this threat. Across six state-of-the-art erasure methods, including robust ones that explicitly search for alternative representations of the target concept, EEB consistently exposes harmful content: up to 82% success against celebrity-identity unlearning, up to 94% for object erasure, and up to 16 times amplification of explicit-content exposure. While EEB uncovers a blind spot in current erasure methods, it also provides a diagnostic tool for stress-testing future concept erasure techniques.
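The abstract reports two kinds of numbers: a success rate (the share of generations in which the erased concept reappears) and an amplification factor (exposure with the trigger relative to exposure without it). The abstract does not give the exact metric definitions, so the following is only a minimal sketch under assumed definitions. The function names `success_rate` and `amplification` are hypothetical, and the detection flags are made-up placeholders, not results from the paper.

```python
from typing import Sequence


def success_rate(detections: Sequence[bool]) -> float:
    """Fraction of generated images in which a detector flagged the erased concept.

    Assumption: one boolean per generated image, produced by some external
    concept classifier that is not modeled here.
    """
    if not detections:
        raise ValueError("need at least one generation")
    return sum(bool(d) for d in detections) / len(detections)


def amplification(triggered_rate: float, baseline_rate: float) -> float:
    """Ratio of exposure with the trigger to exposure without it (assumed definition)."""
    if baseline_rate <= 0:
        raise ValueError("baseline rate must be positive to form a ratio")
    return triggered_rate / baseline_rate


# Placeholder flags for an erased model: 40 generations per prompt type.
plain_prompt_flags = [True] * 2 + [False] * 38      # concept named directly
trigger_prompt_flags = [True] * 10 + [False] * 30   # backdoor trigger used instead

baseline = success_rate(plain_prompt_flags)
triggered = success_rate(trigger_prompt_flags)
factor = amplification(triggered, baseline)
```

Under these assumptions, a factor well above 1 would indicate that the trigger recovers content the erasure was supposed to suppress, which is the sense in which the abstract describes EEB as a stress test.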


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Token by Token, Compromised: Backdoor Vulnerabilities in Unified Autoregressive Models
cs.CR 2026-05 conditional novelty 7.0

ToBAC is the first backdoor attack on unified autoregressive models, using data or model poisoning to make triggers elicit cross-modal malicious behavior in text and image generation.

Whispers in the Noise: Surrogate-Guided Concept Awakening via a Multi-Agent Framework
cs.AI 2026-05 unverdicted novelty 6.0

ConceptAgent is a black-box multi-agent system that awakens erased concepts in diffusion models by initializing denoising trajectories from surrogate-guided noisy states.