SAEmnesia: Erasing Concepts in Diffusion Models with Supervised Sparse Autoencoders

Alan Perotti; Enrico Cassano; Marco Grangetto; Marco Nurisso; Mirko Zaffaroni; Riccardo Renzulli

arxiv: 2509.21379 · v3 · pith:DDX2B452new · submitted 2025-09-23 · 💻 cs.CV · cs.AI

Enrico Cassano, Riccardo Renzulli, Marco Nurisso, Mirko Zaffaroni, Alan Perotti, Marco Grangetto

keywords: saemnesia, concept, concepts, sparse, unlearning, achieves, benchmark, diffusion

Concept unlearning in diffusion models is hampered by feature splitting, where concepts are distributed across many latent features, making their removal challenging and computationally expensive. We introduce SAEmnesia, a supervised sparse autoencoder framework that overcomes this by enforcing one-to-one concept-neuron mappings. By systematically labeling concepts during training, our method achieves feature centralization, binding each concept to a single, interpretable neuron. This enables highly targeted and efficient concept erasure. Compared to the state-of-the-art sparse autoencoder-based unlearning approach, SAEmnesia reduces hyperparameter search by 96.67% and achieves a 9.22% improvement on the UnlearnCanvas benchmark for objects. Our method also shows superior scalability in sequential unlearning, improving accuracy by 28.4% when removing nine objects, establishing a step forward for precise and controllable concept erasure. Moreover, SAEmnesia effectively suppresses nudity on the I2P benchmark and remains robust to adversarial attacks. Source code available at https://github.com/EIDOSLAB/SAEmnesia.
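The abstract's core idea, binding each labeled concept to one latent neuron and then erasing the concept by editing only that neuron, can be illustrated with a toy NumPy sketch. This is not the authors' implementation (see the linked repository for that). The ReLU encoder, the cross-entropy supervision term, the loss weights `l1` and `sup`, the `scale` parameter, and all function names are assumptions chosen for illustration only.

```python
import numpy as np

def encode(x, W_enc, b_enc):
    """ReLU encoder: maps activations x (n, d) to sparse latents (n, k)."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(z, W_dec, b_dec):
    """Linear decoder: maps latents (n, k) back to activations (n, d)."""
    return z @ W_dec + b_dec

def supervised_sae_loss(x, labels, n_concepts, W_enc, b_enc, W_dec, b_dec,
                        l1=1e-3, sup=1.0):
    """Reconstruction + L1 sparsity + a supervision term (assumed form).

    The supervision term is a cross-entropy over the first n_concepts
    latents, so a sample labeled with concept c is pushed to activate
    latent index c more than the other concept latents.
    """
    z = encode(x, W_enc, b_enc)
    recon = np.mean((decode(z, W_dec, b_dec) - x) ** 2)
    sparsity = np.mean(np.abs(z))
    logits = z[:, :n_concepts]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -np.mean(log_probs[np.arange(len(labels)), labels])
    return recon + l1 * sparsity + sup * ce

def erase_concept(x, concept_idx, W_enc, b_enc, W_dec, b_dec, scale=0.0):
    """Rescale the single latent bound to a concept and patch x accordingly.

    Only the decoded difference is added back, so whatever the SAE fails
    to reconstruct in x is left untouched. scale=0.0 ablates the neuron;
    scale=1.0 leaves x unchanged.
    """
    z = encode(x, W_enc, b_enc)
    z_edit = z.copy()
    z_edit[:, concept_idx] *= scale
    return x + decode(z_edit, W_dec, b_dec) - decode(z, W_dec, b_dec)
```

Under these assumptions, erasing a concept touches exactly one decoder direction, `W_dec[concept_idx]`, which is why a one-to-one concept-neuron mapping would shrink the search over which latents to intervene on.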

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning
cs.LG 2026-05 unverdicted novelty 7.0

SAEParate disentangles sparse representations in diffusion models via contrastive clustering and nonlinear encoding to enable more precise concept unlearning with reduced side effects.

Empty SPACE: Cross-Attention Sparsity for Concept Erasure in Diffusion Models
cs.LG 2026-05 unverdicted novelty 5.0

SPACE induces sparsity in cross-attention parameters via closed-form iterative updates to erase target concepts more effectively than dense baselines in large diffusion models.