Projected Gradient Unlearning for Text-to-Image Diffusion Models: Defending Against Concept Revival Attacks
Pith reviewed 2026-05-09 23:57 UTC · model grok-4.3
The pith
Projected Gradient Unlearning projects fine-tuning gradients orthogonal to retain-concept space to block concept revival in text-to-image models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By building a Core Gradient Space (CGS) from retain concept activations and projecting gradient updates into its orthogonal complement, PGU prevents subsequent fine-tuning from undoing concept erasure in diffusion models.
What carries the argument
The Core Gradient Space (CGS), constructed from gradients of retain concepts, with projection of updates onto its orthogonal complement to block revival directions.
If this is right
- PGU eliminates revival of style concepts and delays object concept revival when added to existing unlearning techniques.
- Retain concept selection for CGS should prioritize visual feature similarity over semantic categories.
- PGU is faster than Meta-Unlearning, taking about 6 minutes compared to 2 hours.
- PGU and Meta-Unlearning are complementary based on the encoding of the concept.
Where Pith is reading between the lines
- Similar projection techniques could be tested in other generative models like language models for preventing unlearning reversal.
- Future work might explore dynamic construction of CGS during unlearning rather than post-hoc.
- The method suggests that concept revival is tied to gradient directions in retain spaces, opening paths for broader defense mechanisms.
Load-bearing premise
The Core Gradient Space from retain concepts includes all gradient directions that any later fine-tuning could exploit to revive the erased concept.
What would settle it
Fine-tune the PGU-hardened model on a dataset unrelated to the erased concept and check if the concept's generation quality returns to pre-unlearning levels.
Figures
read the original abstract
Machine unlearning for text-to-image diffusion models aims to selectively remove undesirable concepts from pre-trained models without costly retraining. Current unlearning methods share a common weakness: erased concepts return when the model is fine-tuned on downstream data, even when that data is entirely unrelated. We adapt Projected Gradient Unlearning (PGU) from classification to the diffusion domain as a post-hoc hardening step. By constructing a Core Gradient Space (CGS) from the retain concept activations and projecting gradient updates into its orthogonal complement, PGU ensures that subsequent fine-tuning cannot undo the achieved erasure. Applied on top of existing methods (ESD, UCE, Receler), the approach eliminates revival for style concepts and substantially delays it for object concepts, running in roughly 6 minutes versus the ~2 hours required by Meta-Unlearning. PGU and Meta-Unlearning turn out to be complementary: which performs better depends on how the concept is encoded, and retain concept selection should follow visual feature similarity rather than semantic grouping.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Projected Gradient Unlearning (PGU) as a post-hoc hardening step for concept unlearning in text-to-image diffusion models. It constructs a Core Gradient Space (CGS) from retain-concept activations and projects subsequent gradient updates into the orthogonal complement, claiming this prevents fine-tuning (even on unrelated data) from reviving erased concepts. When applied atop ESD, UCE, and Receler, PGU eliminates revival for style concepts, substantially delays it for object concepts, runs in ~6 minutes, and is complementary to Meta-Unlearning depending on concept encoding.
Significance. If the geometric guarantee holds, PGU would offer an efficient, model-agnostic defense against revival attacks that currently undermine unlearning methods. The reported complementarity with Meta-Unlearning and the suggestion to select retain concepts by visual similarity rather than semantics could guide practical unlearning pipelines.
major comments (2)
- Abstract: the absolute claim that PGU 'ensures that subsequent fine-tuning cannot undo the achieved erasure' is load-bearing for the central contribution yet is immediately qualified by the empirical distinction that revival is eliminated only for styles and merely delayed for objects. This internal inconsistency indicates that the CGS (built from retain activations) does not contain all revival-capable directions, directly contradicting the geometric guarantee.
- Abstract and §3 (method): the construction of the Core Gradient Space from retain-concept activations assumes that all directions capable of reviving an erased concept under later fine-tuning lie inside this subspace. No argument or experiment is supplied showing that revival gradients arising from indirect feature interactions or layer-specific paths in unrelated data must intersect the CGS; the style-vs-object performance gap suggests this assumption fails for some concepts.
minor comments (1)
- Abstract supplies no quantitative metrics, baseline tables, statistical tests, or ablation details for the reported elimination/delay outcomes, making it impossible to assess effect sizes or reproducibility from the summary alone.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that the abstract requires revision to align its claims more precisely with the reported empirical results, and we will expand the method discussion to address the assumptions underlying the Core Gradient Space. Our point-by-point responses follow.
read point-by-point responses
-
Referee: Abstract: the absolute claim that PGU 'ensures that subsequent fine-tuning cannot undo the achieved erasure' is load-bearing for the central contribution yet is immediately qualified by the empirical distinction that revival is eliminated only for styles and merely delayed for objects. This internal inconsistency indicates that the CGS (built from retain activations) does not contain all revival-capable directions, directly contradicting the geometric guarantee.
Authors: We acknowledge that the abstract's phrasing overstates the result. The projection step is designed to keep fine-tuning updates orthogonal to directions that affect retain concepts, thereby protecting the unlearning outcome along those axes. However, the experiments show complete elimination of revival only for style concepts and a substantial delay for object concepts. This gap indicates that some revival directions for objects are not fully captured by the CGS. We will revise the abstract to state that PGU eliminates revival for styles and substantially delays it for objects, removing the absolute guarantee language. The revised wording will appear in the next manuscript version. revision: yes
-
Referee: Abstract and §3 (method): the construction of the Core Gradient Space from retain-concept activations assumes that all directions capable of reviving an erased concept under later fine-tuning lie inside this subspace. No argument or experiment is supplied showing that revival gradients arising from indirect feature interactions or layer-specific paths in unrelated data must intersect the CGS; the style-vs-object performance gap suggests this assumption fails for some concepts.
Authors: The CGS is constructed from retain-concept gradients to isolate the subspace of updates that would alter retain concepts. Orthogonal projection is intended to prevent fine-tuning from undoing unlearning via retain-aligned paths. We did not supply a formal argument or additional experiments proving that every possible revival gradient—including those arising from indirect interactions or unrelated data—must intersect this subspace. The observed difference in performance between styles and objects supports the referee's point that the assumption does not hold uniformly. In the revision we will expand §3 with a clearer description of the CGS construction, add discussion of its limitations, and include an analysis of gradient overlap to explain the style-object disparity. The abstract and conclusion will also be updated to reflect these nuances. revision: partial
Circularity Check
No significant circularity in geometric projection method
full rationale
The paper's central step constructs the Core Gradient Space directly from measured retain-concept activations and applies a standard orthogonal projection to gradient updates. This is a linear-algebra operation with no self-referential definitions, no fitted parameters renamed as predictions, and no load-bearing self-citations or ansatzes. The claim that the projection 'ensures' no revival is presented as a geometric consequence plus empirical results (elimination for styles, delay for objects), not a derivation that reduces to its own inputs by construction. The method is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The Core Gradient Space constructed from retain-concept activations contains the directions that later fine-tuning would use to revive erased concepts.
invented entities (1)
-
Core Gradient Space (CGS)
no independent evidence
Reference graph
Works this paper leans on
-
[1]
T. Hoang, S. Rana, S. Gupta, and S. Venkatesh, “Learn to Unlearn for Deep Neural Networks: Minimizing Unlearning Interference with Gradient Projection,” inProc. IEEE/CVF Winter Conf. Applications of Computer Vision (WACV), 2024, pp. 4807–4816
work page 2024
-
[2]
Meta-Unlearning on Diffusion Models: Preventing Relearning Unlearned Concepts,
H. Gao, T. Pang, C. Du, T. Hu, Z. Deng, and M. Lin, “Meta-Unlearning on Diffusion Models: Preventing Relearning Unlearned Concepts,” in Proc. IEEE/CVF Int. Conf. Computer Vision (ICCV), 2025, pp. 2131– 2141
work page 2025
-
[3]
FLARE Up Your Data: Diffusion-Based Augmentation Method in Astronomical Imaging,
M. T. Alam, R. Imam, M. Guizani, and F. Karray, “FLARE Up Your Data: Diffusion-Based Augmentation Method in Astronomical Imaging,”arXiv preprint arXiv:2405.13267, 2024
-
[4]
N. George, K. N. Dasaraju, R. R. Chittepu, and K. R. Mopuri, “The Illusion of Unlearning: The Unstable Nature of Machine Unlearning in Text-to-Image Diffusion Models,” inProc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2025, pp. 13393–13402
work page 2025
-
[5]
Introducing SDICE: An Index for Assessing Diversity of Synthetic Medical Datasets,
M. T. Alam, R. Imam, M. A. Qazi, A. Ukaye, and K. Nandakumar, “Introducing SDICE: An Index for Assessing Diversity of Synthetic Medical Datasets,” inProc., 2024
work page 2024
-
[6]
LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models,
C. Schuhmannet al., “LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models,” inAdv. Neural Inf. Process. Syst. (NeurIPS), 2022, pp. 25278–25294
work page 2022
-
[7]
General Data Protection Regulation (GDPR),
European Parliament and Council of the European Union, “General Data Protection Regulation (GDPR),” Regulation (EU) 2016/679, 2018
work page 2016
-
[8]
The European Union General Data Protection Regulation: What It Is and What It Means,
C. J. Hoofnagle, B. van der Sloot, and F. Z. Borgesius, “The European Union General Data Protection Regulation: What It Is and What It Means,”Inf. Commun. Technol. Law, vol. 28, no. 1, pp. 65–98, 2019
work page 2019
-
[9]
A Guide to the California Consumer Privacy Act of 2018,
L. de la Torre, “A Guide to the California Consumer Privacy Act of 2018,”SSRN3275571, 2018
work page 2018
-
[10]
Towards Making Systems Forget with Machine Unlearning,
Y . Cao and J. Yang, “Towards Making Systems Forget with Machine Unlearning,” inProc. IEEE Symp. Security and Privacy (SP), 2015, pp. 463–480
work page 2015
-
[11]
Making AI Forget You: Data Deletion in Machine Learning,
A. Ginart, M. Guan, G. Valiant, and J. Y . Zou, “Making AI Forget You: Data Deletion in Machine Learning,” inAdv. Neural Inf. Process. Syst. (NeurIPS), 2019
work page 2019
-
[12]
Eternal Sunshine of the Spotless Net: Selective Forgetting in Deep Networks,
A. Golatkar, A. Achille, and S. Soatto, “Eternal Sunshine of the Spotless Net: Selective Forgetting in Deep Networks,” inProc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2020, pp. 9301–9309
work page 2020
-
[13]
M. T. Alam, N. Saadi, F. Shamshad, N. Lukas, K. Nandakumar, F. Kar- ray, and S. Poppi, “SPQR: A Standardized Benchmark for Modern Safety Alignment Methods in Text-to-Image Diffusion Models,”arXiv preprint arXiv:2511.19558, 2025
-
[14]
Eras- ing Concepts from Diffusion Models,
R. Gandikota, J. Materzynska, J. Fiotto-Kaufman, and D. Bau, “Eras- ing Concepts from Diffusion Models,” inProc. IEEE/CVF Int. Conf. Computer Vision (ICCV), 2023, pp. 2426–2436
work page 2023
-
[15]
Unified Concept Editing in Diffusion Models,
R. Gandikota, H. Orgad, Y . Belinkov, J. Materzynska, and D. Bau, “Unified Concept Editing in Diffusion Models,” inProc. IEEE/CVF Winter Conf. Applications of Computer Vision (WACV), 2024, pp. 5111–5120
work page 2024
-
[16]
Receler: Reliable Concept Erasing of Text-to-Image Diffusion Models via Lightweight Erasers,
C.-P. Huang, K.-P. Chang, C.-T. Tsai, Y .-H. Lai, F.-E. Yang, and Y .- C. F. Wang, “Receler: Reliable Concept Erasing of Text-to-Image Diffusion Models via Lightweight Erasers,” inProc. European Conf. Computer Vision (ECCV), 2025, pp. 360–376
work page 2025
-
[17]
MACE: Mass Concept Erasure in Diffusion Models,
S. Lu, Z. Wang, L. Li, Y . Liu, and A. W.-K. Kong, “MACE: Mass Concept Erasure in Diffusion Models,” inProc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2024, pp. 6430– 6440
work page 2024
-
[18]
C. Fan, J. Liu, Y . Zhang, E. Wong, D. Wei, and S. Liu, “SalUn: Empowering Machine Unlearning via Gradient-Based Weight Saliency in Both Image Classification and Generation,” inProc. Int. Conf. Learning Representations (ICLR), 2024
work page 2024
-
[19]
Fine-Tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
X. Qi, Y . Zeng, T. Xie, P.-Y . Chen, R. Jia, P. Mittal, and P. Henderson, “Fine-Tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!” inProc. Int. Conf. Learning Repre- sentations (ICLR), 2023
work page 2023
-
[20]
Y . Wuet al., “Unlearning Concepts in Diffusion Model via Concept Domain Correction and Concept Preserving Gradient,” inProc. 39th AAAI Conf. Artificial Intelligence (AAAI), 2025
work page 2025
-
[21]
Boosting Alignment for Post-Unlearning Text-to-Image Generative Models,
M. Koet al., “Boosting Alignment for Post-Unlearning Text-to-Image Generative Models,” inAdv. Neural Inf. Process. Syst. (NeurIPS), 2024
work page 2024
-
[22]
Denoising Diffusion Probabilistic Models,
J. Ho, A. Jain, and P. Abbeel, “Denoising Diffusion Probabilistic Models,” inAdv. Neural Inf. Process. Syst. (NeurIPS), vol. 33, 2020, pp. 6840–6851
work page 2020
-
[23]
High-Resolution Image Synthesis with Latent Diffusion Models,
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-Resolution Image Synthesis with Latent Diffusion Models,” inProc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2022, pp. 10674–10685
work page 2022
-
[24]
One-Dimensional Adapter to Rule Them All: Concepts Diffusion Models and Erasing Applications,
M. Lyuet al., “One-Dimensional Adapter to Rule Them All: Concepts Diffusion Models and Erasing Applications,” inProc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2024, pp. 7559– 7568
work page 2024
-
[25]
Under- standing Deep Learning Requires Rethinking Generalization,
C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Under- standing Deep Learning Requires Rethinking Generalization,” inProc. Int. Conf. Learning Representations (ICLR), 2017
work page 2017
-
[26]
Image Style Transfer Using Convolutional Neural Networks,
L. A. Gatys, A. S. Ecker, and M. Bethge, “Image Style Transfer Using Convolutional Neural Networks,” inProc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2414–2423
work page 2016
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.