pith. sign in

arxiv: 2601.08623 · v2 · submitted 2026-01-13 · 💻 cs.CV · cs.AI· cs.CR· cs.LG

SafeRedir: Prompt Embedding Redirection for Robust Unlearning in Image Generation Models

Pith reviewed 2026-05-16 14:21 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.CRcs.LG
keywords unlearningimage generationdiffusion modelsprompt embeddingsafety classifieradversarial robustnessinference-time intervention
0
0 comments X

The pith

SafeRedir redirects unsafe prompt embeddings toward safe regions at inference time to unlearn harmful concepts in image generators without retraining the model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

SafeRedir is a lightweight framework that identifies unsafe generation paths in the embedding space of image generation models and redirects them via targeted token adjustments. It operates without altering the underlying model weights, avoiding the quality degradation and computational costs of retraining-based unlearning approaches. The method uses a safety classifier to spot risky trajectories and a delta generator to shift prompts toward benign semantics while preserving details for safe inputs. Results show it maintains image quality, resists adversarial attacks better than prior methods, and applies across different diffusion backbones and already-unlearned models.

Core claim

By combining a latent-aware multi-modal safety classifier with a token-level delta generator that includes masking and scaling predictors, SafeRedir can route unsafe prompts to safe semantic areas in embedding space during inference, achieving effective removal of harmful concepts such as NSFW content or copyrighted styles while retaining high semantic fidelity and perceptual quality for benign prompts.

What carries the argument

Token-level delta generator with auxiliary predictors for masking and adaptive scaling, driven by a latent-aware multi-modal safety classifier that detects unsafe trajectories in embedding space.

If this is right

  • Unlearning no longer requires full model retraining or fine-tuning for each new harmful concept.
  • Existing image generators and already-unlearned models can gain added robustness through plug-in redirection.
  • Adversarial attacks via prompt rephrasing become less effective because interventions occur in embedding space.
  • Semantic and perceptual quality of safe generations stays close to the original model's output.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same redirection logic could be tested on text-to-video or text-to-3D models to see if embedding interventions transfer across generation modalities.
  • Deployment pipelines might reduce reliance on separate post-generation filters if redirection proves consistent across prompt distributions.
  • Concept-specific redirection strength could be tuned per user or per deployment to balance safety with creative freedom on borderline prompts.

Load-bearing premise

The safety classifier reliably flags unsafe trajectories and the delta generator redirects them without creating artifacts or mistakenly altering safe prompts.

What would settle it

Running SafeRedir on a set of adversarial paraphrases of known harmful prompts and measuring whether any still produce the targeted unsafe content, or testing image quality metrics on a large set of safe prompts before and after redirection to check for unintended degradation.

Figures

Figures reproduced from arXiv: 2601.08623 by Han Qiu, Jie Zhang, Kangjie Chen, Kwok-Yan Lam, Renyang Liu, See-kiong Ng, Tianwei Zhang.

Figure 1
Figure 1. Figure 1: A demo case of SafeRedir. Given the prompt 𝑝=“A naked woman sits on a rock by a waterfall”, a standard diffusion pipeline (left) directly encodes the prompt and generates images 𝐼 containing explicit content. In contrast, SafeRedir (right) intercepts the prompt embedding, performs token-level semantic redirection to filter unsafe concepts, and injects the updated embedding into the denoising process. The r… view at source ↗
Figure 2
Figure 2. Figure 2: Generated images of unlearning methods on three [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: (a) Images generated by leveraging unlearned models [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: SafeRedir inference pipeline for safety-aware text-to-image generation. The framework intercepts user prompts and injects token-wise semantic guidance during the denoising process. Unsafe semantic elements (e.g., “naked person”) are automatically redirected in the prompt embedding space at each denoising step 𝑡, resulting in sanitized and semantically coherent outputs. For safe prompts, the original genera… view at source ↗
Figure 5
Figure 5. Figure 5: Selective semantic redirection. Prompt embeddings for unsafe and safe content form distinct clusters separated by a safe boundary. SafeRedir minimally shifts only unsafe embeddings into the safe region using 𝛼 · Δ˜, leaving benign prompts unchanged. Solid arrows indicate effective redirection; dashed arrows indicate ineffective directions or scales. where Δ˜ denotes a learned direction from unsafe to safe,… view at source ↗
Figure 7
Figure 7. Figure 7: SafeRedir for safety detection. It fuses multi-modal inputs—image latent features z𝑡 , timestep 𝑡, and prompt embeddings 𝑝𝑒𝑚𝑏—via dedicated encoders and multi-scale cross-attention 𝑓attn, which will be used for safety detection. TABLE IV: Performance of different configurations of redirec￾tion embedding, scaling factor 𝛼, and mask 𝑚. Here, emb1 is the vector difference embsafe −embunsafe, and emb2 is predi… view at source ↗
Figure 6
Figure 6. Figure 6: Latent-only detection accuracy vs. diffusion step. Label Pred y Label Predictor Linear & SiLU &Dropout D [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 8
Figure 8. Figure 8: shows our redirection mechanism, which computes three key factors: (1) the token-wise shift vector (Δ) denotes the direction of correction in the embedding space; (2) the adaptive scaling factor 𝛼 determines the magnitude of correction; and (3) the soft mask 𝑚 determines the locations of tokens for corrections. These three factors form a robust and flexible intervention pipeline, enabling dynamic adjustmen… view at source ↗
Figure 9
Figure 9. Figure 9: Comparison of redirection strategies in embedding [PITH_FULL_IMAGE:figures/full_fig_p008_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Person Detect Rate (PDR) for person-centric unsafe [PITH_FULL_IMAGE:figures/full_fig_p010_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Comparison of unlearning methods on image quality [PITH_FULL_IMAGE:figures/full_fig_p011_11.png] view at source ↗
Figure 13
Figure 13. Figure 13: Person detection rates on unsafe prompts after [PITH_FULL_IMAGE:figures/full_fig_p013_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Extended qualitative comparison of incomplete forgetting in image generation model unlearning. Sample outputs of a wide range of unlearning methods on three representative forgetting tasks: Van Gogh style (top), NSFW (middle), and Church (bottom). Each column corresponds to a mainstream method. Across all settings, sensitive content or style is often only partially removed, with residual attributes, subje… view at source ↗
Figure 15
Figure 15. Figure 15: Images generated by various unlearning models in response to prompts containing [PITH_FULL_IMAGE:figures/full_fig_p020_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Images generated by various unlearning models in response to prompts containing [PITH_FULL_IMAGE:figures/full_fig_p021_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Nudity content reduced rate across different unlearning methods compared to the original (ORG) model. Each horizontal [PITH_FULL_IMAGE:figures/full_fig_p022_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: SafeRedir Transferability to Other Models. Visual examples demonstrating the transferability of SafeRedir to a range of popular diffusion backbones, including SD v1.5, Any v3, DL v1, OJ v1, RV v1.4, WD v1.3. The left block (a, Initial Model) shows that all original models generate NSFW content when prompted with explicit queries. The right block (b, +SafeRedir) demonstrates that integrating SafeRedir robu… view at source ↗
Figure 19
Figure 19. Figure 19: Forgetting Performance Improvements of Existing Baselines Brought by SafeRedir. Each column represents a different baseline model after applying SafeRedir, and each row corresponds to a prompt containing NSFW content. SafeRedir effectively removes residual explicit features, restores natural and well-clothed appearances, and preserves scene semantics and visual fidelity across all baselines. These results… view at source ↗
read the original abstract

Image generation models (IGMs), while capable of producing impressive and creative content, often memorize a wide range of undesirable concepts from their training data, leading to the reproduction of unsafe content such as NSFW imagery and copyrighted artistic styles. Such behaviors pose persistent safety and compliance risks in real-world deployments and cannot be reliably mitigated by post-hoc filtering, owing to the limited robustness of such mechanisms and a lack of fine-grained semantic control. Recent unlearning methods seek to erase harmful concepts at the model level, which exhibit the limitations of requiring costly retraining, degrading the quality of benign generations, or failing to withstand prompt paraphrasing and adversarial attacks. To address these challenges, we introduce SafeRedir, a lightweight inference-time framework for robust unlearning via prompt embedding redirection. Without modifying the underlying IGMs, SafeRedir adaptively routes unsafe prompts toward safe semantic regions through token-level interventions in the embedding space. The framework comprises two core components: a latent-aware multi-modal safety classifier for identifying unsafe generation trajectories, and a token-level delta generator for precise semantic redirection, equipped with auxiliary predictors for token masking and adaptive scaling to localize and regulate the intervention. Empirical results across multiple representative unlearning tasks demonstrate that SafeRedir achieves effective unlearning capability, high semantic and perceptual preservation, robust image quality, and enhanced resistance to adversarial attacks. Furthermore, SafeRedir generalizes effectively across a variety of diffusion backbones and existing unlearned models, validating its plug-and-play compatibility and broad applicability. Code and data are available at https://github.com/ryliu68/SafeRedir.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces SafeRedir, a lightweight inference-time framework for robust unlearning in image generation models via prompt embedding redirection. It comprises a latent-aware multi-modal safety classifier to identify unsafe generation trajectories and a token-level delta generator (with auxiliary masking and adaptive scaling predictors) to redirect embeddings toward safe semantic regions without modifying the underlying diffusion model. The authors claim that empirical results across multiple unlearning tasks show effective concept erasure, high semantic/perceptual preservation, robust image quality, enhanced adversarial resistance, and generalization across diffusion backbones and existing unlearned models.

Significance. If the central empirical claims hold with adequate quantitative support, SafeRedir would represent a practical advance by providing a plug-and-play, training-free alternative to retraining-based unlearning methods, potentially enabling safer real-world deployment of image generation models while maintaining generation quality and attack robustness.

major comments (3)
  1. [Abstract] Abstract: the claim of 'effective unlearning capability, high semantic and perceptual preservation, robust image quality, and enhanced resistance to adversarial attacks' is presented without any quantitative metrics, baselines, or specific results (e.g., no FID, CLIP scores, attack success rates, or classifier precision/recall), preventing assessment of whether the data actually support the stated conclusions.
  2. [Method (latent-aware classifier)] Method description of the latent-aware multi-modal safety classifier: no accuracy metrics (precision, recall, FPR on balanced safe/unsafe prompt sets) or validation details are supplied to confirm that unsafe trajectories can be reliably identified from prompt embeddings alone; this directly bears on the preservation claims for benign prompts.
  3. [Experiments] Experiments section: no ablation isolating the token-level delta generator's effect (masking and scaling predictors) on benign prompts is reported, leaving the risk of unintended semantic drift or quality degradation unquantified and undermining the 'high preservation' and 'plug-and-play' assertions.
minor comments (1)
  1. [Abstract] The GitHub link is provided but the manuscript does not indicate whether the released code includes the exact experimental configurations and random seeds used for the reported results.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to provide stronger quantitative support.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of 'effective unlearning capability, high semantic and perceptual preservation, robust image quality, and enhanced resistance to adversarial attacks' is presented without any quantitative metrics, baselines, or specific results (e.g., no FID, CLIP scores, attack success rates, or classifier precision/recall), preventing assessment of whether the data actually support the stated conclusions.

    Authors: We agree that the abstract would be strengthened by including concrete quantitative results. In the revised version we will add specific metrics (e.g., FID, CLIP similarity, attack success rates, and classifier precision/recall) drawn from the experiments section to directly support the stated claims. revision: yes

  2. Referee: [Method (latent-aware classifier)] Method description of the latent-aware multi-modal safety classifier: no accuracy metrics (precision, recall, FPR on balanced safe/unsafe prompt sets) or validation details are supplied to confirm that unsafe trajectories can be reliably identified from prompt embeddings alone; this directly bears on the preservation claims for benign prompts.

    Authors: The referee correctly notes the absence of explicit classifier metrics in the method section. We will revise this section to report precision, recall, and FPR on balanced safe/unsafe prompt sets together with validation details, thereby clarifying the classifier's reliability and its limited impact on benign prompts. revision: yes

  3. Referee: [Experiments] Experiments section: no ablation isolating the token-level delta generator's effect (masking and scaling predictors) on benign prompts is reported, leaving the risk of unintended semantic drift or quality degradation unquantified and undermining the 'high preservation' and 'plug-and-play' assertions.

    Authors: We acknowledge that an ablation isolating the masking and scaling predictors on benign prompts is missing. In the revised experiments we will add this ablation, reporting CLIP scores, FID, and perceptual metrics on benign prompts with and without these components to quantify any semantic drift or quality impact. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents SafeRedir as an independent inference-time framework with two explicitly described components (latent-aware classifier and token-level delta generator) whose operation is not reduced to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. No equations, uniqueness theorems, or ansatzes are shown that equate outputs to inputs by construction. Empirical results are positioned as separate validation across backbones, making the central claims externally falsifiable rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to identify specific free parameters, axioms, or invented entities; the approach relies on standard components such as classifiers and generators without explicit new postulates.

pith-pipeline@v0.9.0 · 5614 in / 1061 out tokens · 76202 ms · 2026-05-16T14:21:28.729927+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 1 internal anchor

  1. [1]

    High- resolution image synthesis with latent diffusion models,

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High- resolution image synthesis with latent diffusion models,” inCVPR. IEEE, 2022, pp. 10 674–10 685

  2. [2]

    Dall-e 3: Text-to-image generation and editing,

    OpenAI, “Dall-e 3: Text-to-image generation and editing,”OpenAI Technical Report, 2023

  3. [3]

    Photorealistic text-to-image diffusion models with deep language understanding,

    C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, S. K. S. Ghasemipour, R. G. Lopes, B. K. Ayan, T. Salimans, J. Ho, D. J. Fleet, and M. Norouzi, “Photorealistic text-to-image diffusion models with deep language understanding,” inNeurIPS, 2022

  4. [4]

    Midjourney,

    Midjourney, “Midjourney,” 2022, https://en.wikipedia.org/wiki/ Midjourney

  5. [5]

    Safegen: Mitigating unsafe content generation in text-to-image models,

    X. Li, Y . Yang, J. Deng, C. Yan, Y . Chen, X. Ji, and W. Xu, “Safegen: Mitigating unsafe content generation in text-to-image models,” inCCS, 2024

  6. [6]

    Safe-clip: Removing nsfw concepts from vision-and-language models,

    S. Poppi, T. Poppi, F. Cocchi, M. Cornia, L. Baraldi, R. Cucchiaraet al., “Safe-clip: Removing nsfw concepts from vision-and-language models,” inECCV, 2024

  7. [7]

    Unsafe diffusion: On the generation of unsafe images and hateful memes from text-to-image models,

    Y . Qu, X. Shen, X. He, M. Backes, S. Zannettou, and Y . Zhang, “Unsafe diffusion: On the generation of unsafe images and hateful memes from text-to-image models,” inCCS. ACM, 2023, pp. 3403–3417

  8. [8]

    Regulation (eu) 2016/679 of the european parliament and of the council,

    P. Regulation, “Regulation (eu) 2016/679 of the european parliament and of the council,”Regulation (eu), vol. 679, p. 2016, 2016

  9. [9]

    Erasing concepts from diffusion models,

    R. Gandikota, J. Materzynska, J. Fiotto-Kaufman, and D. Bau, “Erasing concepts from diffusion models,” inICCV. IEEE, 2023, pp. 2426–2436

  10. [10]

    Unified concept editing in diffusion models,

    R. Gandikota, H. Orgad, Y . Belinkov, J. Materzynska, and D. Bau, “Unified concept editing in diffusion models,” inWACV. IEEE, 2024, pp. 5099–5108

  11. [11]

    MACE: mass concept erasure in diffusion models,

    S. Lu, Z. Wang, L. Li, Y . Liu, and A. W. Kong, “MACE: mass concept erasure in diffusion models,” inCVPR. IEEE, 2024, pp. 6430–6440

  12. [12]

    Mma-diffusion: Multimodal attack on diffusion models,

    Y . Yang, R. Gao, X. Wang, T. Ho, N. Xu, and Q. Xu, “Mma-diffusion: Multimodal attack on diffusion models,” inCVPR. IEEE, 2024, pp. 7737–7746

  13. [13]

    Sneakyprompt: Jailbreaking text-to-image generative models,

    Y . Yang, B. Hui, H. Yuan, N. Gong, and Y . Cao, “Sneakyprompt: Jailbreaking text-to-image generative models,” inS&P. IEEE, 2024, pp. 897–912

  14. [14]

    Surrogateprompt: Bypassing the safety filter of text-to-image models via substitution,

    Z. Ba, J. Zhong, J. Lei, P. Cheng, Q. Wang, Z. Qin, Z. Wang, and K. Ren, “Surrogateprompt: Bypassing the safety filter of text-to-image models via substitution,” inCCS, B. Luo, X. Liao, J. Xu, E. Kirda, and D. Lie, Eds. ACM, 2024, pp. 1166–1180

  15. [15]

    Reliable and efficient concept erasure of text-to-image diffusion models,

    C. Gong, K. Chen, Z. Wei, J. Chen, and Y . Jiang, “Reliable and efficient concept erasure of text-to-image diffusion models,” inECCV. Springer, 2024, pp. 73–88

  16. [16]

    Conceptprune: Concept editing in diffusion models via skilled neuron pruning,

    R. Chavhan, D. Li, and T. M. Hospedales, “Conceptprune: Concept editing in diffusion models via skilled neuron pruning,” inICLR. OpenReview.net, 2025

  17. [17]

    Defensive unlearning with adversarial training for robust concept erasure in diffusion models,

    Y . Zhang, X. Chen, J. Jia, Y . Zhang, C. Fan, J. Liu, M. Hong, K. Ding, and S. Liu, “Defensive unlearning with adversarial training for robust concept erasure in diffusion models,” inNeurIPS, 2024, pp. 36 748–36 776

  18. [18]

    Receler: Reliable concept erasing of text-to-image diffusion models via lightweight erasers,

    C. Huang, K. Chang, C. Tsai, Y . Lai, F. Yang, and Y . F. Wang, “Receler: Reliable concept erasing of text-to-image diffusion models via lightweight erasers,” inECCV, vol. 15098. Springer, 2024, pp. 360–376

  19. [19]

    Localizing and editing knowledge in text-to-image generative models,

    S. Basu, N. Zhao, V . I. Morariu, S. Feizi, and V . Manjunatha, “Localizing and editing knowledge in text-to-image generative models,” inICLR. OpenReview.net, 2024

  20. [20]

    Erasing concepts, steering generations: A comprehensive survey of concept suppression.arXiv preprint arXiv:2505.19398,

    Y . Xie, P. Liu, and Z. Zhang, “Erasing concepts, steering genera- tions: A comprehensive survey of concept suppression,”arXiv preprint arXiv:2505.19398, 2025

  21. [21]

    To generate or not? safety-driven unlearned diffusion models are still easy to generate unsafe images ... for now,

    Y . Zhang, J. Jia, X. Chen, A. Chen, Y . Zhang, J. Liu, K. Ding, and S. Liu, “To generate or not? safety-driven unlearned diffusion models are still easy to generate unsafe images ... for now,” inCVPR. IEEE, 2024, pp. 385–403. 14

  22. [22]

    Image can bring your memory back: A novel multi-modal guided attack against image generation model unlearning,

    R. Liu, G. Li, T. Zhang, and S.-K. Ng, “Image can bring your memory back: A novel multi-modal guided attack against image generation model unlearning,”arXiv preprint arXiv:2507.07139, 2025

  23. [23]

    SSC-V AE: structured sparse coding based variational autoencoder for detail preserved image reconstruction,

    H. Wang, L. Wang, Z. Wang, L. Ma, and Y . Luo, “SSC-V AE: structured sparse coding based variational autoencoder for detail preserved image reconstruction,” inAAAI, T. Walsh, J. Shah, and Z. Kolter, Eds. AAAI Press, 2025, pp. 7665–7673

  24. [24]

    Stargan v2: Diverse image synthesis for multiple domains,

    Y . Choi, Y . Uh, J. Yoo, and J. Ha, “Stargan v2: Diverse image synthesis for multiple domains,” inCVPR. Computer Vision Foundation / IEEE, 2020, pp. 8185–8194

  25. [25]

    Styleflow: Attribute- conditioned exploration of stylegan-generated images using conditional continuous normalizing flows,

    R. Abdal, P. Zhu, N. J. Mitra, and P. Wonka, “Styleflow: Attribute- conditioned exploration of stylegan-generated images using conditional continuous normalizing flows,”ACM Trans. Graph., vol. 40, no. 3, pp. 21:1–21:21, 2021

  26. [26]

    Denoising diffusion probabilistic models,

    J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” inNeurIPS, 2020

  27. [27]

    Denoising diffusion implicit models,

    J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” inICLR. OpenReview.net, 2021

  28. [28]

    U-net: Convolutional networks for biomedical image segmentation,

    O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” inMICCAI, vol. 9351, 2015, pp. 234–241

  29. [29]

    LAION-5B: an open large-scale dataset for training next generation image-text models,

    C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wight- man, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. Kundurthy, K. Crowson, L. Schmidt, R. Kaczmarczyk, and J. Jitsev, “LAION-5B: an open large-scale dataset for training next generation image-text models,” inNeurIPS, 2022

  30. [30]

    Stable diffusion v1.5,

    CompVis, “Stable diffusion v1.5,” https://huggingface.co/ stable-diffusion-v1-5/stable-diffusion-v1-5, 2022

  31. [31]

    Extracting training data from diffusion models,

    N. Carlini, J. Hayes, M. Nasr, M. Jagielski, V . Sehwag, F. Tram `er, B. Balle, D. Ippolito, and E. Wallace, “Extracting training data from diffusion models,” inUSENIX Security. USENIX Association, 2023, pp. 5253–5270

  32. [32]

    Machine unlearning,

    L. Bourtoule, V . Chandrasekaran, C. A. Choquette-Choo, H. Jia, A. Travers, B. Zhang, D. Lie, and N. Papernot, “Machine unlearning,” inS&P. IEEE, 2021, pp. 141–159

  33. [33]

    Rethinking machine unlearning in image generation models,

    R. Liu, W. Feng, T. Zhang, W. Zhou, X. Cheng, and S.-K. Ng, “Rethinking machine unlearning in image generation models,” inCCS, 2025

  34. [34]

    Athena: Unlearning spurious features via data filtering and model fine-tuning,

    F. Sommeret al., “Athena: Unlearning spurious features via data filtering and model fine-tuning,”NeurIPS, 2022

  35. [35]

    Concept pinpoint eraser for text-to-image diffusion models via residual attention gate,

    B. H. Lee, S. Lim, S. Lee, D. U. Kang, and S. Y . Chun, “Concept pinpoint eraser for text-to-image diffusion models via residual attention gate,” inICLR, 2025

  36. [36]

    Safe text-to-image generation: Simply sanitize the prompt embedding,

    H. Qiu, G. Chen, M. Zhang, X. Zhang, X. You, and M. Yang, “Safe text-to-image generation: Simply sanitize the prompt embedding,”arXiv, 2024

  37. [37]

    Personalization as a shortcut for few-shot backdoor attack against text-to-image diffusion models,

    Y . Huang, F. Juefei-Xu, Q. Guo, J. Zhang, Y . Wu, M. Hu, T. Li, G. Pu, and Y . Liu, “Personalization as a shortcut for few-shot backdoor attack against text-to-image diffusion models,” inAAAI, 2024, pp. 21 169–21 178

  38. [38]

    Yolov8: The next generation of yolo,

    G. Jocher, A. Chaurasia, J. Qiu, and R. Stoken, “Yolov8: The next generation of yolo,” https://github.com/ultralytics/ultralytics, 2023

  39. [39]

    Ring-a-bell! how reliable are concept removal methods for diffusion models?

    Y . Tsai, C. Hsu, C. Xie, C. Lin, J. Chen, B. Li, P. Chen, C. Yu, and C. Huang, “Ring-a-bell! how reliable are concept removal methods for diffusion models?” inICLR. OpenReview.net, 2024

  40. [40]

    Decision-based adversarial attacks: Reliable attacks against black-box machine learning models,

    W. Brendel, J. Rauber, and M. Bethge, “Decision-based adversarial attacks: Reliable attacks against black-box machine learning models,” in ICLR, 2018

  41. [41]

    ADBA: approximation decision boundary approach for black-box adversarial attacks,

    F. Wang, X. Zuo, H. Huang, and G. Chen, “ADBA: approximation decision boundary approach for black-box adversarial attacks,” inAAAI, 2025, pp. 7628–7636

  42. [42]

    Safe inputs but unsafe output: Benchmarking cross-modality safety alignment of large vision-language models,

    S. Wang, X. Ye, Q. Cheng, J. Duan, S. Li, J. Fu, X. Qiu, and X. Huang, “Safe inputs but unsafe output: Benchmarking cross-modality safety alignment of large vision-language models,” inNAACL. Association for Computational Linguistics, 2025, pp. 3563–3605

  43. [44]

    GPT-4 Technical Report

    [Online]. Available: https://arxiv.org/abs/2303.08774

  44. [45]

    Unlearning concepts in diffusion model via concept domain correction and concept preserving gradient,

    Y . Wu, S. Zhou, M. Yang, L. Wang, W. Zhu, H. Chang, X. Zhou, and X. Yang, “Unlearning concepts in diffusion model via concept domain correction and concept preserving gradient,” inAAAI, 2025

  45. [46]

    Anything-v3.0,

    admruul, “Anything-v3.0,” https://huggingface.co/admruul/anything-v3.0, 2022

  46. [47]

    Dreamlike diffusion 1.0,

    dreamlike art, “Dreamlike diffusion 1.0,” https://huggingface.co/ dreamlike-art/dreamlike-diffusion-1.0, 2022

  47. [48]

    Openjourney v1,

    PromptHero, “Openjourney v1,” https://huggingface.co/prompthero/ openjourney, 2022

  48. [49]

    Realistic vision v1.4,

    SG161222, “Realistic vision v1.4,” https://huggingface.co/SG161222/ Realistic Vision V1.4, 2022

  49. [50]

    Waifu diffusion v1.3,

    hakurei, “Waifu diffusion v1.3,” https://huggingface.co/hakurei/ waifu-diffusion-v1-3, 2022

  50. [51]

    Nudenet: Deep learning model for nudity detection,

    B. Praneeth, “Nudenet: Deep learning model for nudity detection,” https: //github.com/notAI-tech/NudeNet, 2023

  51. [52]

    Six-cd: Benchmarking concept removals for benign text-to- image diffusion models,

    J. Ren, K. Chen, Y . Cui, S. Zeng, H. Liu, Y . Xing, J. Tang, and L. Lyu, “Six-cd: Benchmarking concept removals for benign text-to- image diffusion models,” inCVPR. IEEE, 2025, pp. 28 769–28 778

  52. [53]

    EraX-NSFW-V1.0: An open nsfw image classifier,

    erax ai, “EraX-NSFW-V1.0: An open nsfw image classifier,” https:// huggingface.co/erax-ai/EraX-NSFW-V1.0, 2023, accessed: 2025-04-10

  53. [54]

    Deep residual learning for image recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” inCVPR. IEEE, 2016, pp. 770–778

  54. [55]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium,

    M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” inNeurIPS, 2017, pp. 6626–6637

  55. [56]

    Q-align: Teaching lmms for visual scoring via discrete text-defined levels,

    H. Wu, Z. Zhang, W. Zhang, C. Chen, L. Liao, C. Li, Y . Gao, A. Wang, E. Zhang, W. Sun, Q. Yan, X. Min, G. Zhai, and W. Lin, “Q-align: Teaching lmms for visual scoring via discrete text-defined levels,” in ICML, 2024

  56. [57]

    Clipscore: A reference-free evaluation metric for image captioning,

    J. Hessel, A. Holtzman, M. Forbes, R. L. Bras, and Y . Choi, “Clipscore: A reference-free evaluation metric for image captioning,” inEMNLP. Association for Computational Linguistics, 2021, pp. 7514–7528

  57. [58]

    Stable diffusion 2.0 release,

    Stability AI, “Stable diffusion 2.0 release,” https://stability.ai/blog/ stable-diffusion-v2-release, 2022, accessed: 2025-05-30

  58. [59]

    Stable diffusion 2.1 on hugging face,

    ——, “Stable diffusion 2.1 on hugging face,” https://huggingface.co/ stabilityai/stable-diffusion-2, 2023, accessed: 2025-05-30. 15 APPENDIXA APPENDIXOVERVIEW This appendix provides supplementary material that expands upon the main paper by presenting additional details, analyses, and experimental results omitted due to space constraints. All content herei...

  59. [60]

    naked”, while the safe counterpart replaces it with “well-clothed

    Dataset Construction We construct a dataset of semantically aligned safe and unsafe prompt pairs to enable supervised learning of SafeRedir. Each pair (𝑝 safe, 𝑝unsafe) is carefully curated such that both prompts describe the same benign context but differ by the presence of a high-risk element. Take theNSFWtask as an instance, the unsafe prompt may conta...

  60. [61]

    nude” or “naked

    Training Objective The SafeRedir model is optimized end-to-end using a multi- component objective designed to enable accurate unsafe content detection, fine-grained token-level guidance, and minimal semantic disruption to benign content. The overall training loss is given by: Ltotal =𝜆 clsLcls +𝜆 mseLmse +𝜆 cosLcos +𝜆 maskLmask +𝜆 𝛼L 𝛼, (14) where𝜆 ∗ are ...

  61. [62]

    ANUS EXPOSED

    Forgetting Qualitative Comparison:Due to page limitations, repre- sentative generations forNSFWandVan Gogh Styleunlearning are presented in Figs. 15 and 16, respectively, in response to prompts containing the corresponding concepts. These qualitative results, consistent with our quantitative findings, demonstrate that SafeRedir not only achieves effective...

  62. [63]

    Adopted to Other Models Fig. 18 presents qualitative results demonstrating the trans- ferability of SafeRedir to a diverse set of community diffusion models, including SD v1.5, Any v3, DL v1, OJ v1, RV v1.4, and WD v1.3. Each row corresponds to a distinctNSFWprompt. The left block shows outputs from the original models, which consistently generate sensiti...

  63. [64]

    19 illustrates qualitative improvements achieved by integrating SafeRedir into ten representative unlearning meth- ods

    Enhancement of Existing Unlearning Fig. 19 illustrates qualitative improvements achieved by integrating SafeRedir into ten representative unlearning meth- ods. Across all cases, residualNSFWcontent is effectively removed, and visual or semantic artifacts introduced by the original methods are mitigated. SafeRedir enhances image realism, preserves scene co...

  64. [65]

    Core Inputs, Model Components, and Training Strategies We conduct a comprehensive ablation study to quantify the contributions of each core element in SafeRedir across three evaluation dimensions: forgetting effectiveness (FSR), preser- vation (CSDR, YOLO), and image quality (FID, LPIPS, Q- Align, Laion aes). Specifically, we first analyze the importance ...

  65. [66]

    In real- world applications, the number of sampling steps is often adjusted dynamically based on computational budgets or latency constraints

    Robustness to Sampling Steps A critical consideration for the practical deployment of safety-guided unlearning in diffusion models is its robust- ness to variation in inference-time sampling steps. In real- world applications, the number of sampling steps is often adjusted dynamically based on computational budgets or latency constraints. Therefore, it is...

  66. [67]

    Evaluations are conducted using DDIM, PNDM, and LMSD schedulers under consistent training settings

    Robustness to Sampling Scheduler We further assess the performance of SafeRedir under differ- ent diffusion schedulers, as practical deployments often require switching between sampling algorithms to balance quality and efficiency. Evaluations are conducted using DDIM, PNDM, and LMSD schedulers under consistent training settings. 24 TABLE XV: Robustness o...