SafeRedir: Prompt Embedding Redirection for Robust Unlearning in Image Generation Models

Han Qiu; Jie Zhang; Kangjie Chen; Kwok-Yan Lam; Renyang Liu; See-kiong Ng; Tianwei Zhang

arXiv: 2601.08623 · v2 · submitted 2026-01-13 · 💻 cs.CV · cs.AI · cs.CR · cs.LG

Pith reviewed 2026-05-16 14:21 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CR · cs.LG

keywords unlearning · image generation · diffusion models · prompt embedding · safety classifier · adversarial robustness · inference-time intervention

The pith

SafeRedir redirects unsafe prompt embeddings toward safe regions at inference time to unlearn harmful concepts in image generators without retraining the model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

SafeRedir is a lightweight framework that identifies unsafe generation paths in the embedding space of image generation models and redirects them via targeted token adjustments. It operates without altering the underlying model weights, avoiding the quality degradation and computational costs of retraining-based unlearning approaches. The method uses a safety classifier to spot risky trajectories and a delta generator to shift prompts toward benign semantics while preserving details for safe inputs. The authors report that it maintains image quality, resists adversarial attacks better than prior methods, and applies across different diffusion backbones and already-unlearned models.

Core claim

By combining a latent-aware multi-modal safety classifier with a token-level delta generator that includes masking and scaling predictors, SafeRedir can route unsafe prompts to safe semantic areas in embedding space during inference, achieving effective removal of harmful concepts such as NSFW content or copyrighted styles while retaining high semantic fidelity and perceptual quality for benign prompts.

What carries the argument

Token-level delta generator with auxiliary predictors for masking and adaptive scaling, driven by a latent-aware multi-modal safety classifier that detects unsafe trajectories in embedding space.
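
The token-level intervention can be sketched as a per-token update. This is a minimal illustration, not the authors' implementation: it assumes the redirected embedding takes the form e'_i = e_i + α · m_i · Δ_i, which combines the shift Δ, scaling factor α, and soft mask m named in the figure captions. The function name `redirect_embedding` and all numeric values are hypothetical.

```python
from typing import List

Vector = List[float]


def redirect_embedding(
    tokens: List[Vector],
    deltas: List[Vector],
    mask: List[float],
    alpha: float,
) -> List[Vector]:
    """Shift each token embedding by alpha * mask[i] * deltas[i].

    Assumed update rule (illustrative only, not taken from the paper's code):
        e'_i = e_i + alpha * m_i * delta_i
    """
    if not (len(tokens) == len(deltas) == len(mask)):
        raise ValueError("tokens, deltas and mask must have the same length")
    redirected = []
    for emb, delta, m in zip(tokens, deltas, mask):
        if len(emb) != len(delta):
            raise ValueError("each delta must match its token's dimension")
        redirected.append([e + alpha * m * d for e, d in zip(emb, delta)])
    return redirected


# Toy example: two tokens with 2-dimensional embeddings (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0]]
deltas = [[0.5, 0.5], [-1.0, 1.0]]
mask = [0.0, 1.0]  # a mask of 0 leaves the first token untouched
alpha = 0.5
result = redirect_embedding(tokens, deltas, mask, alpha)
print(result)
```

In this toy run only the second token moves, which is the localizing role the description assigns to the mask; in the real system the mask, scale, and shift would be predicted by learned components rather than set by hand.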

If this is right

Unlearning no longer requires full model retraining or fine-tuning for each new harmful concept.
Existing image generators and already-unlearned models can gain added robustness through plug-in redirection.
Adversarial attacks via prompt rephrasing become less effective because interventions occur in embedding space.
Semantic and perceptual quality of safe generations stays close to the original model's output.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same redirection logic could be tested on text-to-video or text-to-3D models to see if embedding interventions transfer across generation modalities.
Deployment pipelines might reduce reliance on separate post-generation filters if redirection proves consistent across prompt distributions.
Concept-specific redirection strength could be tuned per user or per deployment to balance safety with creative freedom on borderline prompts.

Load-bearing premise

The safety classifier reliably flags unsafe trajectories and the delta generator redirects them without creating artifacts or mistakenly altering safe prompts.
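
The premise has two failure modes, and a small sketch makes them concrete. The code below assumes the classifier emits a scalar unsafe score in [0, 1] and that redirection is gated by a fixed threshold. Both are assumptions for illustration; the names `gated_redirect` and `toy_shift` and the 0.5 threshold are hypothetical and do not come from the paper.

```python
from typing import Callable, List

Embedding = List[List[float]]


def gated_redirect(
    embedding: Embedding,
    unsafe_score: float,
    redirect: Callable[[Embedding], Embedding],
    threshold: float = 0.5,
) -> Embedding:
    """Apply `redirect` only when the (assumed) classifier score is high.

    A score below the threshold returns the embedding unchanged, so a
    false negative lets an unsafe prompt through, while a false positive
    alters a benign prompt.
    """
    if not 0.0 <= unsafe_score <= 1.0:
        raise ValueError("unsafe_score must lie in [0, 1]")
    if unsafe_score >= threshold:
        return redirect(embedding)
    return embedding


def toy_shift(embedding: Embedding) -> Embedding:
    """Stand-in for a learned delta generator: adds 1.0 to every entry."""
    return [[x + 1.0 for x in token] for token in embedding]


prompt_embedding = [[0.0, 2.0]]
safe_out = gated_redirect(prompt_embedding, unsafe_score=0.1, redirect=toy_shift)
unsafe_out = gated_redirect(prompt_embedding, unsafe_score=0.9, redirect=toy_shift)
```

Under this sketch, preservation of benign prompts depends entirely on the classifier's false-positive rate, which is why that number matters for the claims above.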

What would settle it

Either of two experiments would settle it: running SafeRedir on a set of adversarial paraphrases of known harmful prompts and measuring whether any still produce the targeted unsafe content, or computing image quality metrics on a large set of safe prompts before and after redirection to check for unintended degradation.
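
Both checks reduce to simple aggregates once per-prompt judgments exist. The sketch below assumes a boolean "still unsafe" flag per adversarial prompt and a higher-is-better quality score per safe prompt. The helper names and the numbers are hypothetical, and the detector and quality metric that would produce these inputs are left unspecified.

```python
from typing import Sequence


def attack_success_rate(unsafe_flags: Sequence[bool]) -> float:
    """Fraction of adversarial prompts whose output was still judged unsafe."""
    if not unsafe_flags:
        raise ValueError("need at least one adversarial prompt")
    return sum(unsafe_flags) / len(unsafe_flags)


def mean_quality_drop(before: Sequence[float], after: Sequence[float]) -> float:
    """Mean per-prompt drop in a higher-is-better score on safe prompts."""
    if not before or len(before) != len(after):
        raise ValueError("score lists must be non-empty and equally long")
    return sum(b - a for b, a in zip(before, after)) / len(before)


# Made-up judgments for four adversarial paraphrases (True = still unsafe).
flags = [False, False, True, False]
asr = attack_success_rate(flags)

# Made-up quality scores for two safe prompts, before and after redirection.
drop = mean_quality_drop(before=[0.5, 0.75], after=[0.5, 0.5])
```

An attack success rate near zero together with a quality drop near zero would support the premise; a high value on either would count against it.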

Figures

Figures reproduced from arXiv: 2601.08623 by Han Qiu, Jie Zhang, Kangjie Chen, Kwok-Yan Lam, Renyang Liu, See-kiong Ng, Tianwei Zhang.

**Figure 1.** A demo case of SafeRedir. Given the prompt 𝑝 = “A naked woman sits on a rock by a waterfall”, a standard diffusion pipeline (left) directly encodes the prompt and generates images 𝐼 containing explicit content. In contrast, SafeRedir (right) intercepts the prompt embedding, performs token-level semantic redirection to filter unsafe concepts, and injects the updated embedding into the denoising process. The r…

**Figure 2.** Generated images of unlearning methods on three …

**Figure 3.** (a) Images generated by leveraging unlearned models …

**Figure 4.** SafeRedir inference pipeline for safety-aware text-to-image generation. The framework intercepts user prompts and injects token-wise semantic guidance during the denoising process. Unsafe semantic elements (e.g., “naked person”) are automatically redirected in the prompt embedding space at each denoising step 𝑡, resulting in sanitized and semantically coherent outputs. For safe prompts, the original genera…

**Figure 5.** Selective semantic redirection. Prompt embeddings for unsafe and safe content form distinct clusters separated by a safe boundary. SafeRedir minimally shifts only unsafe embeddings into the safe region using 𝛼 · Δ̃, leaving benign prompts unchanged. Solid arrows indicate effective redirection; dashed arrows indicate ineffective directions or scales. … where Δ̃ denotes a learned direction from unsafe to safe,…

**Figure 6.** Latent-only detection accuracy vs. diffusion step.

**Figure 7.** SafeRedir for safety detection. It fuses multi-modal inputs (image latent features z_𝑡, timestep 𝑡, and prompt embeddings 𝑝_emb) via dedicated encoders and multi-scale cross-attention 𝑓_attn, which will be used for safety detection.

**Table IV.** Performance of different configurations of redirection embedding, scaling factor 𝛼, and mask 𝑚. Here, emb1 is the vector difference emb_safe − emb_unsafe, and emb2 is predi…

**Figure 8.** The redirection mechanism computes three key factors: (1) the token-wise shift vector Δ, which gives the direction of correction in the embedding space; (2) the adaptive scaling factor 𝛼, which determines the magnitude of correction; and (3) the soft mask 𝑚, which determines the locations of tokens for correction. These three factors form a robust and flexible intervention pipeline, enabling dynamic adjustmen…

**Figure 9.** Comparison of redirection strategies in embedding …

**Figure 10.** Person Detect Rate (PDR) for person-centric unsafe …

**Figure 11.** Comparison of unlearning methods on image quality …

**Figure 13.** Person detection rates on unsafe prompts after …

**Figure 14.** Extended qualitative comparison of incomplete forgetting in image generation model unlearning. Sample outputs of a wide range of unlearning methods on three representative forgetting tasks: Van Gogh style (top), NSFW (middle), and Church (bottom). Each column corresponds to a mainstream method. Across all settings, sensitive content or style is often only partially removed, with residual attributes, subje…

**Figure 15.** Images generated by various unlearning models in response to prompts containing …

**Figure 16.** Images generated by various unlearning models in response to prompts containing …

**Figure 17.** Nudity content reduced rate across different unlearning methods compared to the original (ORG) model. Each horizontal …

**Figure 18.** SafeRedir Transferability to Other Models. Visual examples demonstrating the transferability of SafeRedir to a range of popular diffusion backbones, including SD v1.5, Any v3, DL v1, OJ v1, RV v1.4, WD v1.3. The left block (a, Initial Model) shows that all original models generate NSFW content when prompted with explicit queries. The right block (b, +SafeRedir) demonstrates that integrating SafeRedir robu…

**Figure 19.** Forgetting Performance Improvements of Existing Baselines Brought by SafeRedir. Each column represents a different baseline model after applying SafeRedir, and each row corresponds to a prompt containing NSFW content. SafeRedir effectively removes residual explicit features, restores natural and well-clothed appearances, and preserves scene semantics and visual fidelity across all baselines. These results…

Original abstract

Image generation models (IGMs), while capable of producing impressive and creative content, often memorize a wide range of undesirable concepts from their training data, leading to the reproduction of unsafe content such as NSFW imagery and copyrighted artistic styles. Such behaviors pose persistent safety and compliance risks in real-world deployments and cannot be reliably mitigated by post-hoc filtering, owing to the limited robustness of such mechanisms and a lack of fine-grained semantic control. Recent unlearning methods seek to erase harmful concepts at the model level, which exhibit the limitations of requiring costly retraining, degrading the quality of benign generations, or failing to withstand prompt paraphrasing and adversarial attacks. To address these challenges, we introduce SafeRedir, a lightweight inference-time framework for robust unlearning via prompt embedding redirection. Without modifying the underlying IGMs, SafeRedir adaptively routes unsafe prompts toward safe semantic regions through token-level interventions in the embedding space. The framework comprises two core components: a latent-aware multi-modal safety classifier for identifying unsafe generation trajectories, and a token-level delta generator for precise semantic redirection, equipped with auxiliary predictors for token masking and adaptive scaling to localize and regulate the intervention. Empirical results across multiple representative unlearning tasks demonstrate that SafeRedir achieves effective unlearning capability, high semantic and perceptual preservation, robust image quality, and enhanced resistance to adversarial attacks. Furthermore, SafeRedir generalizes effectively across a variety of diffusion backbones and existing unlearned models, validating its plug-and-play compatibility and broad applicability. Code and data are available at https://github.com/ryliu68/SafeRedir.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SafeRedir gives a lightweight inference-time redirection method for unlearning unsafe concepts in diffusion models, but the classifier and delta generator lack the metrics needed to confirm they leave safe prompts untouched.

SafeRedir is an inference-time framework that spots unsafe prompt embeddings with a latent-aware multi-modal classifier and then applies a token-level delta generator, complete with masking and scaling predictors, to steer generations toward safer regions. The approach leaves the base model untouched and is positioned as plug-and-play across diffusion backbones and already-unlearned checkpoints. That combination of components is the clearest new piece relative to retraining-heavy unlearning or simple post-hoc filters. The paper also supplies a GitHub link, which helps with checking the implementation later.

The write-up does a straightforward job of listing why retraining is expensive and why filters fall short on adversarial prompts, then shows how the redirection targets those gaps. Claims of maintained semantic quality, image fidelity, and better attack resistance follow from the architecture description.

The soft spot is exactly the one flagged in the stress-test note. The abstract gives no precision, recall, or false-positive numbers for the safety classifier on balanced safe/unsafe sets, and there is no ablation isolating the delta generator's effect on benign prompts. Without those, it is difficult to know whether redirection introduces drift or artifacts on normal inputs, which undercuts the preservation guarantees. If the full paper contains detailed tables, ROC curves, or benign-prompt ablations, that would tighten the case; based on what is visible here, the central reliability claim rests on unshown evidence.

This work is aimed at engineers and researchers who need safety fixes that do not require full model retraining. A reader already working on inference-time controls or diffusion safety would find the architecture description useful even if they end up adapting the components. The paper shows clear thinking about the practical constraints and cites relevant prior lines of work without obvious circularity.

It deserves a serious referee because the problem is current and the proposed mechanism is distinct enough to warrant external scrutiny, even with the current gaps in quantitative validation.

Referee Report

3 major / 1 minor

Summary. The paper introduces SafeRedir, a lightweight inference-time framework for robust unlearning in image generation models via prompt embedding redirection. It comprises a latent-aware multi-modal safety classifier to identify unsafe generation trajectories and a token-level delta generator (with auxiliary masking and adaptive scaling predictors) to redirect embeddings toward safe semantic regions without modifying the underlying diffusion model. The authors claim that empirical results across multiple unlearning tasks show effective concept erasure, high semantic/perceptual preservation, robust image quality, enhanced adversarial resistance, and generalization across diffusion backbones and existing unlearned models.

Significance. If the central empirical claims hold with adequate quantitative support, SafeRedir would represent a practical advance by providing a plug-and-play alternative to retraining-based unlearning methods that leaves the base model unmodified, potentially enabling safer real-world deployment of image generation models while maintaining generation quality and attack robustness.

major comments (3)

[Abstract] Abstract: the claim of 'effective unlearning capability, high semantic and perceptual preservation, robust image quality, and enhanced resistance to adversarial attacks' is presented without any quantitative metrics, baselines, or specific results (e.g., no FID, CLIP scores, attack success rates, or classifier precision/recall), preventing assessment of whether the data actually support the stated conclusions.
[Method (latent-aware classifier)] Method description of the latent-aware multi-modal safety classifier: no accuracy metrics (precision, recall, FPR on balanced safe/unsafe prompt sets) or validation details are supplied to confirm that unsafe trajectories can be reliably identified; this directly bears on the preservation claims for benign prompts.
[Experiments] Experiments section: no ablation isolating the token-level delta generator's effect (masking and scaling predictors) on benign prompts is reported, leaving the risk of unintended semantic drift or quality degradation unquantified and undermining the 'high preservation' and 'plug-and-play' assertions.
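
For reference, the classifier metrics requested above follow directly from a confusion matrix. The sketch below is a generic illustration with made-up labels, where `True` means "unsafe"; the function name `classifier_metrics` is hypothetical and nothing here reflects results from the paper.

```python
from typing import Dict, Sequence


def classifier_metrics(
    labels: Sequence[bool], preds: Sequence[bool]
) -> Dict[str, float]:
    """Precision, recall and false-positive rate, treating True as 'unsafe'."""
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    tn = sum(1 for y, p in zip(labels, preds) if not y and not p)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
    }


# Balanced toy set: four unsafe and four safe prompts (made-up predictions).
labels = [True, True, True, True, False, False, False, False]
preds = [True, True, True, False, True, False, False, False]
metrics = classifier_metrics(labels, preds)
```

The false-positive rate is the figure most relevant to the preservation claims, since it estimates how often a benign prompt would be redirected.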

minor comments (1)

[Abstract] The GitHub link is provided but the manuscript does not indicate whether the released code includes the exact experimental configurations and random seeds used for the reported results.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to provide stronger quantitative support.

Referee: [Abstract] Abstract: the claim of 'effective unlearning capability, high semantic and perceptual preservation, robust image quality, and enhanced resistance to adversarial attacks' is presented without any quantitative metrics, baselines, or specific results (e.g., no FID, CLIP scores, attack success rates, or classifier precision/recall), preventing assessment of whether the data actually support the stated conclusions.

Authors: We agree that the abstract would be strengthened by including concrete quantitative results. In the revised version we will add specific metrics (e.g., FID, CLIP similarity, attack success rates, and classifier precision/recall) drawn from the experiments section to directly support the stated claims. revision: yes

Referee: [Method (latent-aware classifier)] Method description of the latent-aware multi-modal safety classifier: no accuracy metrics (precision, recall, FPR on balanced safe/unsafe prompt sets) or validation details are supplied to confirm that unsafe trajectories can be reliably identified; this directly bears on the preservation claims for benign prompts.

Authors: The referee correctly notes the absence of explicit classifier metrics in the method section. We will revise this section to report precision, recall, and FPR on balanced safe/unsafe prompt sets together with validation details, thereby clarifying the classifier's reliability and its limited impact on benign prompts. revision: yes

Referee: [Experiments] Experiments section: no ablation isolating the token-level delta generator's effect (masking and scaling predictors) on benign prompts is reported, leaving the risk of unintended semantic drift or quality degradation unquantified and undermining the 'high preservation' and 'plug-and-play' assertions.

Authors: We acknowledge that an ablation isolating the masking and scaling predictors on benign prompts is missing. In the revised experiments we will add this ablation, reporting CLIP scores, FID, and perceptual metrics on benign prompts with and without these components to quantify any semantic drift or quality impact. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents SafeRedir as an independent inference-time framework with two explicitly described components (latent-aware classifier and token-level delta generator) whose operation is not reduced to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. No equations, uniqueness theorems, or ansatzes are shown that equate outputs to inputs by construction. Empirical results are positioned as separate validation across backbones, making the central claims externally falsifiable rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to identify specific free parameters, axioms, or invented entities; the approach relies on standard components such as classifiers and generators without explicit new postulates.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "SafeRedir comprises two core components: a latent-aware multi-modal safety classifier for identifying unsafe generation trajectories, and a token-level delta generator for precise semantic redirection, equipped with auxiliary predictors for token masking and adaptive scaling"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Empirical results across multiple representative unlearning tasks demonstrate that SafeRedir achieves effective unlearning capability, high semantic and perceptual preservation"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 1 internal anchor

[1]

High- resolution image synthesis with latent diffusion models,

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High- resolution image synthesis with latent diffusion models,” inCVPR. IEEE, 2022, pp. 10 674–10 685

work page 2022
[2]

Dall-e 3: Text-to-image generation and editing,

OpenAI, “Dall-e 3: Text-to-image generation and editing,”OpenAI Technical Report, 2023

work page 2023
[3]

Photorealistic text-to-image diffusion models with deep language understanding,

C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, S. K. S. Ghasemipour, R. G. Lopes, B. K. Ayan, T. Salimans, J. Ho, D. J. Fleet, and M. Norouzi, “Photorealistic text-to-image diffusion models with deep language understanding,” inNeurIPS, 2022

work page 2022
[4]

Midjourney,

Midjourney, “Midjourney,” 2022, https://en.wikipedia.org/wiki/ Midjourney

work page 2022
[5]

Safegen: Mitigating unsafe content generation in text-to-image models,

X. Li, Y . Yang, J. Deng, C. Yan, Y . Chen, X. Ji, and W. Xu, “Safegen: Mitigating unsafe content generation in text-to-image models,” inCCS, 2024

work page 2024
[6]

Safe-clip: Removing nsfw concepts from vision-and-language models,

S. Poppi, T. Poppi, F. Cocchi, M. Cornia, L. Baraldi, R. Cucchiaraet al., “Safe-clip: Removing nsfw concepts from vision-and-language models,” inECCV, 2024

work page 2024
[7]

Unsafe diffusion: On the generation of unsafe images and hateful memes from text-to-image models,

Y . Qu, X. Shen, X. He, M. Backes, S. Zannettou, and Y . Zhang, “Unsafe diffusion: On the generation of unsafe images and hateful memes from text-to-image models,” inCCS. ACM, 2023, pp. 3403–3417

work page 2023
[8]

Regulation (eu) 2016/679 of the european parliament and of the council,

P. Regulation, “Regulation (eu) 2016/679 of the european parliament and of the council,”Regulation (eu), vol. 679, p. 2016, 2016

work page 2016
[9]

Erasing concepts from diffusion models,

R. Gandikota, J. Materzynska, J. Fiotto-Kaufman, and D. Bau, “Erasing concepts from diffusion models,” inICCV. IEEE, 2023, pp. 2426–2436

work page 2023
[10]

Unified concept editing in diffusion models,

R. Gandikota, H. Orgad, Y . Belinkov, J. Materzynska, and D. Bau, “Unified concept editing in diffusion models,” inWACV. IEEE, 2024, pp. 5099–5108

work page 2024
[11]

MACE: mass concept erasure in diffusion models,

S. Lu, Z. Wang, L. Li, Y . Liu, and A. W. Kong, “MACE: mass concept erasure in diffusion models,” inCVPR. IEEE, 2024, pp. 6430–6440

work page 2024
[12]

Mma-diffusion: Multimodal attack on diffusion models,

Y . Yang, R. Gao, X. Wang, T. Ho, N. Xu, and Q. Xu, “Mma-diffusion: Multimodal attack on diffusion models,” inCVPR. IEEE, 2024, pp. 7737–7746

work page 2024
[13]

Sneakyprompt: Jailbreaking text-to-image generative models,

Y . Yang, B. Hui, H. Yuan, N. Gong, and Y . Cao, “Sneakyprompt: Jailbreaking text-to-image generative models,” inS&P. IEEE, 2024, pp. 897–912

work page 2024
[14]

Surrogateprompt: Bypassing the safety filter of text-to-image models via substitution,

Z. Ba, J. Zhong, J. Lei, P. Cheng, Q. Wang, Z. Qin, Z. Wang, and K. Ren, “Surrogateprompt: Bypassing the safety filter of text-to-image models via substitution,” inCCS, B. Luo, X. Liao, J. Xu, E. Kirda, and D. Lie, Eds. ACM, 2024, pp. 1166–1180

work page 2024
[15]

Reliable and efficient concept erasure of text-to-image diffusion models,

C. Gong, K. Chen, Z. Wei, J. Chen, and Y . Jiang, “Reliable and efficient concept erasure of text-to-image diffusion models,” inECCV. Springer, 2024, pp. 73–88

work page 2024
[16]

Conceptprune: Concept editing in diffusion models via skilled neuron pruning,

R. Chavhan, D. Li, and T. M. Hospedales, “Conceptprune: Concept editing in diffusion models via skilled neuron pruning,” inICLR. OpenReview.net, 2025

work page 2025
[17]

Defensive unlearning with adversarial training for robust concept erasure in diffusion models,

Y . Zhang, X. Chen, J. Jia, Y . Zhang, C. Fan, J. Liu, M. Hong, K. Ding, and S. Liu, “Defensive unlearning with adversarial training for robust concept erasure in diffusion models,” inNeurIPS, 2024, pp. 36 748–36 776

work page 2024
[18]

Receler: Reliable concept erasing of text-to-image diffusion models via lightweight erasers,

C. Huang, K. Chang, C. Tsai, Y . Lai, F. Yang, and Y . F. Wang, “Receler: Reliable concept erasing of text-to-image diffusion models via lightweight erasers,” inECCV, vol. 15098. Springer, 2024, pp. 360–376

work page 2024
[19]

Localizing and editing knowledge in text-to-image generative models,

S. Basu, N. Zhao, V . I. Morariu, S. Feizi, and V . Manjunatha, “Localizing and editing knowledge in text-to-image generative models,” inICLR. OpenReview.net, 2024

work page 2024
[20]

Erasing concepts, steering generations: A comprehensive survey of concept suppression.arXiv preprint arXiv:2505.19398,

Y . Xie, P. Liu, and Z. Zhang, “Erasing concepts, steering genera- tions: A comprehensive survey of concept suppression,”arXiv preprint arXiv:2505.19398, 2025

work page arXiv 2025
[21]

To generate or not? safety-driven unlearned diffusion models are still easy to generate unsafe images ... for now,

Y . Zhang, J. Jia, X. Chen, A. Chen, Y . Zhang, J. Liu, K. Ding, and S. Liu, “To generate or not? safety-driven unlearned diffusion models are still easy to generate unsafe images ... for now,” inCVPR. IEEE, 2024, pp. 385–403. 14

work page 2024
[22]

Image can bring your memory back: A novel multi-modal guided attack against image generation model unlearning,

R. Liu, G. Li, T. Zhang, and S.-K. Ng, “Image can bring your memory back: A novel multi-modal guided attack against image generation model unlearning,”arXiv preprint arXiv:2507.07139, 2025

work page arXiv 2025
[23]

SSC-V AE: structured sparse coding based variational autoencoder for detail preserved image reconstruction,

H. Wang, L. Wang, Z. Wang, L. Ma, and Y . Luo, “SSC-V AE: structured sparse coding based variational autoencoder for detail preserved image reconstruction,” inAAAI, T. Walsh, J. Shah, and Z. Kolter, Eds. AAAI Press, 2025, pp. 7665–7673

work page 2025
[24]

Stargan v2: Diverse image synthesis for multiple domains,

Y . Choi, Y . Uh, J. Yoo, and J. Ha, “Stargan v2: Diverse image synthesis for multiple domains,” inCVPR. Computer Vision Foundation / IEEE, 2020, pp. 8185–8194

work page 2020
[25]

Styleflow: Attribute- conditioned exploration of stylegan-generated images using conditional continuous normalizing flows,

R. Abdal, P. Zhu, N. J. Mitra, and P. Wonka, “Styleflow: Attribute- conditioned exploration of stylegan-generated images using conditional continuous normalizing flows,”ACM Trans. Graph., vol. 40, no. 3, pp. 21:1–21:21, 2021

work page 2021
[26]

Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” inNeurIPS, 2020

work page 2020
[27]

Denoising diffusion implicit models,

J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” inICLR. OpenReview.net, 2021

work page 2021
[28]

U-net: Convolutional networks for biomedical image segmentation,

O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” inMICCAI, vol. 9351, 2015, pp. 234–241

work page 2015
[29]

LAION-5B: an open large-scale dataset for training next generation image-text models,

C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wight- man, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. Kundurthy, K. Crowson, L. Schmidt, R. Kaczmarczyk, and J. Jitsev, “LAION-5B: an open large-scale dataset for training next generation image-text models,” inNeurIPS, 2022

work page 2022
[30]

Stable diffusion v1.5,

CompVis, “Stable diffusion v1.5,” https://huggingface.co/ stable-diffusion-v1-5/stable-diffusion-v1-5, 2022

work page 2022
[31]

Extracting training data from diffusion models,

N. Carlini, J. Hayes, M. Nasr, M. Jagielski, V . Sehwag, F. Tram `er, B. Balle, D. Ippolito, and E. Wallace, “Extracting training data from diffusion models,” inUSENIX Security. USENIX Association, 2023, pp. 5253–5270

work page 2023
[32]

Machine unlearning,

L. Bourtoule, V . Chandrasekaran, C. A. Choquette-Choo, H. Jia, A. Travers, B. Zhang, D. Lie, and N. Papernot, “Machine unlearning,” inS&P. IEEE, 2021, pp. 141–159

work page 2021
[33]

Rethinking machine unlearning in image generation models,

R. Liu, W. Feng, T. Zhang, W. Zhou, X. Cheng, and S.-K. Ng, “Rethinking machine unlearning in image generation models,” inCCS, 2025

work page 2025
[34]

Athena: Unlearning spurious features via data filtering and model fine-tuning,

F. Sommeret al., “Athena: Unlearning spurious features via data filtering and model fine-tuning,”NeurIPS, 2022

work page 2022
[35]

Concept pinpoint eraser for text-to-image diffusion models via residual attention gate,

B. H. Lee, S. Lim, S. Lee, D. U. Kang, and S. Y . Chun, “Concept pinpoint eraser for text-to-image diffusion models via residual attention gate,” inICLR, 2025

work page 2025
[36]

Safe text-to-image generation: Simply sanitize the prompt embedding,

H. Qiu, G. Chen, M. Zhang, X. Zhang, X. You, and M. Yang, “Safe text-to-image generation: Simply sanitize the prompt embedding,”arXiv, 2024

work page 2024
[37]

Personalization as a shortcut for few-shot backdoor attack against text-to-image diffusion models,

Y . Huang, F. Juefei-Xu, Q. Guo, J. Zhang, Y . Wu, M. Hu, T. Li, G. Pu, and Y . Liu, “Personalization as a shortcut for few-shot backdoor attack against text-to-image diffusion models,” inAAAI, 2024, pp. 21 169–21 178

work page 2024
[38]

Yolov8: The next generation of yolo,

G. Jocher, A. Chaurasia, J. Qiu, and R. Stoken, “Yolov8: The next generation of yolo,” https://github.com/ultralytics/ultralytics, 2023

work page 2023
[39]

Ring-a-bell! how reliable are concept removal methods for diffusion models?

Y . Tsai, C. Hsu, C. Xie, C. Lin, J. Chen, B. Li, P. Chen, C. Yu, and C. Huang, “Ring-a-bell! how reliable are concept removal methods for diffusion models?” inICLR. OpenReview.net, 2024

work page 2024
[40]

Decision-based adversarial attacks: Reliable attacks against black-box machine learning models,

W. Brendel, J. Rauber, and M. Bethge, “Decision-based adversarial attacks: Reliable attacks against black-box machine learning models,” in ICLR, 2018

work page 2018
[41]

ADBA: approximation decision boundary approach for black-box adversarial attacks,

F. Wang, X. Zuo, H. Huang, and G. Chen, “ADBA: approximation decision boundary approach for black-box adversarial attacks,” inAAAI, 2025, pp. 7628–7636

[42] S. Wang, X. Ye, Q. Cheng, J. Duan, S. Li, J. Fu, X. Qiu, and X. Huang, “Safe inputs but unsafe output: Benchmarking cross-modality safety alignment of large vision-language models,” in NAACL. Association for Computational Linguistics, 2025, pp. 3563–3605.

[44] “GPT-4 Technical Report.” [Online]. Available: https://arxiv.org/abs/2303.08774

[45] Y. Wu, S. Zhou, M. Yang, L. Wang, W. Zhu, H. Chang, X. Zhou, and X. Yang, “Unlearning concepts in diffusion model via concept domain correction and concept preserving gradient,” in AAAI, 2025.

[46] admruul, “Anything-v3.0,” https://huggingface.co/admruul/anything-v3.0, 2022.

[47] dreamlike art, “Dreamlike diffusion 1.0,” https://huggingface.co/dreamlike-art/dreamlike-diffusion-1.0, 2022.

[48] PromptHero, “Openjourney v1,” https://huggingface.co/prompthero/openjourney, 2022.

[49] SG161222, “Realistic vision v1.4,” https://huggingface.co/SG161222/Realistic_Vision_V1.4, 2022.

[50] hakurei, “Waifu diffusion v1.3,” https://huggingface.co/hakurei/waifu-diffusion-v1-3, 2022.

[51] B. Praneeth, “Nudenet: Deep learning model for nudity detection,” https://github.com/notAI-tech/NudeNet, 2023.

[52] J. Ren, K. Chen, Y. Cui, S. Zeng, H. Liu, Y. Xing, J. Tang, and L. Lyu, “Six-cd: Benchmarking concept removals for benign text-to-image diffusion models,” in CVPR. IEEE, 2025, pp. 28769–28778.

[53] erax ai, “EraX-NSFW-V1.0: An open nsfw image classifier,” https://huggingface.co/erax-ai/EraX-NSFW-V1.0, 2023, accessed: 2025-04-10.

[54] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR. IEEE, 2016, pp. 770–778.

[55] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” in NeurIPS, 2017, pp. 6626–6637.

[56] H. Wu, Z. Zhang, W. Zhang, C. Chen, L. Liao, C. Li, Y. Gao, A. Wang, E. Zhang, W. Sun, Q. Yan, X. Min, G. Zhai, and W. Lin, “Q-align: Teaching lmms for visual scoring via discrete text-defined levels,” in ICML, 2024.

[57] J. Hessel, A. Holtzman, M. Forbes, R. L. Bras, and Y. Choi, “Clipscore: A reference-free evaluation metric for image captioning,” in EMNLP. Association for Computational Linguistics, 2021, pp. 7514–7528.

[58] Stability AI, “Stable diffusion 2.0 release,” https://stability.ai/blog/stable-diffusion-v2-release, 2022, accessed: 2025-05-30.

[59] ——, “Stable diffusion 2.1 on hugging face,” https://huggingface.co/stabilityai/stable-diffusion-2, 2023, accessed: 2025-05-30.

APPENDIX A
APPENDIX OVERVIEW

This appendix provides supplementary material that expands upon the main paper by presenting additional details, analyses, and experimental results omitted due to space constraints. All content herei...

Dataset Construction

We construct a dataset of semantically aligned safe and unsafe prompt pairs to enable supervised learning of SafeRedir. Each pair $(p_{\text{safe}}, p_{\text{unsafe}})$ is curated so that both prompts describe the same benign context but differ in the presence of a high-risk element. Taking the NSFW task as an example, the unsafe prompt may conta...
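The pairing idea can be illustrated with a minimal sketch. It assumes a template-based construction, which the paper does not confirm; the `PromptPair` class, the `make_pair` helper, the `{attr}` slot, and the example sentence are all hypothetical. Only the term swap ("naked" replaced by "well-clothed") follows the paper's NSFW example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptPair:
    """One aligned example: same context, differing in a single attribute."""
    safe: str
    unsafe: str


def make_pair(template: str, unsafe_term: str, safe_term: str) -> PromptPair:
    """Fill the hypothetical {attr} slot once with each term."""
    if "{attr}" not in template:
        raise ValueError("template needs an {attr} slot")
    return PromptPair(
        safe=template.format(attr=safe_term),
        unsafe=template.format(attr=unsafe_term),
    )


# Hypothetical template; the shared context stays identical across the pair.
pair = make_pair("a {attr} person reading in a park", "naked", "well-clothed")
print(pair.safe)
```

Keeping the context in a shared template is one simple way to guarantee that the two prompts differ only in the high-risk element.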

Training Objective

The SafeRedir model is optimized end-to-end using a multi-component objective designed to enable accurate unsafe content detection, fine-grained token-level guidance, and minimal semantic disruption to benign content. The overall training loss is given by:

$$\mathcal{L}_{\text{total}} = \lambda_{\text{cls}}\mathcal{L}_{\text{cls}} + \lambda_{\text{mse}}\mathcal{L}_{\text{mse}} + \lambda_{\text{cos}}\mathcal{L}_{\text{cos}} + \lambda_{\text{mask}}\mathcal{L}_{\text{mask}} + \lambda_{\alpha}\mathcal{L}_{\alpha}, \tag{14}$$

where $\lambda_{*}$ are ...
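As a minimal sketch of how the weighted sum in Eq. (14) combines its five terms, the snippet below treats each component loss as an already-computed scalar. The function name `total_loss`, the dictionary keys, and the numeric values are illustrative assumptions; the paper's actual weights and loss definitions are not reproduced here.

```python
# Term names mirror the subscripts in Eq. (14); the keys are assumptions.
LOSS_TERMS = ("cls", "mse", "cos", "mask", "alpha")


def total_loss(losses: dict, weights: dict) -> float:
    """Return the sum over terms of weight * loss, as in Eq. (14)."""
    missing = [k for k in LOSS_TERMS if k not in losses or k not in weights]
    if missing:
        raise KeyError(f"missing loss terms or weights: {missing}")
    return sum(weights[k] * losses[k] for k in LOSS_TERMS)


# Made-up scalars chosen so each weighted term equals 0.5.
example_losses = {"cls": 0.5, "mse": 0.25, "cos": 0.125, "mask": 1.0, "alpha": 2.0}
example_weights = {"cls": 1.0, "mse": 2.0, "cos": 4.0, "mask": 0.5, "alpha": 0.25}
print(total_loss(example_losses, example_weights))  # prints 2.5
```

In a real training loop the scalars would be framework tensors so that gradients flow through every term; plain floats are used here only to keep the arithmetic visible.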

Forgetting Qualitative Comparison: Due to page limitations, representative generations for NSFW and Van Gogh Style unlearning are presented in Figs. 15 and 16, respectively, in response to prompts containing the corresponding concepts. These qualitative results, consistent with our quantitative findings, demonstrate that SafeRedir not only achieves effective...

Adopted to Other Models

Fig. 18 presents qualitative results demonstrating the transferability of SafeRedir to a diverse set of community diffusion models, including SD v1.5, Any v3, DL v1, OJ v1, RV v1.4, and WD v1.3. Each row corresponds to a distinct NSFW prompt. The left block shows outputs from the original models, which consistently generate sensiti...

Enhancement of Existing Unlearning

Fig. 19 illustrates qualitative improvements achieved by integrating SafeRedir into ten representative unlearning methods. Across all cases, residual NSFW content is effectively removed, and visual or semantic artifacts introduced by the original methods are mitigated. SafeRedir enhances image realism, preserves scene co...

Core Inputs, Model Components, and Training Strategies

We conduct a comprehensive ablation study to quantify the contributions of each core element in SafeRedir across three evaluation dimensions: forgetting effectiveness (FSR), preservation (CSDR, YOLO), and image quality (FID, LPIPS, Q-Align, Laion aes). Specifically, we first analyze the importance ...

Robustness to Sampling Steps

A critical consideration for the practical deployment of safety-guided unlearning in diffusion models is its robustness to variation in inference-time sampling steps. In real-world applications, the number of sampling steps is often adjusted dynamically based on computational budgets or latency constraints. Therefore, it is...

Robustness to Sampling Scheduler

We further assess the performance of SafeRedir under different diffusion schedulers, as practical deployments often require switching between sampling algorithms to balance quality and efficiency. Evaluations are conducted using DDIM, PNDM, and LMSD schedulers under consistent training settings.

TABLE XV: Robustness o...

