Rethinking and Red-Teaming Protective Perturbation in Personalized Diffusion Models

Lichao Sun; Ruoxi Chen; Xun Chen; Yixin Liu

arXiv: 2406.18944 · v7 · pith:4PWB3TGQnew · submitted 2024-06-27 · 💻 cs.CV · cs.AI · cs.CR

Yixin Liu, Ruoxi Chen, Xun Chen, Lichao Sun

Pith reviewed 2026-05-24 00:24 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CR

keywords personalized diffusion models · protective perturbations · CLIP misalignment · shortcut learning · red-teaming · image restoration · contrastive decoupling · adversarial robustness

The pith

Protective adversarial perturbations work against personalized diffusion models by creating a CLIP embedding misalignment that triggers shortcut learning of noise patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that protective perturbations succeed because they misalign perturbed images from their text prompts inside the CLIP space. During fine-tuning the diffusion model therefore binds the added noise patterns to the unique subject identifier instead of the real visual content, so the model fails to generate the subject later. The authors test this shortcut-learning account by restoring image semantics with off-the-shelf tools and then applying contrastive decoupling that uses extra noise tokens to separate genuine concepts from spurious patterns. If the account holds, existing purification methods can be improved by explicitly reversing the misalignment rather than simply cleaning pixels. The work supplies both the diagnostic analysis and a concrete red-teaming pipeline that outperforms prior purification baselines while remaining robust to adaptive attacks.

Core claim

Adversarial perturbations induce a latent-space misalignment between images and their text prompts in the CLIP embedding space. This misalignment causes the model to erroneously associate noisy patterns with unique identifiers during fine-tuning, resulting in poor generalization. The authors therefore introduce a red-teaming framework that first restores images to realign their latent semantics and then applies contrastive decoupling learning with noise tokens to prevent the model from learning the spurious patterns.
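The noise-token idea can be pictured at the prompt level: the training prompt names both the subject identifier and a separate noise token, so spurious patterns have a place to attach other than the identifier, and the inference prompt excludes that token. The sketch below is only an illustration. The template wording, the token strings `sks` and `XX`, and both helper names are assumptions; this page does not give the paper's exact prompts.

```python
def instance_prompt(identifier, class_noun, noise_token):
    """Training prompt that binds noise to its own token (hypothetical template)."""
    return f"a photo of {identifier} {class_noun} with {noise_token} noisy pattern"


def inference_prompts(identifier, class_noun, noise_token):
    """Positive and negative prompts that steer generation away from the noise token."""
    positive = f"a photo of {identifier} {class_noun} without {noise_token} noisy pattern"
    negative = f"{noise_token} noisy pattern"
    return positive, negative


train_prompt = instance_prompt("sks", "person", "XX")
positive, negative = inference_prompts("sks", "person", "XX")
```

The design choice being illustrated is that the identifier and the noise token appear together during training but are pulled apart at inference, which is what "decoupling" refers to in the claim above.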

What carries the argument

CLIP embedding misalignment that forces shortcut learning of noise patterns rather than subject concepts during personalized fine-tuning.
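The misalignment can be made concrete as a drop in cosine similarity between an image embedding and its prompt embedding. The sketch below uses hand-picked toy vectors in place of real CLIP outputs; the function names and the numbers are illustrative assumptions, not the paper's measurement code.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_gap(clean_img, perturbed_img, text):
    """Drop in image-text alignment caused by the perturbation (positive = misaligned)."""
    return cosine_similarity(clean_img, text) - cosine_similarity(perturbed_img, text)


# Toy 3-d stand-ins for CLIP embeddings (hypothetical values).
text_emb = [1.0, 0.0, 0.0]
clean_emb = [0.9, 0.1, 0.0]
perturbed_emb = [0.3, 0.1, 0.9]

gap = alignment_gap(clean_emb, perturbed_emb, text_emb)
```

With real encoders, the same gap would be computed over embeddings of the clean image, the protected image, and the training prompt.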

If this is right

The proposed purification and contrastive decoupling steps outperform earlier over-purification baselines on standard protection benchmarks.
The same misalignment mechanism explains why current protective perturbations remain effective against simple restoration attacks.
A systematic red-teaming pipeline built on this analysis can be used to evaluate and strengthen future protection schemes.
The method stays effective even when the adversary adapts the perturbation knowing the red-teaming steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the misalignment account is correct, protection schemes that avoid CLIP-space shifts may prove harder to red-team than pixel-level noise alone.
The same shortcut-learning diagnosis could be tested on other conditional generators that rely on CLIP-style encoders.
Extending the contrastive decoupling step to multiple noise tokens might further isolate subject identity from any residual artifacts.

Load-bearing premise

The CLIP misalignment is the main causal driver of the protection effect, not merely a side effect, and can be reliably reversed by standard restoration plus contrastive decoupling without new failure modes.

What would settle it

A controlled test in which CLIP misalignment is removed yet the fine-tuned model still refuses to generate the protected subject, or in which the proposed restoration-plus-decoupling pipeline fails to restore generation quality on held-out adaptive perturbations.

Figures

Figures reproduced from arXiv: 2406.18944 by Lichao Sun, Ruoxi Chen, Xun Chen, Yixin Liu.

**Figure 2.** (a) The original causal graph representing the variable relationships in personalized diffusion

**Figure 3.** Latent 2D visualization and concept clas

**Figure 4.** Visualization of purified images that were originally protected by MetaCloak. Our method

**Figure 5.** Generations from models trained on: (left) clean data, (middle) perturbed data without

**Figure 6.** LIQE quality score of V∗. We present the LIQE (Zhang et al., 2023) quality score curve during fine-tuning under different settings, including clean training, vanilla training on perturbed data, training with CDL, and training with CodeSR+CDL in

**Figure 7.** More visualization on the learned personalized and noise concepts from trained models

**Figure 8.** Concept extraction with three different prompts from the trained model with vanilla training,

Original abstract

Personalized diffusion models (PDMs) have become prominent for adapting pre-trained text-to-image models to generate images of specific subjects using minimal training data. However, PDMs are susceptible to minor adversarial perturbations, leading to significant degradation when fine-tuned on corrupted datasets. These vulnerabilities are exploited to create protective perturbations that prevent unauthorized image generation. Existing purification methods attempt to red-team the protective perturbation to break the protection but often over-purify images, resulting in information loss. In this work, we conduct an in-depth analysis of the fine-tuning process of PDMs through the lens of shortcut learning. We hypothesize and empirically demonstrate that adversarial perturbations induce a latent-space misalignment between images and their text prompts in the CLIP embedding space. This misalignment causes the model to erroneously associate noisy patterns with unique identifiers during fine-tuning, resulting in poor generalization. Based on these insights, we propose a systematic red-teaming framework that includes data purification and contrastive decoupling learning. We first employ off-the-shelf image restoration techniques to realign images with their original semantic content in latent space. Then, we introduce contrastive decoupling learning with noise tokens to decouple the learning of personalized concepts from spurious noise patterns. Our study not only uncovers shortcut learning vulnerabilities in PDMs but also provides a thorough evaluation framework for developing stronger protection. Our extensive evaluation demonstrates its advantages over existing purification methods and its robustness against adaptive perturbations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims that adversarial perturbations on images used to fine-tune personalized diffusion models (PDMs) induce a misalignment in CLIP embedding space between the corrupted images and their text prompts. This misalignment is hypothesized to cause shortcut learning in which the model erroneously associates noisy patterns with the unique subject identifier, resulting in poor generalization and the observed protection effect. The authors propose a red-teaming framework that first applies off-the-shelf image restoration to realign semantics in latent space and then uses contrastive decoupling learning with noise tokens to separate personalized concepts from spurious patterns. They report that this approach outperforms existing purification methods and remains robust to adaptive perturbations.

Significance. If the causal role of CLIP misalignment is substantiated, the work would advance understanding of shortcut-learning vulnerabilities in PDMs and supply a concrete evaluation framework for stronger protective perturbations, with implications for privacy in text-to-image personalization.

major comments (1)

[Abstract] Abstract: the central claim that CLIP-space misalignment is the primary causal driver of the protection effect (inducing erroneous association of noise with identifiers) is presented as empirically demonstrated, yet no controls are described that isolate this mechanism from confounders such as general distribution shift or image degradation (e.g., targeted embedding interventions or ablations that hold image statistics fixed while varying only alignment). This assumption underpins both the diagnostic analysis and the contrastive-decoupling component.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the concern regarding the central claim in the abstract below.

Referee: [Abstract] Abstract: the central claim that CLIP-space misalignment is the primary causal driver of the protection effect (inducing erroneous association of noise with identifiers) is presented as empirically demonstrated, yet no controls are described that isolate this mechanism from confounders such as general distribution shift or image degradation (e.g., targeted embedding interventions or ablations that hold image statistics fixed while varying only alignment). This assumption underpins both the diagnostic analysis and the contrastive-decoupling component.

Authors: We acknowledge the referee's point that the abstract presents the causal role of CLIP misalignment as empirically demonstrated without explicitly describing controls that isolate it from confounders such as general distribution shift. Our current experiments establish a strong correlation between the induced misalignment and the observed shortcut learning during fine-tuning, supported by diagnostic visualizations and comparisons to non-protected baselines. However, we agree that additional targeted ablations would provide stronger evidence for causality. In the revised version, we will add a dedicated subsection with controls that hold image statistics fixed while varying only the alignment (including targeted embedding interventions), and we will revise the abstract to more precisely distinguish between hypothesis, observed correlation, and causal evidence.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical analysis self-contained

The paper frames its contribution as an empirical hypothesis test and method proposal: it observes CLIP-space misalignment during PDM fine-tuning, attributes protection effects to shortcut learning, and evaluates purification plus contrastive decoupling. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claim rests on experimental demonstration rather than any derivation that reduces by construction to its own inputs or prior author work. This is the expected non-finding for an empirical red-teaming study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that CLIP misalignment is the operative mechanism and on standard supervised fine-tuning assumptions; no explicit free parameters or invented entities are named in the abstract.

axioms (1)

Domain assumption: CLIP embedding misalignment is the primary cause of the observed protection effect during fine-tuning.
Invoked as the hypothesis that the work empirically demonstrates.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Paper passage: "adversarial perturbations induce a latent-space misalignment between images and their text prompts in the CLIP embedding space... shortcut learning vulnerabilities"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "causal graph... spurious correlations... Contrastive Decoupling Learning"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SyncBreaker: Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation
cs.CV · 2026-04 · unverdicted · novelty 6.0

SyncBreaker jointly attacks image and audio streams with Multi-Interval Sampling and Cross-Attention Fooling to degrade speech-driven talking head generation more than single-modality baselines.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · cited by 1 Pith paper

[1]

a photo of a [class noun]

following IP-adapter (Ye et al., 2023). We report the weighted average of them with a weighting factor on IMS-IP as 70. We compute all the mean scores for all generated images and instances. For the instance $i$ and its $j$-th metric, its $k$-th observation value is defined as $m_{i,j,k}$. For the $j$-th metric, the mean value is obtained with $\sum_{i,k} m_{i,j,k}/(N_i N_k)$, where...
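The per-metric mean described in this excerpt can be sketched directly. The excerpt is cut off before defining the normalizers, so the code assumes N_i is the number of instances and N_k the number of observations per instance; the nested-list layout `scores[i][j][k]` and the sample values are hypothetical.

```python
def metric_mean(scores, j):
    """Mean of metric j over all instances i and observations k.

    scores[i][j][k] holds the k-th observation of metric j for instance i.
    Assumes every instance has the same number of observations.
    """
    n_i = len(scores)
    n_k = len(scores[0][j])
    total = sum(scores[i][j][k] for i in range(n_i) for k in range(n_k))
    return total / (n_i * n_k)


# Two instances, two metrics, two observations each (made-up numbers).
scores = [
    [[0.2, 0.4], [1.0, 3.0]],
    [[0.6, 0.8], [5.0, 7.0]],
]
mean_metric_0 = metric_mean(scores, 0)
mean_metric_1 = metric_mean(scores, 1)
```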

work page 2023
[2]

a photo of sks person

For the evaluation phase, we set the number of inference steps to 100 with prompts “a photo of sks person” and “a smiling photo of sks person” during inference to generate 16 images per prompt. For all the settings, classifier-free guidance (Ho & Salimans, 2022) is turned on by default with a guidance scale of 7.5. For the implementation of baseline methods, plea...
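For readers unfamiliar with the guidance scale, classifier-free guidance is commonly implemented by extrapolating from the unconditional noise prediction toward the conditional one. The sketch below shows that combination on plain Python lists; the function name and the toy inputs are assumptions, and a real pipeline would apply the same arithmetic to model output tensors.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale=7.5):
    """Blend unconditional and conditional noise predictions element-wise."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy predictions: the second element agrees, so guidance leaves it unchanged.
guided = cfg_combine([0.0, 1.0], [1.0, 1.0])
```

A scale of 1.0 reproduces the conditional prediction; larger values push the sample harder toward the prompt.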

work page 2022
[3]

As evident from the figure, our proposed decoupled learning (CDL) approach significantly enhances the quality compared to the case with perturbations

This curve illustrates the evolution of image quality throughout the training process. As evident from the figure, our proposed decoupled learning (CDL) approach significantly enhances the quality compared to the case with perturbations. Moreover, when we combine CDL with input purification (CodeSR + CDL), the model achieves quality performance compar...

work page
[4]

a photo of V∗

This validates the effectiveness of our method in learning the correct concept-image correlations and decoupling the noise concept. Furthermore, from Fig. 8, we find that adding input purification (CodeSR) greatly boosts generation quality. Under the purification case, the contribution of CDL is more about decoupling the left-over background artifacts fro...

work page 2023
[5]

Gaussian Filtering is a well-known image-processing technique used to reduce image noise and detail by applying a Gaussian kernel

Gaussian Filtering. Gaussian filtering is a well-known image-processing technique that reduces image noise and detail by applying a Gaussian kernel. The high-frequency component of an adversarial perturbation can be smoothed by filtering. The kernel size is set to 5 following Van Le et al. (2023).
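A minimal, dependency-free sketch of this baseline on a grayscale image stored as a list of rows follows. Only the kernel size of 5 comes from the text; the standard deviation `sigma=1.0`, the edge-clamping policy, and all function names are assumptions made for illustration.

```python
import math


def gaussian_kernel(size=5, sigma=1.0):
    """Normalized 1-D Gaussian weights centered on the middle tap."""
    half = size // 2
    weights = [math.exp(-(x * x) / (2 * sigma * sigma)) for x in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve each row with the kernel, clamping indices at the borders."""
    half = len(kernel) // 2
    out = []
    for row in image:
        n = len(row)
        new_row = []
        for x in range(n):
            acc = 0.0
            for offset, w in enumerate(kernel):
                idx = min(max(x + offset - half, 0), n - 1)
                acc += w * row[idx]
            new_row.append(acc)
        out.append(new_row)
    return out


def transpose(image):
    return [list(col) for col in zip(*image)]


def gaussian_blur(image, size=5, sigma=1.0):
    """Separable 2-D blur: filter rows, then columns."""
    kernel = gaussian_kernel(size, sigma)
    horizontal = blur_rows(image, kernel)
    return transpose(blur_rows(transpose(horizontal), kernel))


# A single bright pixel spreads out and loses peak intensity.
spike = [[0.0] * 7 for _ in range(7)]
spike[3][3] = 1.0
blurred = gaussian_blur(spike)
```

The separable form is used because a 2-D Gaussian factors into two 1-D passes, which keeps the code short.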

work page 2023
[6]

Total Variation Minimization (TVM) (Wang et al., 2020). The main idea of TVM is to conduct image reconstruction based on the observation that the benign images should have low total variation. We implemented the TVM defense in the following steps: we first resized the instance image to 642 pixels, applied a random dropout mask with a 2 After optimization, t...
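The quantity that TVM drives down can be illustrated with a small function. This sketch computes the anisotropic total variation (sum of absolute differences between horizontal and vertical neighbors) of a grayscale image; it shows only the objective, not the reconstruction procedure from the excerpt, and the function name and test images are assumptions.

```python
def total_variation(image):
    """Anisotropic total variation of a 2-D grayscale image (list of rows)."""
    rows, cols = len(image), len(image[0])
    tv = 0.0
    for y in range(rows):
        for x in range(cols):
            if x + 1 < cols:
                tv += abs(image[y][x + 1] - image[y][x])
            if y + 1 < rows:
                tv += abs(image[y + 1][x] - image[y][x])
    return tv


smooth = [[0.5] * 4 for _ in range(4)]
checkerboard = [[float((x + y) % 2) for x in range(4)] for y in range(4)]

tv_smooth = total_variation(smooth)
tv_noisy = total_variation(checkerboard)
```

A flat image scores zero while a high-frequency pattern scores high, which is the intuition behind treating low total variation as a sign of a benign image.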

work page 2020
[7]

It involves transforming an image into a format that uses less storage space and reduces the image file’s size

JPEG Compression. It involves transforming an image into a format that uses less storage space and reduces the image file’s size. We set the JPEG quality to 75 following (Liu et al., 2024)

work page 2024
[8]

DiffPure (Nie et al., 2022). Diffusion Purification (DiffPure) first diffuses the adversarial example with a small amount of noise given a pre-defined timestep t following a forward diffusion process, where the adversarial noise is smoothed and then recovers the clean image through the reverse generative process. Depending on the type of diffusion model u...
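The forward (noising) half of this procedure has a closed form in DDPM-style models: x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε with ε drawn from a standard normal. The sketch below implements only that step on a flat list of pixel values; the reverse denoising pass needs a trained model and is omitted. The function name and the example value of ᾱ_t are assumptions.

```python
import math
import random


def forward_diffuse(x0, alpha_bar_t, rng):
    """Sample x_t from x_0 given the cumulative signal level alpha_bar_t in [0, 1]."""
    scale_signal = math.sqrt(alpha_bar_t)
    scale_noise = math.sqrt(1.0 - alpha_bar_t)
    return [scale_signal * v + scale_noise * rng.gauss(0.0, 1.0) for v in x0]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
pixels = [0.2, -0.4, 0.9]
slightly_noised = forward_diffuse(pixels, alpha_bar_t=0.95, rng=rng)
```

A small timestep corresponds to ᾱ_t close to 1, so most of the image content survives while fine-grained perturbations are drowned out by the added noise.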

work page 2022
[9]

a photo of [class_name], high quality, highres

for its superior performance. Since the SD model has the ability to input additional text prompts during the purification process, we investigate two variants with and without the usage of purified text prompting. For LatentDiffPure-∅, we set the text to null, while for LatentDiffPure, we set it as “a photo of [class_name], high quality, highres”

work page
[10]

DDSPure (Carlini et al., 2022). Similar to DiffPure (Nie et al., 2022), the main idea behind Diffusion Denoised Smoothing (DDS) is to find an optimal timestep that can maximally remove the adversarial perturbation via the SDEdit process (Meng et al., 2021). Given smoothing noise level δ, the optimal timestep $t^*$ is the one satisfying $\frac{1-\bar{\alpha}_{t^*}}{\bar{\alpha}_{t^*}} = \sigma^2$. Fo...
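Reading the timestep condition as matching the noise-to-signal ratio (1 − ᾱ_t)/ᾱ_t to σ², a search over a discrete schedule could be sketched as follows. The linear beta schedule with 1000 steps from 1e-4 to 0.02 is a common DDPM default and is assumed here, not taken from the excerpt; the function names are likewise hypothetical.

```python
def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def optimal_timestep(sigma, alpha_bars):
    """Index t whose ratio (1 - alpha_bar_t) / alpha_bar_t is closest to sigma**2."""
    target = sigma * sigma
    ratios = [(1.0 - a) / a for a in alpha_bars]
    return min(range(len(ratios)), key=lambda t: abs(ratios[t] - target))


alpha_bars = alpha_bar_schedule()
t_small = optimal_timestep(0.25, alpha_bars)
t_large = optimal_timestep(0.5, alpha_bars)
```

Because the ratio grows monotonically with t, a larger smoothing noise level maps to a later timestep.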

work page 2022
[11]

GrIDPure notices that for purification in defending protective perturbation, conducting iterative DiffPure with small steps can outperform one-shot DiffPure with larger steps

GrIDPure (Zheng et al., 2023). GrIDPure notices that for purification in defending protective perturbation, conducting iterative DiffPure with small steps can outperform one-shot DiffPure with larger steps. Furthermore, it suppresses the generative nature during diffusion purification by additionally splitting the image into multiple small grids that are ...

work page 2023
[12]

IMPRESS (Cao et al., 2024). The key idea of IMPRESS is to conduct purification that ensures latent consistency with visual similarity constraints: (1) the purified image should be visually similar to the perturbed image, and (2) the purified image should be consistent upon an LDM-based reconstruction. To quantify the similarity condition, IMPRESS uses the...

work page 2024
[13]

We implement this approach with three surrogates

of surrogate models finetuned from different pre-trained generators, which can lead to better transferability. We implement this approach with three surrogates. Besides, we follow the practice of a single model at a time in an interleaving manner to produce optimal perturbed data due to GPU memory constraints. MetaCloak. Despite the effectiveness of pertu...

work page 2024
