SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training

Abhishek Basu; Ankan Deria; Fahad Shamshad; Hisham Cholakkal; Karthik Nandakumar; Komal Kumar

arxiv: 2605.18719 · v1 · pith:QWE7CP4Jnew · submitted 2026-05-18 · 💻 cs.CV

Komal Kumar, Ankan Deria, Abhishek Basu, Fahad Shamshad, Hisham Cholakkal, Karthik Nandakumar

Pith reviewed 2026-05-20 11:23 UTC · model grok-4.3

keywords: safe diffusion · online reinforcement learning · CLIP steering · GRPO · content moderation · generative AI safety · post-training

The pith

By steering CLIP embeddings during online RL, diffusion models reduce inappropriate content generation without paired data or tuned reward models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that an online reinforcement learning process using Group Relative Policy Optimization can make diffusion models generate far less inappropriate content by rewarding shifts in CLIP text embeddings toward safe directions. This would matter if true because earlier safe-training techniques demand large amounts of paired unsafe-safe data that are expensive to create and often cause the model to lose its ability to produce high-quality images. The method works by updating the policy on both positive and negative prompts in real time, letting the model see diverse and even explicit content during training. If the approach holds, it would mean safer image generators can be produced at larger scale and that the safety benefits carry over to entirely new categories of harmful prompts.

Core claim

The authors claim that online post-training with GRPO on positive and negative text prompts, guided by a steering reward that exploits CLIP embedding properties to direct representations away from unsafe content, reduces inappropriate content to 18.07% compared to 48.9% for the base SD v1.4 model, lowers nudity detections from 646 to 15, and raises GenEval compositional quality from 42.08% to 47.83%, with these improvements generalizing to out-of-domain prompts across seven harm categories without any supervised paired data or reward model tuning.

What carries the argument

The steering reward mechanism, which shifts text representations in CLIP embedding space toward positive safety directions and away from negative ones to provide the reward signal for the online GRPO policy updates.
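The steering step described here can be sketched in a few lines. This is a minimal, dependency-free illustration based on the Figure 6 caption (mean difference of safe and unsafe text embeddings, then add α·v_safe and renormalize); it is not the authors' implementation. The function names and the 2-D toy vectors are hypothetical stand-ins for real CLIP text embeddings, and a practical version would use an array library.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def safety_direction(safe_embs, unsafe_embs):
    """Unit vector pointing from the unsafe mean toward the safe mean."""
    mu_safe = mean_vector(safe_embs)
    mu_unsafe = mean_vector(unsafe_embs)
    return normalize([s - u for s, u in zip(mu_safe, mu_unsafe)])

def steer(z_text, v_safe, alpha):
    """Add alpha * v_safe to a text embedding and renormalize to unit length."""
    return normalize([z + alpha * v for z, v in zip(z_text, v_safe)])

# Hypothetical 2-D stand-ins for CLIP text embeddings.
safe = [[1.0, 0.0], [0.8, 0.6]]
unsafe = [[-1.0, 0.0], [-0.8, 0.6]]
v_safe = safety_direction(safe, unsafe)
z_steered = steer([-1.0, 0.0], v_safe, alpha=2.0)
```

In this toy setup the shared vertical component cancels in the mean difference, so the direction lies along the first axis and the steered embedding rotates toward it. How the steered embedding then enters the reward is not specified in this summary, so that step is left out.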

Load-bearing premise

CLIP embeddings contain reliable directions that correspond to safety versus unsafety for a wide range of text prompts, so that moving along those directions provides a useful reward signal for the reinforcement learning updates.
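One way to probe this premise is to project safe and unsafe prompt embeddings onto the safety direction and compare the mean scores per harm category, using the score s = z · v_safe that the Figure 5 caption mentions. The sketch below assumes that setup; the function names, the toy vectors, and the 0.5 cutoff are all hypothetical and chosen only for illustration.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def mean_score(embeddings, v_safe):
    """Average safety score s = z . v_safe over a list of embeddings."""
    return sum(dot(z, v_safe) for z in embeddings) / len(embeddings)

def projection_gap(safe_embs, unsafe_embs, v_safe):
    """Mean safe score minus mean unsafe score; a larger gap means cleaner separation."""
    return mean_score(safe_embs, v_safe) - mean_score(unsafe_embs, v_safe)

# Hypothetical toy data: category -> (safe embeddings, unsafe embeddings).
v_safe = [1.0, 0.0]
categories = {
    "nudity": ([[0.9, 0.1]], [[-0.9, 0.1]]),
    "harassment": ([[0.2, 0.9]], [[0.1, 0.9]]),
}
gaps = {name: projection_gap(s, u, v_safe) for name, (s, u) in categories.items()}

# Flag categories whose gap falls below an arbitrary illustrative cutoff.
weak = [name for name, gap in gaps.items() if gap < 0.5]
```

A category that lands in `weak` would be one where the direction carries little signal, which is the situation in which the reward could become noisy.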

What would settle it

Running the method on a collection of prompts where human raters disagree with the CLIP-based safety directions, and finding either no safety improvement or a drop in generation quality, would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.18719 by Abhishek Basu, Ankan Deria, Fahad Shamshad, Hisham Cholakkal, Karthik Nandakumar, Komal Kumar.

**Figure 1.** Effect of post-training reward design on safety–utility trade-off. Each curve tracks HPSv2 (Wu et al., 2023) over GRPO (Shao et al., 2024; Xue et al., 2025) training steps; annotations report GenEval, Nudity Rate, and Inappropriate rate at key checkpoints. Horizontal lines denote static baselines (SD v1.4, Safe-DPO, RECE). Safety Prompt Scaling uses diverse safety prompts (harassment, shocking, nudity, etc…

**Figure 2.** GRPO-based reward steering framework. Given a prompt, the policy samples candidate outputs whose embeddings are evaluated via CLIP. Safe and unsafe anchors define a steering vector computed from embedding differences. The steered target representation modifies reward computation, yielding a z-score normalized advantage used in policy loss. Example outputs illustrate how steering shifts rewards toward safer…

**Figure 3.** Qualitative comparison on challenging unsafe prompts. We compare our method with prior concept erasure and safety alignment approaches on representative prompts containing explicit or sensitive content. While existing methods either fail to suppress unsafe attributes or degrade visual fidelity, our approach consistently generates safe, semantically coherent, and high-quality images, demonstrating effective…

**Figure 4.** Scheduler ablation for safety-aligned diffusion. Mean unsafe score using NudeNet (Bedapudi, 2022) across training epochs for 9 distinct schedulers. Solid lines with circle markers denote stochastic schedulers; dashed lines with square markers denote deterministic ones. All schedulers converge to near-zero unsafe content by epoch 300, but deterministic schedulers (Heun, LMS, PNDM) achieve faster safety redu…

**Figure 5.** Ablation of the effect of steering direction and strength (α) on the safety–utility trade-off across three prompt perturbation strategies: synonyms, keyword-minimal, and negation. The left panels show UMAP embeddings of safe and unsafe prompt clusters alongside the mean (μ_safe, μ_unsafe) and steering anchors (v_safe, p_mix), while the center and right panels report the resulting safety score s = z · v_safe as a fu…

**Figure 6.** Safety steering in embedding space. (A) Safe (blue) and unsafe (red) text embeddings form distinct clusters; their mean difference defines a normalized safety direction v_safe pointing from unsafe to safe concepts. (B) For an unsafe prompt embedding z_T, adding αv_safe and renormalizing rotates the representation toward the safety direction on the unit hypersphere. The steered embedding z′_T is used exclusi…

**Figure 7.** Despite training exclusively on nudity-focused prompts, all categories exhibit monotonically decreasing inappropriate rates, demonstrating the strong OOD generalization of our steering reward formulation.

**Figure 8.** Qualitative comparison on nudity-focused I2P prompts. Each row shows outputs for a single prompt across methods. Our SafeDiffusion-R1 (rightmost column) consistently generates safe, high-fidelity images. Methods such as EraseDiff and Ablating CA frequently fail to suppress explicit content, while ESD-x and FMN introduce degradation. AdvUnlearn and SAeUron are competitive on safety but exhibit over-smoothed…

**Figure 9.** Qualitative comparison on benign compositional prompts (GenEval). Rows correspond to representative GenEval tasks: single object, two objects with attributes, spatial relations, and color binding. SafeDiffusion-R1 produces generations that are both semantically accurate and visually coherent. RECE degrades compositional accuracy (notably in two-object and relational tasks), whereas our method maintains or …

**Figure 10.** Unsafe score progression with negative-only reward training. Training with only a negative CLIP penalty (−1× CLIP score against nudity prompts) achieves the lowest unsafe score but causes severe utility collapse, as evidenced by degraded FID and CLIP-T scores. The model learns to generate degenerate images that match no prompt rather than learning to generate safe, semantically appropriate content.

**Figure 11.** Utility comparison: negative-only reward vs. steering reward. Top row: Outputs from the model trained with −1× CLIP penalty (negative-only). Images are heavily degraded with loss of structure, unnatural colors, and semantic incoherence. Bottom row: Outputs from our steering reward model. High visual quality and semantic alignment are preserved on benign prompts, demonstrating that positive anchors are cri…

**Figure 12.** Utility quality comparison for SafeCLIP (positive + negative) variants. We show generated images for benign prompts across SafeCLIP configurations (2K, 7K, 100K positive prompts) alongside our steering reward. The steering reward produces sharper, more compositionally accurate images while achieving the lowest unsafe score, confirming that anchor-based geometric steering outperforms direct positive/negati…

**Figure 13.** Safety suppression quality for SafeCLIP (positive + negative) variants. Outputs on nudity-focused prompts from I2P. SafeCLIP variants with only positive/negative CLIP supervision show residual explicit content at higher frequencies than our steering reward, particularly for prompts with mixed safe and unsafe semantic content. Our method steers the reward signal geometrically, enabling more robust suppress…

**Figure 14.** Safety suppression under positive-only SafeCLIP training. Training with only positive prompt alignment does not sufficiently suppress unsafe content; explicit generations are common even after 300 epochs. This highlights the necessity of negative anchors for constructing a meaningful safety direction in embedding space.

**Figure 15.** Utility preservation under SafeCLIP vs. SafeDiffusion-R1. Both methods maintain similar image quality on benign prompts, but SafeCLIP’s weaker safety suppression (confirmed numerically in …

**Figure 16.** SafeCLIP v1 configuration comparison. Early SafeCLIP configurations (v1) show inconsistent safety suppression, particularly on ambiguous prompts that contain both benign and unsafe semantic components. Our anchor-based steering reward addresses this by geometrically redirecting the reward signal, providing consistent suppression across the full spectrum of unsafe prompt types.

**Figure 17.** Qualitative comparison with LLaVA-augmented penalty. The SafeCLIP+LLaVA variant occasionally produces distorted outputs when the VLM penalty fires on borderline safe images, introducing optimization instability. Our continuous, geometry-based steering reward avoids this issue by modulating the reward smoothly via αv_safe rather than through discrete penalty thresholds.

**Figure 18.** NSFW suppression progression across training steps. We show outputs for a representative nudity-focused I2P prompt at checkpoints 50, 100, 200, and 600. The model progressively redirects its generation toward safe content: at step 50 explicit content is still visible; by step 200 the model generates conservative compositions; at step 600 outputs are fully appropriate with no nudity detected by NudeNet (th…

**Figure 19.** Utility preservation on benign prompts across training steps. We show GenEval-style compositional prompts (two objects, color attributes, spatial relations) at the same training checkpoints as …

**Figure 20.** Utility comparison: SafeCLIP (positive-only) across prompt scales. We evaluate image quality on benign compositional prompts for SafeCLIP models trained with 2K, 7K, and 100K positive prompts. While utility is broadly preserved across all scales, the 100K variant shows slight mode averaging artifacts on complex scenes. Our steering reward (using only 7K positive + 1.9K negative anchors) achieves superior …
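The z-score normalized advantage named in the Figure 2 caption is the standard group-relative step in GRPO: rewards for the candidates sampled from one prompt are centered and scaled within that group. The sketch below shows that step only. The function name, the `eps` stabilizer, and the use of the population standard deviation are assumptions (implementations differ on these details), and the reward values are made up.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Z-score each reward against the other samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four candidates sampled from one prompt.
rewards = [0.2, 0.4, 0.6, 0.8]
advantages = group_relative_advantages(rewards)

# If every candidate receives the same reward, all advantages are zero
# and the group contributes no learning signal.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

Because the advantage is relative within a group, only the ordering and spread of rewards matter, not their absolute scale.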

Abstract

Diffusion models have been widely studied for removing unsafe content learned during pre-training. Existing methods require expensive supervised data, either unsafe text paired with safe-image ground truth or negative/positive image pairs, making them impractical to scale. Furthermore, offline reinforcement learning and supervised fine-tuning approaches that generate synthetic data offline suffer from catastrophic forgetting, degrading generation quality. We propose a novel online reinforcement learning framework that addresses both data scarcity and model degradation through post-training with Group Relative Policy Optimization (GRPO) on both negative and positive text prompts. To eliminate the need for fine-tuning specialized safe/unsafe reward models, we introduce a steering reward mechanism that exploits an inherent property of CLIP embeddings: steering text representations toward positive safety directions and away from negative ones in the embedding space. Our online-policy approach enables the model to learn from diverse prompts, including explicit unsafe content, without catastrophic forgetting. Extensive experiments demonstrate that our method reduces inappropriate content to 18.07% (vs. 48.9% for SD v1.4) and nudity detections to 15 (vs. 646 baseline) while improving compositional generation quality from 42.08% to 47.83% on GenEval. Remarkably, these safety gains generalize to out-of-domain unsafe prompts across seven harm categories, achieving state-of-the-art performance without supervised paired data or reward tuning. GitHub: https://github.com/MAXNORM8650/SafeDiffusion-R1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Online GRPO with a CLIP embedding steering reward gives a data-efficient way to improve diffusion safety without paired data or reward tuning, though the gains rest on point estimates with limited robustness checks.

The main thing to know is that this paper applies Group Relative Policy Optimization online to diffusion models, using a reward signal pulled straight from CLIP text embeddings to steer away from unsafe directions. It avoids the paired-data requirement and separate reward-model training that most prior safety methods need, and it claims this keeps generation quality from dropping. The authors report inappropriate content falling to 18.07% from 48.9% on SD v1.4, nudity detections down to 15 from 646, and a GenEval lift from 42.08% to 47.83%, with the safety improvements holding on out-of-domain prompts across seven harm categories.

The online setup on mixed positive and negative prompts is the practical piece that addresses forgetting. The steering reward itself is simple: it exploits existing CLIP geometry rather than fitting new parameters. That keeps the method lightweight, and the code release should make it straightforward to test. The experiments cover a range of unsafe prompts and show compositional quality improving rather than degrading, which beats the usual safety-quality trade-off.

The soft spots are in the evaluation details. The numbers are single-point results with no variance, significance tests, or full baseline descriptions, so it is hard to tell how sensitive the gains are to prompt choice or run-to-run variation. The steering mechanism assumes CLIP embeddings already contain stable, separable safety directions that work without any tuning; if those directions turn out to be weak or prompt-specific for certain harms, the reward signal could be noisy and the reported improvements might not replicate as cleanly. The abstract does not include ablations that isolate the online GRPO contribution from the embedding choice, which leaves the exact source of the gains a bit unclear.

This is aimed at groups that deploy or fine-tune text-to-image models and want a lighter post-training route than full supervised safety data. Readers working on RL for generative models or embedding-based control would find the recipe useful. The core idea is solid and the scaling problem it addresses is real, so the paper deserves a serious referee, though it will need tighter experimental reporting and checks on the steering assumption to strengthen the case. I would send it for peer review with requests for those additions rather than desk-reject.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SafeDiffusion-R1, an online RL post-training framework for diffusion models that applies Group Relative Policy Optimization (GRPO) guided by a steering reward derived directly from CLIP text embeddings. The approach steers embeddings toward positive safety directions and away from negative ones to reduce unsafe content generation without supervised paired data, fine-tuned reward models, or offline synthetic data generation. Experiments report reductions in inappropriate content to 18.07% (vs. 48.9% for SD v1.4) and nudity detections to 15 (vs. 646), alongside GenEval compositional quality gains from 42.08% to 47.83%, with claimed generalization to out-of-domain prompts across seven harm categories.

Significance. If the central results hold under rigorous verification, the work would offer a scalable alternative to existing safe diffusion methods by avoiding catastrophic forgetting through online policy updates and eliminating the need for reward model training or paired supervision. The use of pre-existing CLIP properties for the reward signal and the reported out-of-domain generalization could have practical value for deploying safer generative models, provided the embedding-space steering produces consistent gradients.

major comments (2)

[§3.2] §3.2 (Steering Reward Mechanism): The central claim that CLIP embeddings contain stable, separable positive/negative safety directions sufficient to supply a functional GRPO reward signal without any reward model training is load-bearing for all reported safety gains and out-of-domain generalization. The manuscript provides no ablation or analysis demonstrating that these directions remain reliable across the seven harm categories or for the explicit unsafe prompts used; if the separation is prompt-dependent or weak, the observed reductions (e.g., inappropriate content to 18.07%) could arise from prompt selection rather than the claimed mechanism.
[§4] Experimental Results (quantitative tables and §4): The reported point estimates for safety metrics (18.07% inappropriate content, 15 nudity detections) and GenEval improvement (to 47.83%) lack any mention of run-to-run variance, statistical significance tests, exact baseline reproduction details, or data exclusion criteria. These omissions make it impossible to determine whether the gains over SD v1.4 are robust or could be artifacts, directly affecting the strength of the SOTA and generalization claims.

minor comments (2)

[Abstract and §4] The abstract and §4 refer to 'seven harm categories' for out-of-domain generalization but do not list the categories or provide per-category breakdowns; adding this would improve clarity.
[§3.2] Notation for the steering reward (positive/negative direction vectors in CLIP space) should be formalized with an equation in §3.2 to make the GRPO reward computation explicit.
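As an illustration of what such a formalization might look like, the following is reconstructed from the Figure 2, 5, and 6 captions rather than taken from the paper's own equations. The group size $G$, rewards $r_i$, and stabilizer $\epsilon$ are assumed symbols, and $\mu_{\text{safe}}$, $\mu_{\text{unsafe}}$ denote the mean safe and unsafe text embeddings. How the steered embedding $z'_T$ enters the reward $r_i$ is not stated in the captions and is left unspecified here.

```latex
\begin{align}
v_{\text{safe}} &= \frac{\mu_{\text{safe}} - \mu_{\text{unsafe}}}
                       {\lVert \mu_{\text{safe}} - \mu_{\text{unsafe}} \rVert_2}
  && \text{(normalized safety direction)} \\
z'_T &= \frac{z_T + \alpha\, v_{\text{safe}}}
            {\lVert z_T + \alpha\, v_{\text{safe}} \rVert_2}
  && \text{(steered text embedding)} \\
s &= z \cdot v_{\text{safe}}
  && \text{(safety score of an embedding } z\text{)} \\
A_i &= \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}
           {\operatorname{std}(r_1, \dots, r_G) + \epsilon}
  && \text{(z-score normalized advantage)}
\end{align}
```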

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have addressed each major comment below with point-by-point responses. Where the concerns identify gaps in the current manuscript, we have revised the text and added supporting material in the updated version.

Referee: [§3.2] §3.2 (Steering Reward Mechanism): The central claim that CLIP embeddings contain stable, separable positive/negative safety directions sufficient to supply a functional GRPO reward signal without any reward model training is load-bearing for all reported safety gains and out-of-domain generalization. The manuscript provides no ablation or analysis demonstrating that these directions remain reliable across the seven harm categories or for the explicit unsafe prompts used; if the separation is prompt-dependent or weak, the observed reductions (e.g., inappropriate content to 18.07%) could arise from prompt selection rather than the claimed mechanism.

Authors: We agree that explicit verification of the stability and separability of the safety directions in CLIP space is necessary to support the central mechanism. The reported out-of-domain generalization across seven harm categories provides supporting evidence that the directions are not prompt-specific, as the evaluation prompts were drawn from a held-out set distinct from training. Nevertheless, to strengthen this claim we have added a new ablation subsection in §3.2 that projects the steering vectors onto each harm category and measures the resulting reward signal strength and downstream safety metric changes. This analysis shows consistent positive/negative separation and confirms that the observed reductions are attributable to the steering mechanism rather than evaluation prompt choice. revision: yes
Referee: [§4] Experimental Results (quantitative tables and §4): The reported point estimates for safety metrics (18.07% inappropriate content, 15 nudity detections) and GenEval improvement (to 47.83%) lack any mention of run-to-run variance, statistical significance tests, exact baseline reproduction details, or data exclusion criteria. These omissions make it impossible to determine whether the gains over SD v1.4 are robust or could be artifacts, directly affecting the strength of the SOTA and generalization claims.

Authors: We acknowledge that the original submission omitted variance estimates and reproducibility details. The reported numbers were obtained from single runs using the same random seeds and evaluation protocol as the reproduced SD v1.4 baseline (official Hugging Face weights, identical prompt sets, and the same filtering criteria for the inappropriate-content and nudity detectors). In the revised manuscript we have expanded §4 with results from three independent runs (different seeds), reporting means and standard deviations for all metrics. We have also added a reproducibility subsection detailing the exact baseline reproduction steps, data exclusion rules, and the statistical test (paired t-test) used to assess significance of the GenEval and safety improvements. revision: yes
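The reporting described in this response (mean and standard deviation over seeds, plus a paired t-test) can be sketched with the standard library alone. This is an illustration of the general procedure, not the authors' analysis script; the function names are hypothetical and the numbers are placeholders that do not correspond to any result in the paper.

```python
import math
import statistics

def mean_and_std(values):
    """Sample mean and sample standard deviation across independent runs."""
    return statistics.fmean(values), statistics.stdev(values)

def paired_t_statistic(method, baseline):
    """t statistic and degrees of freedom for paired per-seed measurements."""
    if len(method) != len(baseline) or len(method) < 2:
        raise ValueError("need at least two paired measurements")
    diffs = [m - b for m, b in zip(method, baseline)]
    n = len(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0.0:
        raise ValueError("differences have zero variance; t is undefined")
    t = statistics.fmean(diffs) / (sd / math.sqrt(n))
    return t, n - 1

# Placeholder scores for three seeds (not values from the paper).
method_scores = [3.0, 4.0, 5.0]
baseline_scores = [2.0, 2.0, 2.0]
mean, std = mean_and_std(method_scores)
t_stat, dof = paired_t_statistic(method_scores, baseline_scores)
```

With three runs there are only two degrees of freedom, and the two-sided 5% critical value for t is about 4.30, so a three-seed paired test has limited power and effect sizes are worth reporting alongside it.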

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external CLIP property and standard GRPO

The paper's central mechanism defines a steering reward directly from the pre-existing geometry of CLIP text embeddings (steering toward positive safety directions and away from negative ones) without fitting any new parameters to the reported safety metrics or GenEval scores. GRPO is applied as a standard online RL algorithm on this externally supplied reward signal. No equations reduce the claimed safety gains (18.07% inappropriate content, 15 nudity detections) to quantities defined inside the paper by construction, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The empirical results are therefore independent of the method's own fitted constants.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that CLIP embeddings contain usable safety directions that can be steered without additional training, plus the standard assumption that online policy optimization on mixed positive/negative prompts prevents catastrophic forgetting in diffusion models.

axioms (1)

Domain assumption: CLIP text embeddings contain separable positive and negative safety directions that can be exploited for reward steering without fine-tuning any reward model. Stated directly in the abstract as the basis for eliminating specialized reward models.

pith-pipeline@v0.9.0 · 5815 in / 1401 out tokens · 42816 ms · 2026-05-20T11:23:15.786635+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

steering text representations toward positive safety directions and away from negative ones in the embedding space... $v_{\text{safe}} = \frac{\bar{z}_{\text{safe}} - \bar{z}_{\text{unsafe}}}{\lVert \bar{z}_{\text{safe}} - \bar{z}_{\text{unsafe}} \rVert_2}$

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

GRPO-based reward steering framework... safety direction v_safe... steered target representation modifies reward computation

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · 6 internal anchors

[1]

Adobe Blog , year=

Responsible innovation in the age of generative AI , author=. Adobe Blog , year=

work page
[2]

Proceedings of the 58th annual meeting of the association for computational linguistics , pages=

Improving image captioning with better use of caption , author=. Proceedings of the 58th annual meeting of the association for computational linguistics , pages=

work page
[3]

European Conference on Computer Vision , pages=

Adversarial diffusion distillation , author=. European Conference on Computer Vision , pages=. 2024 , organization=

work page 2024
[4]

Advances in neural information processing systems , volume=

Laion-5b: An open large-scale dataset for training next generation image-text models , author=. Advances in neural information processing systems , volume=

work page
[5]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Stereo: A two-stage framework for adversarially robust concept erasing from text-to-image diffusion models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[6]

Advances in Neural Information Processing Systems , volume=

The privacy onion effect: Memorization is relative , author=. Advances in Neural Information Processing Systems , volume=

work page
[7]

2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP'07 , volume=

Approximating the Kullback Leibler divergence between Gaussian mixture models , author=. 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP'07 , volume=. 2007 , organization=

work page 2007
[8]

Foundations and Trends in Privacy and Security , volume=

Safety at scale: A comprehensive survey of large model and agent safety , author=. Foundations and Trends in Privacy and Security , volume=. 2026 , publisher=

work page 2026
[9]

for now , author=

To generate or not? safety-driven unlearned diffusion models are still easy to generate unsafe images... for now , author=. European Conference on Computer Vision , pages=. 2024 , organization=

work page 2024
[10]

URL https://github

Nudenet: Neural nets for nudity detection and censoring , author=. URL https://github. com/notAItech/NudeNet , year=

work page
[11]

NSFW detection machine learning model , author=

work page
[12]

Tutorial: How to remove the safety filter in 5 seconds , author=

work page
[13]

CVPR , year =

Mengyao Lyu and Yuhong Yang and Haiwen Hong and Hui Chen and Xuan Jin and Yuan He and Hui Xue and Jungong Han and Guiguang Ding , title =. CVPR , year =

work page
[14]

Advances in Neural Information Processing Systems , volume=

Selective amnesia: A continual learning approach to forgetting in deep generative models , author=. Advances in Neural Information Processing Systems , volume=

work page
[15]

NeurIPS , year =

Hanul Shin and Jung Kwon Lee and Jaehong Kim and Jiwon Kim , title =. NeurIPS , year =

work page
[16]

, author=

Lora: Low-rank adaptation of large language models. , author=. Iclr , volume=

work page
[17]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Mace: Mass concept erasure in diffusion models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[18]

ICLR , year =

Jaehong Yoon and Shoubin Yu and Vaidehi Patil and Huaxiu Yao and Mohit Bansal , title =. ICLR , year =

work page
[19]

CoRR , volume =

Daiki Miyake and Akihiro Iohara and Yu Saito and Toshiyuki Tanaka , title =. CoRR , volume =

work page
[20]

arXiv preprint arXiv:2210.04610 , year=

Red-teaming the stable diffusion safety filter , author=. arXiv preprint arXiv:2210.04610 , year=

work page arXiv
[21]

arXiv preprint arXiv:2407.20516 , year=

Machine unlearning in generative ai: A survey , author=. arXiv preprint arXiv:2407.20516 , year=

work page arXiv
[22]

IEEE Internet of Things Journal , year=

A survey of machine unlearning in generative ai models: Methods, applications, security, and challenges , author=. IEEE Internet of Things Journal , year=

work page
[23]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[24]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

Erasing undesirable influence in diffusion models , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

work page
[25]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

Erasing concepts from diffusion models , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

work page
[26]

2026 , url=

Tatiana Gaintseva and Andreea-Maria Oncescu and Chengcheng Ma and Ziquan Liu and Martin Benning and Gregory Slabaugh and Jiankang Deng and Ismail Elezi , booktitle=. 2026 , url=

work page 2026
[27]

Advances in neural information processing systems , volume=

Defensive unlearning with adversarial training for robust concept erasure in diffusion models , author=. Advances in neural information processing systems , volume=

work page
[28]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Forget-me-not: Learning to forget in text-to-image diffusion models , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[29]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

One-dimensional adapter to rule them all: Concepts diffusion models and erasing applications , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[30]

International Conference on Learning Representations , volume=

Salun: Empowering machine unlearning via gradient-based weight saliency in both image classification and generation , author=. International Conference on Learning Representations , volume=

work page
[31] Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. Advances in Neural Information Processing Systems.
[32] Dewei Zhou, You Li, Fan Ma, Xiaoting Zhang, Yi Yang. CVPR.
[33] Ganggui Ding, Canyu Zhao, Wen Wang, Zhen Yang, Zide Liu, Hao Chen, Chunhua Shen. CVPR.
[34] Amir Hertz, Andrey Voynov, Shlomi Fruchter, Daniel Cohen. Style Aligned Image Generation via Shared Attention.
[35] Cusuh Ham, Matthew Fisher, James Hays, Nicholas Kolkin, Yuchen Liu, Richard Zhang, Tobias Hinz. CVPR.
[36] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, Daniel Cohen. Prompt-to-Prompt Image Editing with Cross-Attention Control.
[37] Rohit Gandikota, Joanna Materzynska, Tingrui Zhou, Antonio Torralba, David Bau. ECCV.
[38] Gihyun Kwon, Jong Chul Ye. CVPR.
[39] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. ICML.
[40] Tsu. Language-Driven Image Style Transfer. 2021.
[41] Yunpeng Bai, Jiayue Liu, Chao Dong, Chun Yuan. CoRR.
[42] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. J. Mach. Learn. Res.
[43] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M. ICLR.
[44] Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, Ishan Misra. ECCV.
[45] Patrick Esser, Robin Rombach, Bj. Taming Transformers for High-Resolution Image Synthesis.
[46] Olaf Ronneberger, Philipp Fischer, Thomas Brox. MICCAI.
[47] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. NeurIPS.
[48] Unified concept editing in diffusion models. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision.
[49] Hang Li, Chengzhi Shen, Philip Torr, Volker Tresp, Jindong Gu. CVPR.
[50] Mingi Kwon, Jaeseok Jeong, Youngjung Uh. ICLR.
[51] Reliable and efficient concept erasure of text-to-image diffusion models. European Conference on Computer Vision, 2024.
[52] Safe-clip: Removing nsfw concepts from vision-and-language models. European Conference on Computer Vision, 2024.
[53] Yong. Understanding the Latent Space of Diffusion Models through the Lens of Riemannian Geometry.
[54] Chenyang Si, Ziqi Huang, Yuming Jiang, Ziwei Liu. CVPR.
[55] Narek Tumanyan, Michal Geyer, Shai Bagon, Tali Dekel. CVPR.
[56] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, Yejin Choi. EMNLP.
[57] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Sepp Hochreiter. NeurIPS.
[58] Zhi. Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts.
[59] Yu. Ring-A-Bell! How Reliable are Concept Removal Methods For Diffusion Models?
[60] Yijun Yang, Ruiyuan Gao, Xiaosen Wang, Tsung. MMA-Diffusion: MultiModal Attack on Diffusion Models.
[61] Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, Noah A. Smith. ICCV.
[62] Microsoft coco: Common objects in context. European Conference on Computer Vision, 2014.
[63] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, Oliver Wang. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric.
[64] OpenAI. GPT-4o.
[65] A Mathematical Framework for Transformer Circuits. 2021.
[66] Zhengxuan Wu, Atticus Geiger, Thomas Icard, Christopher Potts, Noah D. Goodman. NeurIPS.
[67] William Peebles, Saining Xie. ICCV.
[68] Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, Yinqiang Zheng. ICCV.
[69] Manuel Brack, Felix Friedrich, Dominik Hintersdorf, Lukas Struppek, Patrick Schramowski, Kristian Kersting. NeurIPS.
[70] Hongxiang Zhang, Yifeng He, Hao Chen. CoRR, 2024. doi:10.48550/ARXIV.2410.02710.
[71] Huming Qiu, Guanxu Chen, Mi Zhang, Min Yang. CoRR, 2024. doi:10.48550/ARXIV.2411.10329.
[72] Tim Brooks, Aleksander Holynski, Alexei A. Efros. CVPR.
[73] Shu Zhang, Xinyi Yang, Yihao Feng, Can Qin, Chia. CVPR.
[74] Tom. Efficient Estimation of Word Representations in Vector Space.
[75] Leland McInnes, John Healy. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. CoRR, 2018. arXiv:1802.03426.
[76] Saeuron: Interpretable concept unlearning in diffusion models with sparse autoencoders. arXiv preprint arXiv:2501.18052.
[77] Receler: Reliable concept erasing of text-to-image diffusion models via lightweight erasers. European Conference on Computer Vision, 2024.
[78] Concept pinpoint eraser for text-to-image diffusion models via residual attention gate. arXiv preprint arXiv:2506.22806.
[79] Ablating concepts in text-to-image diffusion models. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[80] Can Machines Help Us Answering Question 16 in Datasheets, and In Turn Reflecting on Inappropriate Content? ACM FAccT.

