Erased but Exploitable: Black-box Embedding-Aware Prompting Against Unlearned Text-to-Image Diffusion Models

AmirMahdi Sadeghzadeh; Arian Komaei Koma; Mohammad Hossein Rohban; Seyed Amir Kasaei

arxiv: 2605.26332 · v1 · pith:QJ4JIKFNnew · submitted 2026-05-25 · 💻 cs.CV · cs.AI

Erased but Exploitable: Black-box Embedding-Aware Prompting Against Unlearned Text-to-Image Diffusion Models

Arian Komaei Koma , Seyed Amir Kasaei , AmirMahdi Sadeghzadeh , Mohammad Hossein Rohban This is my paper

Pith reviewed 2026-06-29 22:12 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords black-box attackadversarial promptingtext-to-image diffusionmachine unlearningconcept erasureLLM-guided searchembedding-aware attacksafety filter bypass

0 comments

The pith

BEAP recovers unlearned concepts from text-to-image models via LLM-guided black-box prompts that evade safety filters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents BEAP as a method to attack machine-unlearned text-to-image diffusion models without access to weights or producing obvious gibberish. It uses an LLM to iteratively refine prompts through an embedding-aware search that scores unlearned concept presence, text-image alignment, and image quality. The approach yields prompts that remain undetectable while generating the erased concepts at high quality. If the method works as described, current unlearning techniques leave models vulnerable to targeted recovery with modest query effort.

Core claim

BEAP performs embedding-aware search in text space by combining multiple reward signals to guide an LLM toward adversarial prompts; this recovers unlearned concepts at over 60 percent higher attack success rate than prior black-box methods while using an average of fifteen prompts per success and keeping the prompts undetectable to safety filters.

What carries the argument

BEAP (black-box embedding-aware adversarial prompting), an iterative LLM-driven search in text space that scores prompts on unlearned concept presence, text-image alignment, and image quality.

If this is right

BEAP raises attack success rate by more than 60 percent relative to earlier black-box attacks.
Successful attacks require only an average of fifteen prompts.
Generated prompts remain below the detection threshold of safety filters.
The recovered images maintain high visual quality of the unlearned concepts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Unlearning methods may need to incorporate text-space regularization beyond weight erasure to close this recovery path.
The same reward-guided search pattern could be tested on other generative modalities such as audio or video models.
A defender could measure how many additional unlearning steps are required to drop BEAP success rate below a chosen threshold.

Load-bearing premise

The reward signals for concept presence, alignment, and quality will steer the LLM to prompts that succeed without triggering detection.

What would settle it

Run BEAP on an unlearned model, submit its output prompts to a standard safety filter, and measure whether the filter blocks more than a small fraction while the attack success rate stays above prior baselines.

Figures

Figures reproduced from arXiv: 2605.26332 by AmirMahdi Sadeghzadeh, Arian Komaei Koma, Mohammad Hossein Rohban, Seyed Amir Kasaei.

**Figure 1.** Figure 1: Overview of BEAP. (A) We first construct an adversarial direction in the embedding space by subtracting the embeddings of neutralized prompts from their original nudity-related counterparts. We then compute this direction’s similarity to the embeddings of a fixed vocabulary (e.g., the Oxford 3000) and sample the top-k most similar words, which are then provided to our LLM. (B) Conditioned on top-k words an… view at source ↗

**Figure 2.** Figure 2: Illustration of score-augmented prompt construction for [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Visualization of the text-embedding space: the unlearned [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative comparison across all five unlearning backbones evaluated in our study: UCE, ESD, MACE, SPM, and Receler. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Vulnerability of concept-removal methods. Both UCE [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗

**Figure 6.** Figure 6: Complete iteration-wise LLM chat trace used during BEAP optimization, showing instruction prompts, related terms, JSON [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗

**Figure 8.** Figure 8: Effect of the query count (Q) on BEAP performance. Across all backbones, limited exploration with Q=5 reduces ASR, while Q=10 achieves the most stable and reliable results. versity to reliably reach effective phrasing changes, consistently yielding the strongest results. Further increasing the query count to Q=15 does not provide consistent improvements. While some models maintain similar performance, oth… view at source ↗

**Figure 7.** Figure 7: Effect of vocabulary size k on BEAP performance. k=20 provides the strongest and most stable results. D.2. Effect of Query Count (Q) We evaluate how the number of LLM-generated candidate prompts per iteration affects BEAP’s performance by varying Q ∈ {5, 10, 15} [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗

**Figure 9.** Figure 9: Additional qualitative examples for nudity-related concept unlearning. For each backbone, we show outputs from the original I2P [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗

**Figure 10.** Figure 10: Additional qualitative examples for nudity-related concept unlearning. Results again showing BEAP’s ability to recover the [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗

**Figure 11.** Figure 11: Qualitative examples for the “golf ball” unlearning. [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗

read the original abstract

Machine unlearning aims to remove specific concepts from pretrained text-to-image diffusion models, yet several white- and black-box attacks have been introduced to make the model generate such unlearned concepts. These attacks, nevertheless, do not assume a realistic threat model, i.e. they either assume access to the model weights, or result in gibberish adversarial prompts that could be easily detected even through naive rule-based safeguarding. We aim to address this gap in this paper. We introduce BEAP, a black-box, embedding-aware adversarial prompting attack that leverages a large language model (LLM) to iteratively generate effective adversarial prompts and exploit such hidden vulnerabilities. BEAP performs an embedding-aware search in text space, combining multiple reward signals: unlearned concept presence, text-image alignment, and image quality, to refine generated prompts. Unlike previous attack methods, BEAP keeps its prompts undetectable to safety filters while producing high-quality images. Extensive experiments show that BEAP improves the Attack Success Rate (ASR) by more than 60% over prior methods, while requiring only an average of fifteen prompts per successful attack. Warning: This paper contains model outputs that may be offensive or upsetting in nature.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

BEAP shows a realistic black-box way to attack unlearned diffusion models with LLM prompts that stay natural and effective.

read the letter

The main point is that BEAP offers a black-box attack on unlearned text-to-image models that uses an LLM to iteratively build natural-looking prompts, achieving higher attack success than previous methods while evading detection.

The paper does a solid job laying out the embedding-aware search in text space and the three reward signals for guiding the LLM. It reports clear numbers: more than 60% better ASR and about 15 prompts on average per success. The evaluation includes tests against safety filters and comparisons to cited baselines, with details on how rewards are calculated from CLIP classifiers and standard metrics. That makes the contribution concrete and reproducible in principle.

The soft spots are mostly around the scope of the experiments. The improvement holds for the tested unlearning approaches and models, but results could vary with different unlearning strengths or newer models. The filter evasion is shown, but it might not cover every possible safeguard. These are typical for an empirical attack paper and don't undermine the central results.

This work is aimed at people studying safety mechanisms for diffusion models and adversarial attacks in that space. Anyone evaluating unlearning techniques would find the threat model and method useful.

It deserves peer review because the experiments support the claims without internal contradictions or hidden assumptions that don't hold up. I'd send it for review.

Referee Report

0 major / 2 minor

Summary. The paper presents BEAP, a black-box embedding-aware adversarial prompting technique against unlearned text-to-image diffusion models. Using an LLM, it iteratively searches for effective prompts in text space by combining rewards for unlearned concept presence (CLIP-based), text-image alignment, and image quality. The central claim is that BEAP achieves over 60% higher Attack Success Rate than previous methods, with an average of 15 prompts per successful attack, and generates prompts that remain undetectable by safety filters.

Significance. This result, if substantiated by the experiments, is significant as it addresses a realistic threat model for attacks on unlearned models, unlike prior white-box or easily detectable methods. The approach demonstrates how embedding-aware search can exploit vulnerabilities while maintaining prompt naturalness. The paper's use of multiple reward signals and black-box query interface provides a concrete method that could be reproduced, strengthening its contribution to the field of AI safety and model unlearning.

minor comments (2)

[Abstract] The abstract summarizes the method and results but lacks any mention of the specific datasets, models (e.g., Stable Diffusion versions), or unlearned concepts used in experiments; adding this would improve clarity without altering the claim.
[Method] The description of the iterative search process is clear, but the exact weighting or combination method for the three reward signals could be specified more precisely to aid reproducibility.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work, the recognition of its significance under a realistic threat model, and the recommendation for minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is entirely empirical: it describes an LLM-driven iterative search over reward signals (concept presence via CLIP, alignment, quality) to generate prompts, then reports measured ASR improvements and filter-evasion rates from black-box queries. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text or abstract. The central claims rest on external experimental outcomes rather than reducing to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no free parameters, axioms, or invented entities are specified or required for the central claim.

pith-pipeline@v0.9.1-grok · 5766 in / 1032 out tokens · 23991 ms · 2026-06-29T22:12:16.624909+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

50 extracted references · 12 canonical work pages · 4 internal anchors

[1]

Im- proving image generation with better captions

James Betker, Gabriel Goh, Li Jing, † TimBrooks, Jian- feng Wang, Linjie Li, † LongOuyang, † JuntangZhuang, † JoyceLee, † YufeiGuo, † WesamManassra, † PrafullaDhari- wal, † CaseyChu, † YunxinJiao, and Aditya Ramesh. Im- proving image generation with better captions. 1
[2]

Mitigating inappropriateness in image generation: Can there be value in reflecting the world’s ugli- ness?arXiv preprint arXiv:2305.18398, 2023

Manuel Brack, Felix Friedrich, Patrick Schramowski, and Kristian Kersting. Mitigating inappropriateness in image generation: Can there be value in reflecting the world’s ugli- ness?arXiv preprint arXiv:2305.18398, 2023. 1

work page arXiv 2023
[3]

Towards making systems for- get with machine unlearning

Yinzhi Cao and Junfeng Yang. Towards making systems for- get with machine unlearning. In2015 IEEE Symposium on Security and Privacy, pages 463–480, 2015. 1, 2

2015
[4]

Prompting4debugging: Red- teaming text-to-image diffusion models by finding problem- atic prompts.arXiv preprint arXiv:2309.06135, 2023

Zhi-Yi Chin, Chieh-Ming Jiang, Ching-Chun Huang, Pin- Yu Chen, and Wei-Chen Chiu. Prompting4debugging: Red- teaming text-to-image diffusion models by finding problem- atic prompts.arXiv preprint arXiv:2309.06135, 2023. 2

work page arXiv 2023
[5]

Stable diffusion safety checker.https:// huggingface.co/CompVis/stable-diffusion- safety-checker, 2022

CompVis. Stable diffusion safety checker.https:// huggingface.co/CompVis/stable-diffusion- safety-checker, 2022. Accessed: 2025-11-12. 1

2022
[6]

Divide-and-conquer at- tack: Harnessing the power of llm to bypass safety filters of text-to-image models.arXiv preprint arXiv:2312.07130,

Yimo Deng and Huangxun Chen. Divide-and-conquer at- tack: Harnessing the power of llm to bypass safety filters of text-to-image models.arXiv preprint arXiv:2312.07130,

work page arXiv
[7]

Bert: Pre-training of deep bidirectional trans- formers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional trans- formers for language understanding. InProceedings of the 2019 conference of the North American chapter of the asso- ciation for computational linguistics: human language tech- nologies, volume 1 (long and short papers), pages 4171– 4186, 2019. 11

2019
[8]

Aesthetic predictor v2.5.https://github

discus0434. Aesthetic predictor v2.5.https://github. com / discus0434 / aesthetic - predictor - v2 - 5/, 2024. GitHub repository. 3

2024
[9]

Jailbreaking text-to-image models with llm- based agents.arXiv preprint arXiv:2408.00523, 2024

Yingkai Dong, Zheng Li, Xiangtao Meng, Ning Yu, and Shanqing Guo. Jailbreaking text-to-image models with llm- based agents.arXiv preprint arXiv:2408.00523, 2024. 1, 2

work page arXiv 2024
[10]

Scaling recti- fied flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. InForty-first International Conference on Machine Learn- ing, 2024. 1

2024
[11]

Reno: Enhancing one-step text-to-image models through reward-based noise optimiza- tion.Advances in Neural Information Processing Systems, 37:125487–125519, 2024

Luca Eyring, Shyamgopal Karthik, Karsten Roth, Alexey Dosovitskiy, and Zeynep Akata. Reno: Enhancing one-step text-to-image models through reward-based noise optimiza- tion.Advances in Neural Information Processing Systems, 37:125487–125519, 2024. 3

2024
[12]

SalUn: Empowering Machine Unlearning via Gradient-based Weight Saliency in Both Image Classification and Generation

Chongyu Fan, Jiancheng Liu, Yihua Zhang, Eric Wong, Den- nis Wei, and Sijia Liu. Salun: Empowering machine unlearn- ing via gradient-based weight saliency in both image classi- fication and generation.arXiv preprint arXiv:2310.12508,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Erasing concepts from diffusion models

Rohit Gandikota, Joanna Materzynska, Jaden Fiotto- Kaufman, and David Bau. Erasing concepts from diffusion models. InProceedings of the IEEE/CVF international con- ference on computer vision, pages 2426–2436, 2023. 1, 2, 5, 11

2023
[14]

Unified concept editing in diffusion models

Rohit Gandikota, Hadas Orgad, Yonatan Belinkov, Joanna Materzy´nska, and David Bau. Unified concept editing in diffusion models. InProceedings of the IEEE/CVF Win- ter Conference on Applications of Computer Vision, pages 5111–5120, 2024. 2, 5, 11

2024
[15]

The illusion of unlearn- ing: The unstable nature of machine unlearning in text-to- image diffusion models

Naveen George, Karthik Nandan Dasaraju, Rutheesh Reddy Chittepu, and Konda Reddy Mopuri. The illusion of unlearn- ing: The unstable nature of machine unlearning in text-to- image diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13393–13402, 2025. 1, 2

2025
[16]

Making ai forget you: Data deletion in ma- chine learning.Advances in neural information processing systems, 32, 2019

Antonio Ginart, Melody Guan, Gregory Valiant, and James Y Zou. Making ai forget you: Data deletion in ma- chine learning.Advances in neural information processing systems, 32, 2019. 2

2019
[17]

Detoxify

Laura Hanu and Unitary team. Detoxify. Github. https://github.com/unitaryai/detoxify, 2020. 1

2020
[18]

Re- celer: Reliable concept erasing of text-to-image diffusion models via lightweight erasers

Chi-Pin Huang, Kai-Po Chang, Chung-Ting Tsai, Yung- Hsuan Lai, Fu-En Yang, and Yu-Chiang Frank Wang. Re- celer: Reliable concept erasing of text-to-image diffusion models via lightweight erasers. InEuropean Conference on Computer Vision, pages 360–376. Springer, 2024. 2, 5, 11

2024
[19]

Perception-guided jailbreak against text-to-image models

Yihao Huang, Le Liang, Tianlin Li, Xiaojun Jia, Run Wang, Weikai Miao, Geguang Pu, and Yang Liu. Perception-guided jailbreak against text-to-image models. InProceedings of the AAAI Conference on Artificial Intelligence, pages 26238– 26247, 2025. 1

2025
[20]

Openclip, 2021

Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Stella Biderman, Deepak Ganguli, et al. Openclip, 2021. 14

2021
[21]

Gibberish detector: High-accuracy text clas- sification model, 2021

Madhur Jindal. Gibberish detector: High-accuracy text clas- sification model, 2021. 2, 5, 12

2021
[22]

Evaluating the evaluators: Metrics for compositional text-to-image genera- tion, 2025

Seyed Amir Kasaei, Ali Aghayari, Arash Marioriyad, Niki Sepasian, MohammadAmin Fazli, Mahdieh Soleymani Baghshah, and Mohammad Hossein Rohban. Evaluating the evaluators: Metrics for compositional text-to-image genera- tion, 2025. 3

2025
[23]

Ablating con- cepts in text-to-image diffusion models

Nupur Kumari, Bingliang Zhang, Sheng-Yu Wang, Eli Shechtman, Richard Zhang, and Jun-Yan Zhu. Ablating con- cepts in text-to-image diffusion models. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 22691–22702, 2023. 1, 2

2023
[24]

Nsfw text classifier: A model for classify- ing nsfw content in text.https://huggingface.co/ michellejieli / NSFW _ text _ classifier, 2022

Michelle Li. Nsfw text classifier: A model for classify- ing nsfw content in text.https://huggingface.co/ michellejieli / NSFW _ text _ classifier, 2022. Accessed: 2025-11-12. 1

2022
[25]

Safegen: Mitigating sexually explicit content generation in text-to-image models

Xinfeng Li, Yuchen Yang, Jiangyi Deng, Chen Yan, Yan- jiao Chen, Xiaoyu Ji, and Wenyuan Xu. Safegen: Mitigating sexually explicit content generation in text-to-image models. InProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pages 4807–4821,

2024
[26]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu 9 Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024. 5, 11

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Mace: Mass concept erasure in diffu- sion models

Shilin Lu, Zilan Wang, Leyang Li, Yanzhu Liu, and Adams Wai-Kin Kong. Mace: Mass concept erasure in diffu- sion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6430– 6440, 2024. 1, 2, 5, 11

2024
[28]

One-dimensional adapter to rule them all: Concepts diffusion models and erasing applications

Mengyao Lyu, Yuhong Yang, Haiwen Hong, Hui Chen, Xuan Jin, Yuan He, Hui Xue, Jungong Han, and Guiguang Ding. One-dimensional adapter to rule them all: Concepts diffusion models and erasing applications. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7559–7568, 2024. 1, 2, 5, 11

2024
[29]

A survey of machine unlearning.ACM Transactions on Intelligent Systems and Technology, 16(5): 1–46, 2025

Thanh Tam Nguyen, Thanh Trung Huynh, Zhao Ren, Phi Le Nguyen, Alan Wee-Chung Liew, Hongzhi Yin, and Quoc Viet Hung Nguyen. A survey of machine unlearning.ACM Transactions on Intelligent Systems and Technology, 16(5): 1–46, 2025. 2

2025
[30]

Nudenet: A nudity detection and cen- soring library.https://github.com/notAI-tech/ NudeNet, 2023

NudeNet Developers. Nudenet: A nudity detection and cen- soring library.https://github.com/notAI-tech/ NudeNet, 2023. GitHub repository. 3

2023
[31]

Chatgpt (gpt-5.1).https://chat.openai

OpenAI. Chatgpt (gpt-5.1).https://chat.openai. com/, 2025. Large language model accessed via web inter- face. 7

2025
[32]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion mod- els for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021. 11, 14

2021
[35]

Red-teaming the stable diffusion safety filter.arXiv preprint arXiv:2210.04610, 2022

Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, and Florian Tram`er. Red-teaming the stable diffusion safety filter.arXiv preprint arXiv:2210.04610, 2022. 1

work page arXiv 2022
[36]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 1, 5, 11

2022
[37]

Safe latent diffusion: Mitigating inappro- priate degeneration in diffusion models

Patrick Schramowski, Manuel Brack, Bj ¨orn Deiseroth, and Kristian Kersting. Safe latent diffusion: Mitigating inappro- priate degeneration in diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22522–22531, 2023. 1, 5, 11

2023
[38]

Defining and characterizing reward hacking

Joar Skalse, Nikolaus HR Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward hacking. InProceedings of the 36th International Con- ference on Neural Information Processing Systems, pages 9460–9471, 2022. 3

2022
[39]

Midjourney: A proprietary artificial in- telligence program for creating images from textual descrip- tions, 2023

MidJourney Team. Midjourney: A proprietary artificial in- telligence program for creating images from textual descrip- tions, 2023. Available athttps://www.midjourney. com. 1

2023
[40]

Identifying and eliminating csam in generative ml training data and models

David Thiel. Identifying and eliminating csam in generative ml training data and models. Stanford Digital Repository,
[41]

Available athttps://purl.stanford.edu/ kh752sm9123. 1
[42]

Ring-a-bell! how reliable are concept removal meth- ods for diffusion models?arXiv preprint arXiv:2310.10012,

Yu-Lin Tsai, Chia-Yi Hsu, Chulin Xie, Chih-Hsun Lin, Jia- You Chen, Bo Li, Pin-Yu Chen, Chia-Mu Yu, and Chun-Ying Huang. Ring-a-bell! how reliable are concept removal meth- ods for diffusion models?arXiv preprint arXiv:2310.10012,

work page arXiv
[43]

Erasediff: Erasing data influence in diffusion models.arXiv preprint arXiv:2401.05779, 2024

Jing Wu, Trung Le, Munawar Hayat, and Mehrtash Harandi. Erasediff: Erasing data influence in diffusion models.arXiv preprint arXiv:2401.05779, 2024. 2

work page arXiv 2024
[44]

Imagere- ward: Learning and evaluating human preferences for text- to-image generation.Advances in Neural Information Pro- cessing Systems, 36:15903–15935, 2023

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagere- ward: Learning and evaluating human preferences for text- to-image generation.Advances in Neural Information Pro- cessing Systems, 36:15903–15935, 2023. 3

2023
[45]

Mma-diffusion: Multimodal attack on diffusion models

Yijun Yang, Ruiyuan Gao, Xiaosen Wang, Tsung-Yi Ho, Nan Xu, and Qiang Xu. Mma-diffusion: Multimodal attack on diffusion models. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 7737–7746, 2024. 2

2024
[46]

Sneakyprompt: Jailbreaking text-to-image generative models

Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, and Yinzhi Cao. Sneakyprompt: Jailbreaking text-to-image generative models. In2024 IEEE symposium on security and privacy (SP), pages 897–912. IEEE, 2024. 1, 2

2024
[47]

Forget-me-not: Learning to for- get in text-to-image diffusion models

Gong Zhang, Kai Wang, Xingqian Xu, Zhangyang Wang, and Humphrey Shi. Forget-me-not: Learning to for- get in text-to-image diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1755–1764, 2024. 1

2024
[48]

To gener- ate or not? safety-driven unlearned diffusion models are still easy to generate unsafe images

Yimeng Zhang, Jinghan Jia, Xin Chen, Aochuan Chen, Yi- hua Zhang, Jiancheng Liu, Ke Ding, and Sijia Liu. To gener- ate or not? safety-driven unlearned diffusion models are still easy to generate unsafe images... for now. InEuropean Con- ference on Computer Vision, pages 385–403. Springer, 2024. 2

2024
[49]

Shieldlm: Empowering llms as aligned, cus- tomizable and explainable safety detectors.arXiv preprint arXiv:2402.16444, 2024

Zhexin Zhang, Yida Lu, Jingyuan Ma, Di Zhang, Rui Li, Pei Ke, Hao Sun, Lei Sha, Zhifang Sui, Hongning Wang, et al. Shieldlm: Empowering llms as aligned, cus- tomizable and explainable safety detectors.arXiv preprint arXiv:2402.16444, 2024. 1

work page arXiv 2024
[50]

golf ball

Haomin Zhuang, Yihua Zhang, and Sijia Liu. A pilot study of query-free adversarial attack against stable diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 2385–2392, 2023. 1, 2 10 Appendix A. Vulnerability Analysis We assess the vulnerability of two concept-removal meth- ods,UCEandReceler, to simple p...

2023

[1] [1]

Im- proving image generation with better captions

James Betker, Gabriel Goh, Li Jing, † TimBrooks, Jian- feng Wang, Linjie Li, † LongOuyang, † JuntangZhuang, † JoyceLee, † YufeiGuo, † WesamManassra, † PrafullaDhari- wal, † CaseyChu, † YunxinJiao, and Aditya Ramesh. Im- proving image generation with better captions. 1

[2] [2]

Mitigating inappropriateness in image generation: Can there be value in reflecting the world’s ugli- ness?arXiv preprint arXiv:2305.18398, 2023

Manuel Brack, Felix Friedrich, Patrick Schramowski, and Kristian Kersting. Mitigating inappropriateness in image generation: Can there be value in reflecting the world’s ugli- ness?arXiv preprint arXiv:2305.18398, 2023. 1

work page arXiv 2023

[3] [3]

Towards making systems for- get with machine unlearning

Yinzhi Cao and Junfeng Yang. Towards making systems for- get with machine unlearning. In2015 IEEE Symposium on Security and Privacy, pages 463–480, 2015. 1, 2

2015

[4] [4]

Prompting4debugging: Red- teaming text-to-image diffusion models by finding problem- atic prompts.arXiv preprint arXiv:2309.06135, 2023

Zhi-Yi Chin, Chieh-Ming Jiang, Ching-Chun Huang, Pin- Yu Chen, and Wei-Chen Chiu. Prompting4debugging: Red- teaming text-to-image diffusion models by finding problem- atic prompts.arXiv preprint arXiv:2309.06135, 2023. 2

work page arXiv 2023

[5] [5]

Stable diffusion safety checker.https:// huggingface.co/CompVis/stable-diffusion- safety-checker, 2022

CompVis. Stable diffusion safety checker.https:// huggingface.co/CompVis/stable-diffusion- safety-checker, 2022. Accessed: 2025-11-12. 1

2022

[6] [6]

Divide-and-conquer at- tack: Harnessing the power of llm to bypass safety filters of text-to-image models.arXiv preprint arXiv:2312.07130,

Yimo Deng and Huangxun Chen. Divide-and-conquer at- tack: Harnessing the power of llm to bypass safety filters of text-to-image models.arXiv preprint arXiv:2312.07130,

work page arXiv

[7] [7]

Bert: Pre-training of deep bidirectional trans- formers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional trans- formers for language understanding. InProceedings of the 2019 conference of the North American chapter of the asso- ciation for computational linguistics: human language tech- nologies, volume 1 (long and short papers), pages 4171– 4186, 2019. 11

2019

[8] [8]

Aesthetic predictor v2.5.https://github

discus0434. Aesthetic predictor v2.5.https://github. com / discus0434 / aesthetic - predictor - v2 - 5/, 2024. GitHub repository. 3

2024

[9] [9]

Jailbreaking text-to-image models with llm- based agents.arXiv preprint arXiv:2408.00523, 2024

Yingkai Dong, Zheng Li, Xiangtao Meng, Ning Yu, and Shanqing Guo. Jailbreaking text-to-image models with llm- based agents.arXiv preprint arXiv:2408.00523, 2024. 1, 2

work page arXiv 2024

[10] [10]

Scaling recti- fied flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. InForty-first International Conference on Machine Learn- ing, 2024. 1

2024

[11] [11]

Reno: Enhancing one-step text-to-image models through reward-based noise optimiza- tion.Advances in Neural Information Processing Systems, 37:125487–125519, 2024

Luca Eyring, Shyamgopal Karthik, Karsten Roth, Alexey Dosovitskiy, and Zeynep Akata. Reno: Enhancing one-step text-to-image models through reward-based noise optimiza- tion.Advances in Neural Information Processing Systems, 37:125487–125519, 2024. 3

2024

[12] [12]

SalUn: Empowering Machine Unlearning via Gradient-based Weight Saliency in Both Image Classification and Generation

Chongyu Fan, Jiancheng Liu, Yihua Zhang, Eric Wong, Den- nis Wei, and Sijia Liu. Salun: Empowering machine unlearn- ing via gradient-based weight saliency in both image classi- fication and generation.arXiv preprint arXiv:2310.12508,

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

Erasing concepts from diffusion models

Rohit Gandikota, Joanna Materzynska, Jaden Fiotto- Kaufman, and David Bau. Erasing concepts from diffusion models. InProceedings of the IEEE/CVF international con- ference on computer vision, pages 2426–2436, 2023. 1, 2, 5, 11

2023

[14] [14]

Unified concept editing in diffusion models

Rohit Gandikota, Hadas Orgad, Yonatan Belinkov, Joanna Materzy´nska, and David Bau. Unified concept editing in diffusion models. InProceedings of the IEEE/CVF Win- ter Conference on Applications of Computer Vision, pages 5111–5120, 2024. 2, 5, 11

2024

[15] [15]

The illusion of unlearn- ing: The unstable nature of machine unlearning in text-to- image diffusion models

Naveen George, Karthik Nandan Dasaraju, Rutheesh Reddy Chittepu, and Konda Reddy Mopuri. The illusion of unlearn- ing: The unstable nature of machine unlearning in text-to- image diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13393–13402, 2025. 1, 2

2025

[16] [16]

Making ai forget you: Data deletion in ma- chine learning.Advances in neural information processing systems, 32, 2019

Antonio Ginart, Melody Guan, Gregory Valiant, and James Y Zou. Making ai forget you: Data deletion in ma- chine learning.Advances in neural information processing systems, 32, 2019. 2

2019

[17] [17]

Detoxify

Laura Hanu and Unitary team. Detoxify. Github. https://github.com/unitaryai/detoxify, 2020. 1

2020

[18] [18]

Re- celer: Reliable concept erasing of text-to-image diffusion models via lightweight erasers

Chi-Pin Huang, Kai-Po Chang, Chung-Ting Tsai, Yung- Hsuan Lai, Fu-En Yang, and Yu-Chiang Frank Wang. Re- celer: Reliable concept erasing of text-to-image diffusion models via lightweight erasers. InEuropean Conference on Computer Vision, pages 360–376. Springer, 2024. 2, 5, 11

2024

[19] [19]

Perception-guided jailbreak against text-to-image models

Yihao Huang, Le Liang, Tianlin Li, Xiaojun Jia, Run Wang, Weikai Miao, Geguang Pu, and Yang Liu. Perception-guided jailbreak against text-to-image models. InProceedings of the AAAI Conference on Artificial Intelligence, pages 26238– 26247, 2025. 1

2025

[20] [20]

Openclip, 2021

Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Stella Biderman, Deepak Ganguli, et al. Openclip, 2021. 14

2021

[21] [21]

Gibberish detector: High-accuracy text clas- sification model, 2021

Madhur Jindal. Gibberish detector: High-accuracy text clas- sification model, 2021. 2, 5, 12

2021

[22] [22]

Evaluating the evaluators: Metrics for compositional text-to-image genera- tion, 2025

Seyed Amir Kasaei, Ali Aghayari, Arash Marioriyad, Niki Sepasian, MohammadAmin Fazli, Mahdieh Soleymani Baghshah, and Mohammad Hossein Rohban. Evaluating the evaluators: Metrics for compositional text-to-image genera- tion, 2025. 3

2025

[23] [23]

Ablating con- cepts in text-to-image diffusion models

Nupur Kumari, Bingliang Zhang, Sheng-Yu Wang, Eli Shechtman, Richard Zhang, and Jun-Yan Zhu. Ablating con- cepts in text-to-image diffusion models. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 22691–22702, 2023. 1, 2

2023

[24] [24]

Nsfw text classifier: A model for classify- ing nsfw content in text.https://huggingface.co/ michellejieli / NSFW _ text _ classifier, 2022

Michelle Li. Nsfw text classifier: A model for classify- ing nsfw content in text.https://huggingface.co/ michellejieli / NSFW _ text _ classifier, 2022. Accessed: 2025-11-12. 1

2022

[25] [25]

Safegen: Mitigating sexually explicit content generation in text-to-image models

Xinfeng Li, Yuchen Yang, Jiangyi Deng, Chen Yan, Yan- jiao Chen, Xiaoyu Ji, and Wenyuan Xu. Safegen: Mitigating sexually explicit content generation in text-to-image models. InProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pages 4807–4821,

2024

[26] [26]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu 9 Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024. 5, 11

work page internal anchor Pith review Pith/arXiv arXiv 2024

[27] [27]

Mace: Mass concept erasure in diffu- sion models

Shilin Lu, Zilan Wang, Leyang Li, Yanzhu Liu, and Adams Wai-Kin Kong. Mace: Mass concept erasure in diffu- sion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6430– 6440, 2024. 1, 2, 5, 11

2024

[28] [28]

One-dimensional adapter to rule them all: Concepts diffusion models and erasing applications

Mengyao Lyu, Yuhong Yang, Haiwen Hong, Hui Chen, Xuan Jin, Yuan He, Hui Xue, Jungong Han, and Guiguang Ding. One-dimensional adapter to rule them all: Concepts diffusion models and erasing applications. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7559–7568, 2024. 1, 2, 5, 11

2024

[29] [29]

A survey of machine unlearning.ACM Transactions on Intelligent Systems and Technology, 16(5): 1–46, 2025

Thanh Tam Nguyen, Thanh Trung Huynh, Zhao Ren, Phi Le Nguyen, Alan Wee-Chung Liew, Hongzhi Yin, and Quoc Viet Hung Nguyen. A survey of machine unlearning.ACM Transactions on Intelligent Systems and Technology, 16(5): 1–46, 2025. 2

2025

[30] [30]

Nudenet: A nudity detection and cen- soring library.https://github.com/notAI-tech/ NudeNet, 2023

NudeNet Developers. Nudenet: A nudity detection and cen- soring library.https://github.com/notAI-tech/ NudeNet, 2023. GitHub repository. 3

2023

[31] [31]

Chatgpt (gpt-5.1).https://chat.openai

OpenAI. Chatgpt (gpt-5.1).https://chat.openai. com/, 2025. Large language model accessed via web inter- face. 7

2025

[32] [32]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023

[33] [33]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion mod- els for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023

[34] [34]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021. 11, 14

2021

[35] [35]

Red-teaming the stable diffusion safety filter.arXiv preprint arXiv:2210.04610, 2022

Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, and Florian Tram`er. Red-teaming the stable diffusion safety filter.arXiv preprint arXiv:2210.04610, 2022. 1

work page arXiv 2022

[36] [36]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 1, 5, 11

2022

[37] [37]

Safe latent diffusion: Mitigating inappro- priate degeneration in diffusion models

Patrick Schramowski, Manuel Brack, Bj ¨orn Deiseroth, and Kristian Kersting. Safe latent diffusion: Mitigating inappro- priate degeneration in diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22522–22531, 2023. 1, 5, 11

2023

[38] [38]

Defining and characterizing reward hacking

Joar Skalse, Nikolaus HR Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward hacking. InProceedings of the 36th International Con- ference on Neural Information Processing Systems, pages 9460–9471, 2022. 3

2022

[39] [39]

Midjourney: A proprietary artificial in- telligence program for creating images from textual descrip- tions, 2023

MidJourney Team. Midjourney: A proprietary artificial in- telligence program for creating images from textual descrip- tions, 2023. Available athttps://www.midjourney. com. 1

2023

[40] [40]

Identifying and eliminating csam in generative ml training data and models

David Thiel. Identifying and eliminating csam in generative ml training data and models. Stanford Digital Repository,

[41] [41]

Available athttps://purl.stanford.edu/ kh752sm9123. 1

[42] [42]

Ring-a-bell! how reliable are concept removal meth- ods for diffusion models?arXiv preprint arXiv:2310.10012,

Yu-Lin Tsai, Chia-Yi Hsu, Chulin Xie, Chih-Hsun Lin, Jia- You Chen, Bo Li, Pin-Yu Chen, Chia-Mu Yu, and Chun-Ying Huang. Ring-a-bell! how reliable are concept removal meth- ods for diffusion models?arXiv preprint arXiv:2310.10012,

work page arXiv

[43] [43]

Erasediff: Erasing data influence in diffusion models.arXiv preprint arXiv:2401.05779, 2024

Jing Wu, Trung Le, Munawar Hayat, and Mehrtash Harandi. Erasediff: Erasing data influence in diffusion models.arXiv preprint arXiv:2401.05779, 2024. 2

work page arXiv 2024

[44] [44]

Imagere- ward: Learning and evaluating human preferences for text- to-image generation.Advances in Neural Information Pro- cessing Systems, 36:15903–15935, 2023

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagere- ward: Learning and evaluating human preferences for text- to-image generation.Advances in Neural Information Pro- cessing Systems, 36:15903–15935, 2023. 3

2023

[45] [45]

Mma-diffusion: Multimodal attack on diffusion models

Yijun Yang, Ruiyuan Gao, Xiaosen Wang, Tsung-Yi Ho, Nan Xu, and Qiang Xu. Mma-diffusion: Multimodal attack on diffusion models. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 7737–7746, 2024. 2

2024

[46] [46]

Sneakyprompt: Jailbreaking text-to-image generative models

Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, and Yinzhi Cao. Sneakyprompt: Jailbreaking text-to-image generative models. In2024 IEEE symposium on security and privacy (SP), pages 897–912. IEEE, 2024. 1, 2

2024

[47] [47]

Forget-me-not: Learning to for- get in text-to-image diffusion models

Gong Zhang, Kai Wang, Xingqian Xu, Zhangyang Wang, and Humphrey Shi. Forget-me-not: Learning to for- get in text-to-image diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1755–1764, 2024. 1

2024

[48] [48]

To gener- ate or not? safety-driven unlearned diffusion models are still easy to generate unsafe images

Yimeng Zhang, Jinghan Jia, Xin Chen, Aochuan Chen, Yi- hua Zhang, Jiancheng Liu, Ke Ding, and Sijia Liu. To gener- ate or not? safety-driven unlearned diffusion models are still easy to generate unsafe images... for now. InEuropean Con- ference on Computer Vision, pages 385–403. Springer, 2024. 2

2024

[49] [49]

Shieldlm: Empowering llms as aligned, cus- tomizable and explainable safety detectors.arXiv preprint arXiv:2402.16444, 2024

Zhexin Zhang, Yida Lu, Jingyuan Ma, Di Zhang, Rui Li, Pei Ke, Hao Sun, Lei Sha, Zhifang Sui, Hongning Wang, et al. Shieldlm: Empowering llms as aligned, cus- tomizable and explainable safety detectors.arXiv preprint arXiv:2402.16444, 2024. 1

work page arXiv 2024

[50] [50]

golf ball

Haomin Zhuang, Yihua Zhang, and Sijia Liu. A pilot study of query-free adversarial attack against stable diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 2385–2392, 2023. 1, 2 10 Appendix A. Vulnerability Analysis We assess the vulnerability of two concept-removal meth- ods,UCEandReceler, to simple p...

2023