Erased but Exploitable: Black-box Embedding-Aware Prompting Against Unlearned Text-to-Image Diffusion Models
Pith reviewed 2026-06-29 22:12 UTC · model grok-4.3
The pith
BEAP recovers unlearned concepts from text-to-image models via LLM-guided black-box prompts that evade safety filters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BEAP performs embedding-aware search in text space by combining multiple reward signals to guide an LLM toward adversarial prompts; this recovers unlearned concepts at over 60 percent higher attack success rate than prior black-box methods while using an average of fifteen prompts per success and keeping the prompts undetectable to safety filters.
What carries the argument
BEAP (black-box embedding-aware adversarial prompting), an iterative LLM-driven search in text space that scores prompts on unlearned concept presence, text-image alignment, and image quality.
If this is right
- BEAP raises attack success rate by more than 60 percent relative to earlier black-box attacks.
- Successful attacks require only an average of fifteen prompts.
- Generated prompts remain below the detection threshold of safety filters.
- The recovered images maintain high visual quality of the unlearned concepts.
Where Pith is reading between the lines
- Unlearning methods may need to incorporate text-space regularization beyond weight erasure to close this recovery path.
- The same reward-guided search pattern could be tested on other generative modalities such as audio or video models.
- A defender could measure how many additional unlearning steps are required to drop BEAP success rate below a chosen threshold.
Load-bearing premise
The reward signals for concept presence, alignment, and quality will steer the LLM to prompts that succeed without triggering detection.
What would settle it
Run BEAP on an unlearned model, submit its output prompts to a standard safety filter, and measure whether the filter blocks more than a small fraction while the attack success rate stays above prior baselines.
Figures
read the original abstract
Machine unlearning aims to remove specific concepts from pretrained text-to-image diffusion models, yet several white- and black-box attacks have been introduced to make the model generate such unlearned concepts. These attacks, nevertheless, do not assume a realistic threat model, i.e. they either assume access to the model weights, or result in gibberish adversarial prompts that could be easily detected even through naive rule-based safeguarding. We aim to address this gap in this paper. We introduce BEAP, a black-box, embedding-aware adversarial prompting attack that leverages a large language model (LLM) to iteratively generate effective adversarial prompts and exploit such hidden vulnerabilities. BEAP performs an embedding-aware search in text space, combining multiple reward signals: unlearned concept presence, text-image alignment, and image quality, to refine generated prompts. Unlike previous attack methods, BEAP keeps its prompts undetectable to safety filters while producing high-quality images. Extensive experiments show that BEAP improves the Attack Success Rate (ASR) by more than 60% over prior methods, while requiring only an average of fifteen prompts per successful attack. Warning: This paper contains model outputs that may be offensive or upsetting in nature.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents BEAP, a black-box embedding-aware adversarial prompting technique against unlearned text-to-image diffusion models. Using an LLM, it iteratively searches for effective prompts in text space by combining rewards for unlearned concept presence (CLIP-based), text-image alignment, and image quality. The central claim is that BEAP achieves over 60% higher Attack Success Rate than previous methods, with an average of 15 prompts per successful attack, and generates prompts that remain undetectable by safety filters.
Significance. This result, if substantiated by the experiments, is significant as it addresses a realistic threat model for attacks on unlearned models, unlike prior white-box or easily detectable methods. The approach demonstrates how embedding-aware search can exploit vulnerabilities while maintaining prompt naturalness. The paper's use of multiple reward signals and black-box query interface provides a concrete method that could be reproduced, strengthening its contribution to the field of AI safety and model unlearning.
minor comments (2)
- [Abstract] The abstract summarizes the method and results but lacks any mention of the specific datasets, models (e.g., Stable Diffusion versions), or unlearned concepts used in experiments; adding this would improve clarity without altering the claim.
- [Method] The description of the iterative search process is clear, but the exact weighting or combination method for the three reward signals could be specified more precisely to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work, the recognition of its significance under a realistic threat model, and the recommendation for minor revision. No specific major comments were raised in the report.
Circularity Check
No significant circularity
full rationale
The paper is entirely empirical: it describes an LLM-driven iterative search over reward signals (concept presence via CLIP, alignment, quality) to generate prompts, then reports measured ASR improvements and filter-evasion rates from black-box queries. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text or abstract. The central claims rest on external experimental outcomes rather than reducing to inputs by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Im- proving image generation with better captions
James Betker, Gabriel Goh, Li Jing, † TimBrooks, Jian- feng Wang, Linjie Li, † LongOuyang, † JuntangZhuang, † JoyceLee, † YufeiGuo, † WesamManassra, † PrafullaDhari- wal, † CaseyChu, † YunxinJiao, and Aditya Ramesh. Im- proving image generation with better captions. 1
-
[2]
Manuel Brack, Felix Friedrich, Patrick Schramowski, and Kristian Kersting. Mitigating inappropriateness in image generation: Can there be value in reflecting the world’s ugli- ness?arXiv preprint arXiv:2305.18398, 2023. 1
-
[3]
Towards making systems for- get with machine unlearning
Yinzhi Cao and Junfeng Yang. Towards making systems for- get with machine unlearning. In2015 IEEE Symposium on Security and Privacy, pages 463–480, 2015. 1, 2
2015
-
[4]
Zhi-Yi Chin, Chieh-Ming Jiang, Ching-Chun Huang, Pin- Yu Chen, and Wei-Chen Chiu. Prompting4debugging: Red- teaming text-to-image diffusion models by finding problem- atic prompts.arXiv preprint arXiv:2309.06135, 2023. 2
-
[5]
Stable diffusion safety checker.https:// huggingface.co/CompVis/stable-diffusion- safety-checker, 2022
CompVis. Stable diffusion safety checker.https:// huggingface.co/CompVis/stable-diffusion- safety-checker, 2022. Accessed: 2025-11-12. 1
2022
-
[6]
Yimo Deng and Huangxun Chen. Divide-and-conquer at- tack: Harnessing the power of llm to bypass safety filters of text-to-image models.arXiv preprint arXiv:2312.07130,
-
[7]
Bert: Pre-training of deep bidirectional trans- formers for language understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional trans- formers for language understanding. InProceedings of the 2019 conference of the North American chapter of the asso- ciation for computational linguistics: human language tech- nologies, volume 1 (long and short papers), pages 4171– 4186, 2019. 11
2019
-
[8]
Aesthetic predictor v2.5.https://github
discus0434. Aesthetic predictor v2.5.https://github. com / discus0434 / aesthetic - predictor - v2 - 5/, 2024. GitHub repository. 3
2024
-
[9]
Jailbreaking text-to-image models with llm- based agents.arXiv preprint arXiv:2408.00523, 2024
Yingkai Dong, Zheng Li, Xiangtao Meng, Ning Yu, and Shanqing Guo. Jailbreaking text-to-image models with llm- based agents.arXiv preprint arXiv:2408.00523, 2024. 1, 2
-
[10]
Scaling recti- fied flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. InForty-first International Conference on Machine Learn- ing, 2024. 1
2024
-
[11]
Reno: Enhancing one-step text-to-image models through reward-based noise optimiza- tion.Advances in Neural Information Processing Systems, 37:125487–125519, 2024
Luca Eyring, Shyamgopal Karthik, Karsten Roth, Alexey Dosovitskiy, and Zeynep Akata. Reno: Enhancing one-step text-to-image models through reward-based noise optimiza- tion.Advances in Neural Information Processing Systems, 37:125487–125519, 2024. 3
2024
-
[12]
Chongyu Fan, Jiancheng Liu, Yihua Zhang, Eric Wong, Den- nis Wei, and Sijia Liu. Salun: Empowering machine unlearn- ing via gradient-based weight saliency in both image classi- fication and generation.arXiv preprint arXiv:2310.12508,
work page internal anchor Pith review Pith/arXiv arXiv
-
[13]
Erasing concepts from diffusion models
Rohit Gandikota, Joanna Materzynska, Jaden Fiotto- Kaufman, and David Bau. Erasing concepts from diffusion models. InProceedings of the IEEE/CVF international con- ference on computer vision, pages 2426–2436, 2023. 1, 2, 5, 11
2023
-
[14]
Unified concept editing in diffusion models
Rohit Gandikota, Hadas Orgad, Yonatan Belinkov, Joanna Materzy´nska, and David Bau. Unified concept editing in diffusion models. InProceedings of the IEEE/CVF Win- ter Conference on Applications of Computer Vision, pages 5111–5120, 2024. 2, 5, 11
2024
-
[15]
The illusion of unlearn- ing: The unstable nature of machine unlearning in text-to- image diffusion models
Naveen George, Karthik Nandan Dasaraju, Rutheesh Reddy Chittepu, and Konda Reddy Mopuri. The illusion of unlearn- ing: The unstable nature of machine unlearning in text-to- image diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13393–13402, 2025. 1, 2
2025
-
[16]
Making ai forget you: Data deletion in ma- chine learning.Advances in neural information processing systems, 32, 2019
Antonio Ginart, Melody Guan, Gregory Valiant, and James Y Zou. Making ai forget you: Data deletion in ma- chine learning.Advances in neural information processing systems, 32, 2019. 2
2019
-
[17]
Detoxify
Laura Hanu and Unitary team. Detoxify. Github. https://github.com/unitaryai/detoxify, 2020. 1
2020
-
[18]
Re- celer: Reliable concept erasing of text-to-image diffusion models via lightweight erasers
Chi-Pin Huang, Kai-Po Chang, Chung-Ting Tsai, Yung- Hsuan Lai, Fu-En Yang, and Yu-Chiang Frank Wang. Re- celer: Reliable concept erasing of text-to-image diffusion models via lightweight erasers. InEuropean Conference on Computer Vision, pages 360–376. Springer, 2024. 2, 5, 11
2024
-
[19]
Perception-guided jailbreak against text-to-image models
Yihao Huang, Le Liang, Tianlin Li, Xiaojun Jia, Run Wang, Weikai Miao, Geguang Pu, and Yang Liu. Perception-guided jailbreak against text-to-image models. InProceedings of the AAAI Conference on Artificial Intelligence, pages 26238– 26247, 2025. 1
2025
-
[20]
Openclip, 2021
Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Stella Biderman, Deepak Ganguli, et al. Openclip, 2021. 14
2021
-
[21]
Gibberish detector: High-accuracy text clas- sification model, 2021
Madhur Jindal. Gibberish detector: High-accuracy text clas- sification model, 2021. 2, 5, 12
2021
-
[22]
Evaluating the evaluators: Metrics for compositional text-to-image genera- tion, 2025
Seyed Amir Kasaei, Ali Aghayari, Arash Marioriyad, Niki Sepasian, MohammadAmin Fazli, Mahdieh Soleymani Baghshah, and Mohammad Hossein Rohban. Evaluating the evaluators: Metrics for compositional text-to-image genera- tion, 2025. 3
2025
-
[23]
Ablating con- cepts in text-to-image diffusion models
Nupur Kumari, Bingliang Zhang, Sheng-Yu Wang, Eli Shechtman, Richard Zhang, and Jun-Yan Zhu. Ablating con- cepts in text-to-image diffusion models. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 22691–22702, 2023. 1, 2
2023
-
[24]
Nsfw text classifier: A model for classify- ing nsfw content in text.https://huggingface.co/ michellejieli / NSFW _ text _ classifier, 2022
Michelle Li. Nsfw text classifier: A model for classify- ing nsfw content in text.https://huggingface.co/ michellejieli / NSFW _ text _ classifier, 2022. Accessed: 2025-11-12. 1
2022
-
[25]
Safegen: Mitigating sexually explicit content generation in text-to-image models
Xinfeng Li, Yuchen Yang, Jiangyi Deng, Chen Yan, Yan- jiao Chen, Xiaoyu Ji, and Wenyuan Xu. Safegen: Mitigating sexually explicit content generation in text-to-image models. InProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pages 4807–4821,
2024
-
[26]
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu 9 Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024. 5, 11
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[27]
Mace: Mass concept erasure in diffu- sion models
Shilin Lu, Zilan Wang, Leyang Li, Yanzhu Liu, and Adams Wai-Kin Kong. Mace: Mass concept erasure in diffu- sion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6430– 6440, 2024. 1, 2, 5, 11
2024
-
[28]
One-dimensional adapter to rule them all: Concepts diffusion models and erasing applications
Mengyao Lyu, Yuhong Yang, Haiwen Hong, Hui Chen, Xuan Jin, Yuan He, Hui Xue, Jungong Han, and Guiguang Ding. One-dimensional adapter to rule them all: Concepts diffusion models and erasing applications. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7559–7568, 2024. 1, 2, 5, 11
2024
-
[29]
A survey of machine unlearning.ACM Transactions on Intelligent Systems and Technology, 16(5): 1–46, 2025
Thanh Tam Nguyen, Thanh Trung Huynh, Zhao Ren, Phi Le Nguyen, Alan Wee-Chung Liew, Hongzhi Yin, and Quoc Viet Hung Nguyen. A survey of machine unlearning.ACM Transactions on Intelligent Systems and Technology, 16(5): 1–46, 2025. 2
2025
-
[30]
Nudenet: A nudity detection and cen- soring library.https://github.com/notAI-tech/ NudeNet, 2023
NudeNet Developers. Nudenet: A nudity detection and cen- soring library.https://github.com/notAI-tech/ NudeNet, 2023. GitHub repository. 3
2023
-
[31]
Chatgpt (gpt-5.1).https://chat.openai
OpenAI. Chatgpt (gpt-5.1).https://chat.openai. com/, 2025. Large language model accessed via web inter- face. 7
2025
-
[32]
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[33]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion mod- els for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[34]
Learning transferable visual models from natural language supervi- sion
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021. 11, 14
2021
-
[35]
Red-teaming the stable diffusion safety filter.arXiv preprint arXiv:2210.04610, 2022
Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, and Florian Tram`er. Red-teaming the stable diffusion safety filter.arXiv preprint arXiv:2210.04610, 2022. 1
-
[36]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 1, 5, 11
2022
-
[37]
Safe latent diffusion: Mitigating inappro- priate degeneration in diffusion models
Patrick Schramowski, Manuel Brack, Bj ¨orn Deiseroth, and Kristian Kersting. Safe latent diffusion: Mitigating inappro- priate degeneration in diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22522–22531, 2023. 1, 5, 11
2023
-
[38]
Defining and characterizing reward hacking
Joar Skalse, Nikolaus HR Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward hacking. InProceedings of the 36th International Con- ference on Neural Information Processing Systems, pages 9460–9471, 2022. 3
2022
-
[39]
Midjourney: A proprietary artificial in- telligence program for creating images from textual descrip- tions, 2023
MidJourney Team. Midjourney: A proprietary artificial in- telligence program for creating images from textual descrip- tions, 2023. Available athttps://www.midjourney. com. 1
2023
-
[40]
Identifying and eliminating csam in generative ml training data and models
David Thiel. Identifying and eliminating csam in generative ml training data and models. Stanford Digital Repository,
-
[41]
Available athttps://purl.stanford.edu/ kh752sm9123. 1
-
[42]
Yu-Lin Tsai, Chia-Yi Hsu, Chulin Xie, Chih-Hsun Lin, Jia- You Chen, Bo Li, Pin-Yu Chen, Chia-Mu Yu, and Chun-Ying Huang. Ring-a-bell! how reliable are concept removal meth- ods for diffusion models?arXiv preprint arXiv:2310.10012,
-
[43]
Erasediff: Erasing data influence in diffusion models.arXiv preprint arXiv:2401.05779, 2024
Jing Wu, Trung Le, Munawar Hayat, and Mehrtash Harandi. Erasediff: Erasing data influence in diffusion models.arXiv preprint arXiv:2401.05779, 2024. 2
-
[44]
Imagere- ward: Learning and evaluating human preferences for text- to-image generation.Advances in Neural Information Pro- cessing Systems, 36:15903–15935, 2023
Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagere- ward: Learning and evaluating human preferences for text- to-image generation.Advances in Neural Information Pro- cessing Systems, 36:15903–15935, 2023. 3
2023
-
[45]
Mma-diffusion: Multimodal attack on diffusion models
Yijun Yang, Ruiyuan Gao, Xiaosen Wang, Tsung-Yi Ho, Nan Xu, and Qiang Xu. Mma-diffusion: Multimodal attack on diffusion models. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 7737–7746, 2024. 2
2024
-
[46]
Sneakyprompt: Jailbreaking text-to-image generative models
Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, and Yinzhi Cao. Sneakyprompt: Jailbreaking text-to-image generative models. In2024 IEEE symposium on security and privacy (SP), pages 897–912. IEEE, 2024. 1, 2
2024
-
[47]
Forget-me-not: Learning to for- get in text-to-image diffusion models
Gong Zhang, Kai Wang, Xingqian Xu, Zhangyang Wang, and Humphrey Shi. Forget-me-not: Learning to for- get in text-to-image diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1755–1764, 2024. 1
2024
-
[48]
To gener- ate or not? safety-driven unlearned diffusion models are still easy to generate unsafe images
Yimeng Zhang, Jinghan Jia, Xin Chen, Aochuan Chen, Yi- hua Zhang, Jiancheng Liu, Ke Ding, and Sijia Liu. To gener- ate or not? safety-driven unlearned diffusion models are still easy to generate unsafe images... for now. InEuropean Con- ference on Computer Vision, pages 385–403. Springer, 2024. 2
2024
-
[49]
Zhexin Zhang, Yida Lu, Jingyuan Ma, Di Zhang, Rui Li, Pei Ke, Hao Sun, Lei Sha, Zhifang Sui, Hongning Wang, et al. Shieldlm: Empowering llms as aligned, cus- tomizable and explainable safety detectors.arXiv preprint arXiv:2402.16444, 2024. 1
-
[50]
golf ball
Haomin Zhuang, Yihua Zhang, and Sijia Liu. A pilot study of query-free adversarial attack against stable diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 2385–2392, 2023. 1, 2 10 Appendix A. Vulnerability Analysis We assess the vulnerability of two concept-removal meth- ods,UCEandReceler, to simple p...
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.