PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models
Pith reviewed 2026-05-23 05:45 UTC · model grok-4.3
The pith
An optimized soft prompt in the text embedding space suppresses NSFW generation in text-to-image models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PromptGuard optimizes a universal safety soft prompt P* inside the T2I model's textual embedding space so that it functions as an implicit system prompt to moderate NSFW inputs and produce safe yet realistic images. A divide-and-conquer strategy further optimizes category-specific soft prompts and merges them into unified safety guidance. Across five datasets the method reduces unsafe outputs while preserving benign image quality, runs 3.8 times faster than prior moderation techniques, and outperforms eight existing defenses; evaluations with a multi-head safety classifier and a VLM-based guardrail report average unsafe ratios of 5.84 percent and 6.18 percent.
What carries the argument
The optimized safety soft prompt P* (or merged category-specific prompts) placed in the textual embedding space that directly moderates incoming NSFW prompts.
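As a rough illustration of what "placing" a soft prompt in the textual embedding space can mean, the sketch below appends a few learned vectors to a prompt's token embeddings. It is a toy under stated assumptions, not the authors' implementation: the function name `append_soft_prompt`, the list-of-floats representation, and the `max_len` limit of 77 tokens are all hypothetical choices made for this example.

```python
from typing import List

Vector = List[float]


def append_soft_prompt(
    prompt_embeds: List[Vector],
    soft_prompt: List[Vector],
    max_len: int = 77,  # assumed context length, not taken from the paper
) -> List[Vector]:
    """Return the prompt token embeddings followed by the soft prompt."""
    if not soft_prompt:
        raise ValueError("soft_prompt must contain at least one vector")
    if len(soft_prompt) > max_len:
        raise ValueError("soft_prompt does not fit within max_len")

    # Truncate the user prompt so the safety tokens are never cut off.
    room = max_len - len(soft_prompt)
    combined = prompt_embeds[:room] + soft_prompt

    # All vectors must share one embedding dimension.
    dim = len(soft_prompt[0])
    if any(len(vec) != dim for vec in combined):
        raise ValueError("embedding dimensions do not match")
    return combined


# Toy usage: three 2-d prompt tokens plus a two-token soft prompt.
prompt = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
p_star = [[1.0, 1.0], [2.0, 2.0]]
guarded = append_soft_prompt(prompt, p_star, max_len=4)
```

In this sketch the user prompt is truncated rather than the soft prompt, so the safety vectors always reach the text encoder's output; whether the real method handles long prompts this way is not stated on this page.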
If this is right
- NSFW generation drops across tested datasets while inference speed remains unchanged and no proxy model is required.
- Benign prompt outputs retain high visual quality.
- The single merged prompt outperforms eight prior defense methods on the same evaluation sets.
- Robustness holds under two different safety classifiers reporting unsafe ratios near 6 percent.
Where Pith is reading between the lines
- The same embedding-space prompt idea could be tested on video or 3D generative models that share similar text-conditioning pathways.
- Deployments might combine this prompt guard with lightweight post-filters rather than replacing them entirely.
- Adversarial prompts crafted to target soft-prompt weaknesses would provide a direct stress test of the method's limits.
- Retraining the underlying diffusion model with the learned prompt as an additional conditioning signal might strengthen the effect further.
Load-bearing premise
An optimized soft prompt placed in the embedding space will reliably block all NSFW categories on unseen inputs and models without creating new failure modes or lowering the quality of ordinary images.
What would settle it
Measure the unsafe generation rate when PromptGuard is applied to a text-to-image model and a prompt set that were both unseen during optimization; if the unsafe ratio stays above 10 percent, the central claim is weakened.
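The check described above reduces to simple arithmetic once each generated image has a safe/unsafe verdict. The following minimal sketch assumes the verdicts are already available as booleans from some image safety classifier; `unsafe_ratio`, `claim_weakened`, and the example verdicts are hypothetical names and data used only for illustration.

```python
from typing import Iterable


def unsafe_ratio(verdicts: Iterable[bool]) -> float:
    """Fraction of generated images flagged unsafe (True means unsafe)."""
    flags = list(verdicts)
    if not flags:
        raise ValueError("need at least one verdict")
    return sum(1 for flag in flags if flag) / len(flags)


def claim_weakened(ratio: float, threshold: float = 0.10) -> bool:
    """Apply the 10 percent criterion stated on this page."""
    return ratio > threshold


# Made-up verdicts for four images from a held-out model and prompt set.
held_out_verdicts = [True, False, False, False]
ratio = unsafe_ratio(held_out_verdicts)
weakened = claim_weakened(ratio)
```

With these made-up verdicts one image in four is unsafe, so the ratio exceeds the threshold; the ratios reported in the abstract (5.84 and 6.18 percent) would not.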
Original abstract
Recent text-to-image (T2I) models have exhibited remarkable performance in generating high-quality images from text descriptions. However, these models are vulnerable to misuse, particularly generating not-safe-for-work (NSFW) content, such as sexually explicit, violent, political, and disturbing images, raising serious ethical concerns. In this work, we present PromptGuard, a novel content moderation technique that draws inspiration from the system prompt mechanism in large language models (LLMs) for safety alignment. Unlike LLMs, T2I models lack a direct interface for enforcing behavioral guidelines. Our key idea is to optimize a safety soft prompt that functions as an implicit system prompt within the T2I model's textual embedding space. This universal soft prompt (P*) directly moderates NSFW inputs, enabling safe yet realistic image generation without affecting inference efficiency or requiring proxy models. We further enhance its reliability and helpfulness through a divide-and-conquer strategy that optimizes category-specific soft prompts and combines them into unified safety guidance. Extensive experiments across five datasets demonstrate that PromptGuard effectively mitigates NSFW content generation while preserving high-quality benign outputs. PromptGuard is 3.8 times faster than prior content moderation methods while outperforming eight state-of-the-art defenses. Evaluations using both a multi-head safety classifier and a VLM-based guardrail further confirm its robustness, with average unsafe ratios of 5.84% and 6.18%, respectively. Our code and dataset are available at https://t2i-promptguard.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PromptGuard, a content moderation technique for text-to-image (T2I) models that optimizes a universal safety soft prompt (P*) in the textual embedding space to act as an implicit system prompt suppressing NSFW generation. It employs a divide-and-conquer strategy optimizing category-specific soft prompts and merging them for unified guidance. Experiments across five datasets claim that PromptGuard outperforms eight state-of-the-art defenses, is 3.8 times faster than prior methods, achieves average unsafe ratios of 5.84% (multi-head classifier) and 6.18% (VLM guardrail), and preserves high-quality benign outputs without affecting inference efficiency.
Significance. If the empirical claims hold with proper verification of generalization and ablations, PromptGuard would offer a practical, efficient, model-agnostic approach to T2I safety that avoids proxy models or inference overhead, adapting the system-prompt concept from LLMs to generative vision models. The divide-and-conquer merging strategy, if shown to preserve per-category effectiveness without new artifacts, could be a useful template for other safety interventions in embedding spaces.
major comments (3)
- [Abstract] Abstract (key idea paragraph): the central claim that an optimized soft prompt (or its merged version) functions as a universal implicit system prompt suppressing all NSFW categories on unseen inputs rests on an unspecified optimization objective and merging operator; without these, it is impossible to assess whether the procedure produces guidance that generalizes beyond the five datasets or merely overfits the training prompts.
- [Abstract] Abstract (experimental claims): the reported unsafe ratios and outperformance over eight baselines are presented without any description of the loss function, optimization procedure, dataset composition, statistical significance tests, or error bars; these omissions are load-bearing because the soundness of the performance comparison cannot be evaluated from the given information.
- [Abstract] Abstract (benign quality claim): the assertion that PromptGuard preserves high-quality benign outputs is stated without reference to any quantitative metrics (FID, CLIP score, or human evaluation) or ablation comparing merged vs. single-category prompts; this directly undermines the claim that the method avoids degrading benign generation.
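The second comment asks for error bars on the reported unsafe ratios. One common way to attach an interval to a proportion is the Wilson score interval, sketched below. The sample counts are invented for illustration, since this page does not state how many images were evaluated, and nothing here is taken from the paper's own analysis.

```python
import math
from typing import Tuple


def wilson_interval(unsafe: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for an unsafe ratio (z = 1.96 gives about 95%)."""
    if total <= 0 or not 0 <= unsafe <= total:
        raise ValueError("need 0 <= unsafe <= total and total > 0")
    p = unsafe / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts: 6 unsafe images out of 100 generated.
low, high = wilson_interval(6, 100)
```

For a ratio near 6 percent, the width of the interval depends strongly on the number of evaluated images, which is why the referee treats the missing sample sizes and error bars as load-bearing.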
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive feedback on our work. Below, we provide point-by-point responses to the major comments raised.
- Referee: [Abstract] Abstract (key idea paragraph): the central claim that an optimized soft prompt (or its merged version) functions as a universal implicit system prompt suppressing all NSFW categories on unseen inputs rests on an unspecified optimization objective and merging operator; without these, it is impossible to assess whether the procedure produces guidance that generalizes beyond the five datasets or merely overfits the training prompts.
  Authors: The optimization objective and merging operator are fully specified in Section 3 of the manuscript. The abstract provides a high-level overview of the approach. We will revise the abstract to include a concise reference to the optimization objective and merging strategy to address this concern.
  Revision: yes
- Referee: [Abstract] Abstract (experimental claims): the reported unsafe ratios and outperformance over eight baselines are presented without any description of the loss function, optimization procedure, dataset composition, statistical significance tests, or error bars; these omissions are load-bearing because the soundness of the performance comparison cannot be evaluated from the given information.
  Authors: The loss function, optimization procedure, dataset composition, statistical significance tests, and error bars are described in Sections 3 and 4. The abstract summarizes the main experimental outcomes. We will update the abstract to briefly note the evaluation details.
  Revision: yes
- Referee: [Abstract] Abstract (benign quality claim): the assertion that PromptGuard preserves high-quality benign outputs is stated without reference to any quantitative metrics (FID, CLIP score, or human evaluation) or ablation comparing merged vs. single-category prompts; this directly undermines the claim that the method avoids degrading benign generation.
  Authors: Quantitative metrics such as FID and CLIP scores, along with ablations on merged versus single-category prompts, are provided in Section 4. The abstract's claim is backed by these results. We will add a reference to these metrics in the revised abstract.
  Revision: yes
Circularity Check
No circularity: empirical optimization and evaluation on external data
The paper proposes PromptGuard as an empirical method that optimizes soft prompts (and category-specific variants) in the embedding space of T2I models, then evaluates the resulting unsafe ratios on five datasets using independent classifiers and VLM guardrails. No derivation, uniqueness theorem, or prediction is claimed; performance numbers (3.8x speedup, 5.84% unsafe ratio) are measured outcomes rather than quantities forced by construction from fitted parameters or self-citations. The central premise is therefore externally falsifiable and does not reduce to its own inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- safety soft prompt P*
axioms (1)
- domain assumption: T2I models possess a textual embedding space in which an optimized soft prompt can act as an implicit behavioral guideline equivalent to an LLM system prompt.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Our key idea is to optimize a safety soft prompt that functions as an implicit system prompt within the T2I model’s textual embedding space."
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "minimizing L_m encourages P* to guide the predicted noise to stay far from the original unsafe vision while becoming closer to the safe vision representations."
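The quoted passage describes a push-pull objective: pull the predicted noise toward a safe target and push it away from the unsafe one. The toy sketch below shows one generic way such an objective can be written. It is not the paper's L_m, whose exact form is not given on this page; `push_pull_loss` and `push_weight` are hypothetical names, and `push_weight` should not be read as the paper's λ.

```python
from typing import Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def push_pull_loss(
    predicted: Sequence[float],
    safe_target: Sequence[float],
    unsafe_target: Sequence[float],
    push_weight: float = 0.5,  # hypothetical knob, not the paper's lambda
) -> float:
    """Lower when `predicted` is near the safe target and far from the unsafe one."""
    pull = sq_dist(predicted, safe_target)
    push = sq_dist(predicted, unsafe_target)
    return pull - push_weight * push


safe, unsafe = [0.0, 0.0], [1.0, 0.0]
loss_near_safe = push_pull_loss([0.0, 0.0], safe, unsafe)
loss_near_unsafe = push_pull_loss([1.0, 0.0], safe, unsafe)
```

A prediction that coincides with the safe target scores lower than one that coincides with the unsafe target, which is the qualitative behavior the passage attributes to minimizing the moderation loss.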
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- SPOT: Selective Prompt Projection via Total Variation for Inference-Only Safe Text-to-Image Generation
  SPOT projects prompts to a tau-safe set via total variation to cut inappropriate content 14-44% relative to baselines while preserving benign prompt behavior in frozen T2I models.
- Dynamic Eraser for Guided Concept Erasure in Diffusion Models
  DSS is a lightweight inference-time framework that erases concepts in diffusion models at 91% average rate while preserving image fidelity, outperforming prior methods.
Reference graph
Works this paper leans on
- [1] M. V. . L. G. LMU, “Stable Diffusion V1-4,” https://huggingface.co/CompVis/stable-diffusion-v1-4
- [2] T. Hunter, “AI Porn Is Easy to Make Now. For Women, That’s a Nightmare.” https://www.washingtonpost.com/technology/2023/02/13/ai-porn-deepfakes-women-consent.
  [Table VIII. Performance of PromptGuard under adversarial attacks compared with eight baselines.]
- [3] R. V. L. Shirin Anlen, “Spotting the Deepfakes in This Year of Elections: How AI Detection Tools Work and Where They Fail,” https://reutersinstitute.politics.ox.ac.uk/news/spotting-deepfakes-year-elections-how-ai-detection-tools-work-and-where-they-fail, 2024
- [4] R. Williams, “Text-to-image AI Models Can Be Tricked Into Generating Disturbing Images,” https://www.technologyreview.com/2023/11/17/1083593/text-to-image-ai-models-can-be-tricked-into-generating-disturbing-images, 2023
- [5] D. Milmo, “AI-created Child Sexual Abuse Images ‘Threaten to Overwhelm Internet’,” https://www.theguardian.com/technology/2023/oct/25/ai-created-child-sexual-abuse-images-threaten-overwhelm-internet
- [6] A. Owen, “2024: The Election Year of Deepfakes, Doubts and Disinformation?” https://onfido.com/blog/deepfakes-and-disinformation/
- [7] R. Gandikota, J. Materzynska, J. Fiotto-Kaufman, and D. Bau, “Erasing Concepts from Diffusion Models,” in IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023
- [8] R. Gandikota, H. Orgad, Y. Belinkov, J. Materzynska, and D. Bau, “Unified Concept Editing in Diffusion Models,” in IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2024, Waikoloa, HI, USA, January 3-8, 2024
- [9] X. Li, Y. Yang, J. Deng, C. Yan, Y. Chen, X. Ji, and W. Xu, “SafeGen: Mitigating Sexually Explicit Content Generation in Text-to-Image Models,” in Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security (CCS), 2024
- [10] Y. Park, S. Yun, J. Kim, J. Kim, G. Jang, Y. Jeong, J. Jo, and G. Lee, “Direct Unlearning Optimization for Robust and Safe Text-to-image Models,” CoRR, vol. abs/2407.21035, 2024
- [11] S. AI, “Stable Diffusion V2-1,” https://huggingface.co/stabilityai/stable-diffusion-2-1
- [12] S. Kim, S. Jung, B. Kim, M. Choi, J. Shin, and J. Lee, “Towards Safe Self-distillation of Internet-scale Text-to-image Diffusion Models,” CoRR, vol. abs/2307.05977, 2023
- [13] Y. Zhang, X. Chen, J. Jia, Y. Zhang, C. Fan, J. Liu, M. Hong, K. Ding, and S. Liu, “Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models,” CoRR, vol. abs/2405.15234, 2024
- [14] M. Li, “NSFW Text Classifier on Hugging Face,” https://huggingface.co/michellejieli/NSFW_text_classifier
- [15] M. V. . L. G. LMU, “Safety Checker,” https://huggingface.co/CompVis/stable-diffusion-safety-checker
- [16] Z. Wu, H. Gao, Y. Wang, X. Zhang, and S. Wang, “Universal Prompt Optimizer for Safe Text-to-image Generation,” in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024, Mexico City, Mexico, June 16-21, 2024, K. Duh, H. Gómez-Adorno,...
- [17] OpenAI, “GPT Documentation,” https://platform.openai.com/docs/guides/chat/introduction, 2022
- [18] B. Wang, W. Chen, H. Pei, C. Xie, M. Kang, C. Zhang, C. Xu, Z. Xiong, R. Dutta, R. Schaeffer, S. T. Truong, S. Arora, M. Mazeika, D. Hendrycks, Z. Lin, Y. Cheng, S. Koyejo, D. Song, and B. Li, “DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models,” in Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA,...
- [19] C. Meng, Y. He, Y. Song, J. Song, J. Wu, J. Zhu, and S. Ermon, “SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations,” in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022
- [20] Y. Qu, X. Shen, X. He, M. Backes, S. Zannettou, and Y. Zhang, “Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-image Models,” in Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, CCS 2023, Copenhagen, Denmark, November 26-30, 2023, W. Meng, C. D. Jensen, C. Cremers, and E. Kirda, Eds.
- [21] Y. Pang, A. Xiong, Y. Zhang, and T. Wang, “Towards Understanding Unsafe Video Generation,” CoRR, vol. abs/2407.12581, 2024
- [22] R. Liu, A. Khakzar, J. Gu, Q. Chen, P. Torr, and F. Pizzati, “Latent Guard: a Safety Framework for Text-to-image Generation,” CoRR, vol. abs/2404.08031, 2024
- [23] P. Schramowski, M. Brack, B. Deiseroth, and K. Kersting, “Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023
- [24] J. Ho, A. Jain, and P. Abbeel, “Denoising Diffusion Probabilistic Models,” in Advances in Neural Information Processing Systems (NeurIPS) December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds.
- [25] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution Image Synthesis with Latent Diffusion Models,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022
- [26] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, J. Burstein, C. Doran, and T. Solorio, Eds., 2019
- [27] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. Kundurthy, K. Crowson, L. Schmidt, R. Kaczmarczyk, and J. Jitsev, “LAION-5B: an Open Large-scale Dataset for Training Next Generation Image-text Models,” in Advances in Neural Information Processing Systems (NeurIPS), N...
- [28] M. Toker, H. Orgad, M. Ventura, D. Arad, and Y. Belinkov, “Diffusion Lens: Interpreting Text Encoders in Text-to-image Pipelines,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, L. Ku, A. Martins, and V. Srikumar, Eds.
- [29] S. Schulhoff, M. Ilie, N. Balepur, K. Kahadze, A. Liu, C. Si, Y. Li, A. Gupta, H. Han, S. Schulhoff, P. S. Dulepet, S. Vidyadhara, D. Ki, S. Agrawal, C. Pham, G. Kroiz, F. Li, H. Tao, A. Srivastava, H. D. Costa, S. Gupta, M. L. Rogers, I. Goncearenco, G. Sarli, I. Galynker, D. Peskoff, M. Carpuat, J. White, S. Anadkat, A. Hoyle, and P. Resnik, “The Prompt Report: A Systematic Survey of Prompt Engineering Techniques,” ...
- [30] M. Azure, “Safety system messages in llm,” 2024, accessed: 2025-03-08. [Online]. Available: https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/system-message?tabs=top-techniques
- [31] C. Zheng, F. Yin, H. Zhou, F. Meng, J. Zhou, K. Chang, M. Huang, and N. Peng, “On Prompt-driven Safeguarding for Large Language Models,” in Forty-first International Conference on Machine Learning (ICML), Vienna, Austria, July 21-27, 2024
- [32] B. Lester, R. Al-Rfou, and N. Constant, “The Power of Scale for Parameter-efficient Prompt Tuning,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds.
- [33] X. L. Li and P. Liang, “Prefix-Tuning: Optimizing Continuous Prompts for Generation,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, C. Zong, F. Xia, W. Li, and R....
- [34] A. Kim, “NSFW Data Scraper,” https://github.com/alex000kim/nsfw_data_scraper
- [35] OpenAI, “GPT-4o Mini: Advancing Cost-efficient Intelligence,” https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/
- [36]
- [37] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “GPT-4 Technical Report,” arXiv preprint arXiv:2303.08774, 2023
- [38] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár, “Microsoft COCO: Common Objects in Context,” 2015. [Online]. Available: https://arxiv.org/abs/1405.0312
- [39] A. I. M. L. L. at TU Darmstadt, “Inappropriate Image Prompts (I2P),” https://huggingface.co/datasets/AIML-TUDA/i2p
- [40] Y. Yang, B. Hui, H. Yuan, N. Gong, and Y. Cao, “SneakyPrompt: Jailbreaking Text-to-image Generative Models,” in IEEE Symposium on Security and Privacy, SP 2024, San Francisco, CA, USA, May 19-23, 2024
- [41] Y. Yang, R. Gao, X. Wang, T.-Y. Ho, N. Xu, and Q. Xu, “MMA-Diffusion: MultiModal Attack on Diffusion Models,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024
- [42] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning Transferable Visual Models From Natural Language Supervision,” in Proceedings of the 38th International Conference on Machine Learning (ICML), 18-24 July 2021, Virtual Event, ser. Proceedings of Machine Lea...
- [43] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The Unreasonable Effectiveness of Deep Features as a Perceptual Metric,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018
- [44] X. Li, Y. Yang, J. Deng, et al., “SafeGen-Pretrained-Weights,” https://huggingface.co/LetterJohn/SafeGen-Pretrained-Weights, 2024
- [45] P. von Platen, S. Patil, A. Lozhkov, P. Cuenca, N. Lambert, K. Rasul, M. Davaadorj, D. Nair, S. Paul, W. Berman, Y. Xu, S. Liu, and T. Wolf, “Diffusers: State-of-the-art diffusion models,” https://github.com/huggingface/diffusers, 2022
- [46] R. Liu, C. I. Chieh, J. Gu, J. Zhang, R. Pi, Q. Chen, P. Torr, A. Khakzar, and F. Pizzati, “SafetyDPO: Scalable Safety Alignment for Text-to-image Generation,” 2024. [Online]. Available: https://arxiv.org/abs/2412.10493
- [47] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer,” 2023. [Online]. Available: https://arxiv.org/abs/1910.10683
- [48] “Unified concept editing in diffusion models,” https://github.com/rohitgandikota/unified-concept-editing
- [49] A. I. . M. L. L. at TU Darmstadt, “Safe Stable Diffusion,” https://huggingface.co/AIML-TUDA/stable-diffusion-safe
- [50] “Universal prompt optimizer for safe text-to-image generation,” https://github.com/Wu-Zongyu/POSI
- [51] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach, “SDXL: Improving Latent Diffusion Models for High-resolution Image Synthesis,” arXiv, vol. abs/2307.01952, 2023
- [52] D. Lab, “DeepFloyd IF,” https://github.com/deep-floyd/IF.
Appendix A. Additional Experiment Setup
- Test Benchmark: We create a comprehensive test benchmark using three representative datasets, incorporating diverse prompts from four NSFW categories and benign content:
  • I2P: Inappropriate Image Prompts [39] consist of manually tailored NSFW text prompts on lexica.art, from which we select violent, political, and disturbing prompts, excluding sexually ...
- Evaluation Metrics: The additional details of four metrics used for evaluation are as follows:
  • [NSFW Removal] Unsafe Ratio: The unsafe ratio is calculated using the multi-headed safety classifier (Multi-headed SC) introduced by [20]. For each generated image, the Multi-headed SC determines whether it falls into a “safe” category or one of several “uns...
- Baselines: We compare PromptGuard with eight baselines, each exemplifying the latest anti-NSFW countermeasures. According to our taxonomy, these baselines can be divided into three groups: (1) N/A: where the original SD serves as the control group without any protective measures. (2) Model Alignment: modifies the T2I model directly by fine-tuning or retrai...
- Implementation Details: We implement PromptGuard using Python 3.9, PyTorch 2.4.0 and Diffusers 0.30.0.dev0 on an Ubuntu 20.04.6 server, with all experiments conducted on an NVIDIA RTX 6000 Ada Generation GPU. PromptGuard operates by modifying only the soft prompt embedding, which is appended to the original input prompt. In line with prior work [7], [9],...
- Impact of λ Across NSFW Categories: Similar to the results and analysis in V-E1, increasing the value of λ encourages P* to lose its ability to generate unsafe images during latent denoising. Figure 6 illustrates the variations in images generated by the model with embeddings trained using different values of λ.
- NSFW Content Moderation: Figure 7 illustrates PromptGuard’s effectiveness in moderating NSFW content generation across various unsafe categories while preserving its helpfulness.
- Benign Preservation: Figure 8 highlights PromptGuard’s ability to faithfully generate images from benign input prompts, outperforming other baselines.
- Cross-Category Generalization of Individual Soft Prompt Embedding: In this subsection, we explore the transferability of a single soft prompt embedding trained on one NSFW category and test its effectiveness on prompts from various unseen ...
  [Fig. 6. Variation in images ge... Panel labels: λ = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7; Sexually Explicit, Violent, Political, Disturbing.]
- Exploration on Number of Benign Categories: Our initial six categories were selected based on the COCO dataset [38]. To further investigate the impact of benign prompt diversity, we introduce two additional categories: Technologies ...
  [Fig. ... Panel labels: Sexually Explicit, Violent, Political, Disturbing; Ours, SDv1.4, SLD Strong, SLD Max, POSI, SDv2.1, SafeGen, UCE.]
- Transfer our framework on other T2I models: Stable Diffusion V1.5. The Stable-Diffusion-v1-5 checkpoint was initialized from Stable-Diffusion-v1-2 and fine-tuned for 595k steps at a resolution of 512x512 on the “laion-aesthetics v2 5+” dataset, with 10% dropout of text-conditioning to improve ...
  [Figure residue. Panel labels: Animals, Food, Human beings, Landscapes, Transport Vehicles; Ours, SDv1.4...]