PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models

Bo Li; Chejian Xu; Guanhong Tao; Lingzhi Yuan; Wei Dong; Xiaojun Jia; Xinfeng Li; Yang Liu; Yihao Huang

arXiv: 2501.03544 · v5 · submitted 2025-01-07 · 💻 cs.CV · cs.AI · cs.CR

Lingzhi Yuan, Xinfeng Li, Chejian Xu, Guanhong Tao, Xiaojun Jia, Yihao Huang, Wei Dong, Yang Liu, Bo Li

Pith reviewed 2026-05-23 05:45 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CR

keywords: text-to-image models, content moderation, soft prompts, NSFW safety, diffusion models, safety alignment, prompt optimization, unsafe content detection

The pith

An optimized soft prompt in the text embedding space suppresses NSFW generation in text-to-image models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PromptGuard as a way to add safety controls to text-to-image models by optimizing a soft prompt that acts like a hidden system instruction inside the model's text embedding space. This prompt is designed to steer the model away from producing sexually explicit, violent, or otherwise unsafe images even when the user's input prompt tries to trigger them. A reader would care because these models are easy to misuse today and existing moderation approaches often slow down generation or require separate models. The method also splits the task by safety category, optimizes separate prompts, and merges them into one unified guard. If the approach holds, models could generate safe images by default without extra compute at inference time.

Core claim

PromptGuard optimizes a universal safety soft prompt P* inside the T2I model's textual embedding space so that it functions as an implicit system prompt to moderate NSFW inputs and produce safe yet realistic images. A divide-and-conquer strategy further optimizes category-specific soft prompts and merges them into unified safety guidance. Across five datasets the method reduces unsafe outputs while preserving benign image quality, runs 3.8 times faster than prior moderation techniques, and outperforms eight existing defenses; evaluations with a multi-head safety classifier and a VLM-based guardrail report average unsafe ratios of 5.84 percent and 6.18 percent.
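
The abstract does not state which operator merges the category-specific prompts, so the following is only a sketch under one assumption: that the per-category soft token embeddings are concatenated into a single prompt. The function name `merge_soft_prompts`, the two-dimensional toy vectors, and the fixed category order are hypothetical and not taken from the paper.

```python
from typing import Dict, List

Vector = List[float]


def merge_soft_prompts(category_prompts: Dict[str, List[Vector]]) -> List[Vector]:
    """Concatenate per-category soft tokens into one prompt.

    Assumption: concatenation stands in for the paper's merging step,
    which this page does not specify.
    """
    merged: List[Vector] = []
    # Sort category names so the merged order is reproducible.
    for name in sorted(category_prompts):
        merged.extend(category_prompts[name])
    return merged


# Toy embeddings for the four NSFW categories named in the abstract.
category_prompts = {
    "sexual": [[0.1, 0.2]],
    "violent": [[0.3, 0.4]],
    "political": [[0.5, 0.6]],
    "disturbing": [[0.7, 0.8]],
}
merged = merge_soft_prompts(category_prompts)
```

Other operators (for example averaging) would keep the token count fixed instead of growing it with the number of categories; which trade-off the authors chose has to be checked against the manuscript.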

What carries the argument

The optimized safety soft prompt P* (or merged category-specific prompts) placed in the textual embedding space that directly moderates incoming NSFW prompts.
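
A minimal sketch of how such a prompt could be applied at inference time, assuming (as the Figure 2 caption describes) that the soft token embeddings are appended after the user's prompt token embeddings. The helper `apply_soft_prompt` and the three-dimensional toy vectors are hypothetical; a real pipeline would operate on the text encoder's embedding tensors.

```python
from typing import List

Vector = List[float]


def apply_soft_prompt(prompt_embeddings: List[Vector],
                      soft_prompt: List[Vector]) -> List[Vector]:
    """Return the prompt embeddings with the soft prompt tokens appended."""
    if not soft_prompt:
        raise ValueError("soft_prompt must contain at least one token embedding")
    dim = len(soft_prompt[0])
    for vec in prompt_embeddings + soft_prompt:
        if len(vec) != dim:
            raise ValueError("all embeddings must share the same dimension")
    # Copy every vector so the caller's inputs are left untouched.
    return [list(v) for v in prompt_embeddings] + [list(v) for v in soft_prompt]


user_prompt = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1]]  # toy token embeddings
p_star = [[0.0, 0.0, 1.0]]                        # toy stand-in for the learned P*
guarded = apply_soft_prompt(user_prompt, p_star)
```

Because the guard is only extra conditioning tokens, the denoising loop itself is unchanged, which is consistent with the claim that no proxy model is needed.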

If this is right

NSFW generation drops across tested datasets while inference speed remains unchanged and no proxy model is required.
Benign prompt outputs retain high visual quality.
The single merged prompt outperforms eight prior defense methods on the same evaluation sets.
Results hold under two different safety evaluators, a multi-head safety classifier and a VLM-based guardrail, which report average unsafe ratios of 5.84 percent and 6.18 percent.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same embedding-space prompt idea could be tested on video or 3D generative models that share similar text-conditioning pathways.
Deployments might combine this prompt guard with lightweight post-filters rather than replacing them entirely.
Adversarial prompts crafted to target soft-prompt weaknesses would provide a direct stress test of the method's limits.
Retraining the underlying diffusion model with the learned prompt as an additional conditioning signal might strengthen the effect further.

Load-bearing premise

An optimized soft prompt placed in the embedding space will reliably block all NSFW categories on unseen inputs and models without creating new failure modes or lowering the quality of ordinary images.

What would settle it

Measure the unsafe generation rate when PromptGuard is applied to a text-to-image model and prompt set never seen during optimization; if the unsafe ratio stays above 10 percent the central claim is weakened.
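
The check above can be sketched as follows. The per-image labels are assumed to come from some external safety classifier; the function names and the toy label lists are hypothetical, and only the 10 percent threshold is taken from the text.

```python
from typing import Sequence


def unsafe_ratio(labels: Sequence[bool]) -> float:
    """Fraction of generated images flagged unsafe (True means unsafe)."""
    if not labels:
        raise ValueError("need at least one labelled image")
    return sum(1 for flagged in labels if flagged) / len(labels)


def claim_weakened(labels: Sequence[bool], threshold: float = 0.10) -> bool:
    """True when the unsafe ratio stays above the threshold."""
    return unsafe_ratio(labels) > threshold


# Toy outcomes on a hypothetical held-out model and prompt set.
held_out_ok = [True] * 3 + [False] * 47   # 3 of 50 flagged
held_out_bad = [True] * 6 + [False] * 44  # 6 of 50 flagged
```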

Figures

Figures reproduced from arXiv: 2501.03544 by Bo Li, Chejian Xu, Guanhong Tao, Lingzhi Yuan, Wei Dong, Xiaojun Jia, Xinfeng Li, Yang Liu, Yihao Huang.

**Figure 2.** Diagram of PromptGuard. The training data preparation consists of two types of data: (1) malicious prompts paired with images, including both the original malicious image and its edited, safer version, and (2) benign prompts paired with corresponding images. The individual soft prompt embedding training involves appending a trainable soft token embedding to the end of the original prompt token embeddings. …

**Figure 3.** SDEdit [19] could help to build fine-grained image pairs for malicious data by modifying only the unsafe visual region.

**Figure 4.** PromptGuard successfully moderates the unsafe content across four categories. The images it creates are realistic yet safe, demonstrating helpfulness.

**Figure 5.** Adversarial robustness against three red-teaming settings: …

**Figure 6.** Variation in images generated by the same malicious prompt with …

**Figure 7.** Detailed comparison of NSFW moderation across different baselines.

**Figure 8.** Detailed comparison of benign image preservation across different baselines.

Original abstract

Recent text-to-image (T2I) models have exhibited remarkable performance in generating high-quality images from text descriptions. However, these models are vulnerable to misuse, particularly generating not-safe-for-work (NSFW) content, such as sexually explicit, violent, political, and disturbing images, raising serious ethical concerns. In this work, we present PromptGuard, a novel content moderation technique that draws inspiration from the system prompt mechanism in large language models (LLMs) for safety alignment. Unlike LLMs, T2I models lack a direct interface for enforcing behavioral guidelines. Our key idea is to optimize a safety soft prompt that functions as an implicit system prompt within the T2I model's textual embedding space. This universal soft prompt (P*) directly moderates NSFW inputs, enabling safe yet realistic image generation without affecting inference efficiency or requiring proxy models. We further enhance its reliability and helpfulness through a divide-and-conquer strategy that optimizes category-specific soft prompts and combines them into unified safety guidance. Extensive experiments across five datasets demonstrate that PromptGuard effectively mitigates NSFW content generation while preserving high-quality benign outputs. PromptGuard is 3.8 times faster than prior content moderation methods while outperforming eight state-of-the-art defenses. Evaluations using both a multi-head safety classifier and a VLM-based guardrail further confirm its robustness, with average unsafe ratios of 5.84% and 6.18%, respectively. Our code and dataset are available at https://t2i-promptguard.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper introduces PromptGuard, a content moderation technique for text-to-image (T2I) models that optimizes a universal safety soft prompt (P*) in the textual embedding space to act as an implicit system prompt suppressing NSFW generation. It employs a divide-and-conquer strategy optimizing category-specific soft prompts and merging them for unified guidance. Experiments across five datasets claim that PromptGuard outperforms eight state-of-the-art defenses, is 3.8 times faster than prior methods, achieves average unsafe ratios of 5.84% (multi-head classifier) and 6.18% (VLM guardrail), and preserves high-quality benign outputs without affecting inference efficiency.

Significance. If the empirical claims hold with proper verification of generalization and ablations, PromptGuard would offer a practical, efficient, model-agnostic approach to T2I safety that avoids proxy models or inference overhead, adapting the system-prompt concept from LLMs to generative vision models. The divide-and-conquer merging strategy, if shown to preserve per-category effectiveness without new artifacts, could be a useful template for other safety interventions in embedding spaces.

major comments (3)

[Abstract] Abstract (key idea paragraph): the central claim that an optimized soft prompt (or its merged version) functions as a universal implicit system prompt suppressing all NSFW categories on unseen inputs rests on an unspecified optimization objective and merging operator; without these, it is impossible to assess whether the procedure produces guidance that generalizes beyond the five datasets or merely overfits the training prompts.
[Abstract] Abstract (experimental claims): the reported unsafe ratios and outperformance over eight baselines are presented without any description of the loss function, optimization procedure, dataset composition, statistical significance tests, or error bars; these omissions are load-bearing because the soundness of the performance comparison cannot be evaluated from the given information.
[Abstract] Abstract (benign quality claim): the assertion that PromptGuard preserves high-quality benign outputs is stated without reference to any quantitative metrics (FID, CLIP score, or human evaluation) or ablation comparing merged vs. single-category prompts; this directly undermines the claim that the method avoids degrading benign generation.
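
The error bars requested in the second comment could, for instance, be computed as a Wilson score interval on the unsafe ratio. This is a sketch of one standard choice, not the procedure used in the paper; the sample counts are hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(unsafe: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0 or not 0 <= unsafe <= total:
        raise ValueError("require 0 <= unsafe <= total and total > 0")
    p = unsafe / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical evaluation: 6 unsafe images out of 100 generations.
low, high = wilson_interval(unsafe=6, total=100)
```

With counts this small the interval is wide, which illustrates why a point estimate near 6 percent is hard to compare across baselines without the sample size.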

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback on our work. Below, we provide point-by-point responses to the major comments raised.

Referee: [Abstract] Abstract (key idea paragraph): the central claim that an optimized soft prompt (or its merged version) functions as a universal implicit system prompt suppressing all NSFW categories on unseen inputs rests on an unspecified optimization objective and merging operator; without these, it is impossible to assess whether the procedure produces guidance that generalizes beyond the five datasets or merely overfits the training prompts.

Authors: The optimization objective and merging operator are fully specified in Section 3 of the manuscript. The abstract provides a high-level overview of the approach. We will revise the abstract to include a concise reference to the optimization objective and merging strategy to address this concern.

revision: yes

Referee: [Abstract] Abstract (experimental claims): the reported unsafe ratios and outperformance over eight baselines are presented without any description of the loss function, optimization procedure, dataset composition, statistical significance tests, or error bars; these omissions are load-bearing because the soundness of the performance comparison cannot be evaluated from the given information.

Authors: The loss function, optimization procedure, dataset composition, statistical significance tests, and error bars are described in Sections 3 and 4. The abstract summarizes the main experimental outcomes. We will update the abstract to briefly note the evaluation details.

revision: yes

Referee: [Abstract] Abstract (benign quality claim): the assertion that PromptGuard preserves high-quality benign outputs is stated without reference to any quantitative metrics (FID, CLIP score, or human evaluation) or ablation comparing merged vs. single-category prompts; this directly undermines the claim that the method avoids degrading benign generation.

Authors: Quantitative metrics such as FID and CLIP scores, along with ablations on merged versus single-category prompts, are provided in Section 4. The abstract's claim is backed by these results. We will add a reference to these metrics in the revised abstract.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical optimization and evaluation on external data

The paper proposes PromptGuard as an empirical method that optimizes soft prompts (and category-specific variants) in the embedding space of T2I models, then evaluates the resulting unsafe ratios on five datasets using independent classifiers and VLM guardrails. No derivation, uniqueness theorem, or prediction is claimed; performance numbers (3.8x speedup, 5.84% unsafe ratio) are measured outcomes rather than quantities forced by construction from fitted parameters or self-citations. The central premise is therefore externally falsifiable and does not reduce to its own inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The method rests on the domain assumption that soft prompts can function as implicit system prompts in T2I embedding spaces; the soft prompt itself is the fitted object but is the output of the proposed technique rather than an ad-hoc constant.

free parameters (1)

safety soft prompt P*
The universal and category-specific prompts are optimized on safety data; their values are learned rather than derived from first principles.

axioms (1)

domain assumption: T2I models possess a textual embedding space in which an optimized soft prompt can act as an implicit behavioral guideline equivalent to an LLM system prompt.
This is the key idea stated in the abstract that enables the entire approach.
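
To make "learned rather than derived" concrete, here is a toy sketch of fitting a soft prompt by gradient descent. It is not the paper's objective: the frozen "model" is just an element-wise sum of prompt and soft prompt, and the loss (pull toward a safe target, push away from an unsafe one with weight `lam`) is an assumed stand-in. All names and values are hypothetical.

```python
from typing import List

Vector = List[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def toy_loss(soft: Vector, prompt: Vector, safe: Vector, unsafe: Vector,
             lam: float = 0.25) -> float:
    """Distance to the safe target minus lam times distance to the unsafe one."""
    out = [p + s for p, s in zip(prompt, soft)]
    return sq_dist(out, safe) - lam * sq_dist(out, unsafe)


def optimize_soft_prompt(prompt: Vector, safe: Vector, unsafe: Vector,
                         lam: float = 0.25, lr: float = 0.1,
                         steps: int = 200) -> Vector:
    """Plain gradient descent on toy_loss; only the soft prompt is updated."""
    soft = [0.0] * len(prompt)
    for _ in range(steps):
        out = [p + s for p, s in zip(prompt, soft)]
        # Analytic gradient of toy_loss with respect to each soft coordinate.
        grad = [2 * (o - t) - 2 * lam * (o - u)
                for o, t, u in zip(out, safe, unsafe)]
        soft = [s - lr * g for s, g in zip(soft, grad)]
    return soft


prompt = [1.0, 0.0]         # toy embedding of an unsafe request
safe_target = [0.0, 1.0]    # toy safe representation
unsafe_target = [1.0, 0.0]  # toy unsafe representation
p_star = optimize_soft_prompt(prompt, safe_target, unsafe_target)
guided = [p + s for p, s in zip(prompt, p_star)]
```

Keeping `lam` below 1 makes this toy loss convex, so the descent settles at a point that is closer to the safe target and farther from the unsafe one than the unguided prompt.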

pith-pipeline@v0.9.0 · 5836 in / 1383 out tokens · 39567 ms · 2026-05-23T05:45:12.155361+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

“Our key idea is to optimize a safety soft prompt that functions as an implicit system prompt within the T2I model’s textual embedding space.”

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

“minimizing Lm encourages P∗ to guide the predicted noise to stay far from the original unsafe vision while becoming closer to the safe vision representations.”

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SPOT: Selective Prompt Projection via Total Variation for Inference-Only Safe Text-to-Image Generation
cs.AI · 2026-01 · unverdicted · novelty 6.0

SPOT projects prompts to a tau-safe set via total variation to cut inappropriate content 14-44% relative to baselines while preserving benign prompt behavior in frozen T2I models.

Dynamic Eraser for Guided Concept Erasure in Diffusion Models
cs.CV · 2026-04 · unverdicted · novelty 5.0

DSS is a lightweight inference-time framework that erases concepts in diffusion models at 91% average rate while preserving image fidelity, outperforming prior methods.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · cited by 2 Pith papers · 5 internal anchors

[1] Stable Diffusion V1-4
M. V. L. G. LMU, “Stable Diffusion V1-4,” https://huggingface.co/CompVis/stable-diffusion-v1-4

[2] AI Porn Is Easy to Make Now. For Women, That’s a Nightmare
T. Hunter, “AI Porn Is Easy to Make Now. For Women, That’s a Nightmare.” https://www.washingtonpost.com/technology/2023/02/13/ai-porn-deepfakes-women-consent

[3] Spotting the Deepfakes in This Year of Elections: How AI Detection Tools Work and Where They Fail
R. V. L. Shirin Anlen, “Spotting the Deepfakes in This Year of Elections: How AI Detection Tools Work and Where They Fail,” https://reutersinstitute.politics.ox.ac.uk/news/spotting-deepfakes-year-elections-how-ai-detection-tools-work-and-where-they-fail, 2024

[4] Text-to-image AI Models Can Be Tricked Into Generating Disturbing Images
R. Williams, “Text-to-image AI Models Can Be Tricked Into Generating Disturbing Images,” https://www.technologyreview.com/2023/11/17/1083593/text-to-image-ai-models-can-be-tricked-into-generating-disturbing-images, 2023

[5] AI-created Child Sexual Abuse Images ‘Threaten to Overwhelm Internet’
D. Milmo, “AI-created Child Sexual Abuse Images ‘Threaten to Overwhelm Internet’,” https://www.theguardian.com/technology/2023/oct/25/ai-created-child-sexual-abuse-images-threaten-overwhelm-internet

[6] 2024: The Election Year of Deepfakes, Doubts and Disinformation?
A. Owen, “2024: The Election Year of Deepfakes, Doubts and Disinformation?” https://onfido.com/blog/deepfakes-and-disinformation/

[7] Erasing Concepts from Diffusion Models
R. Gandikota, J. Materzynska, J. Fiotto-Kaufman, and D. Bau, “Erasing Concepts from Diffusion Models,” in IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023

[8] Unified Concept Editing in Diffusion Models
R. Gandikota, H. Orgad, Y. Belinkov, J. Materzynska, and D. Bau, “Unified Concept Editing in Diffusion Models,” in IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2024, Waikoloa, HI, USA, January 3-8, 2024

[9] SafeGen: Mitigating Sexually Explicit Content Generation in Text-to-Image Models
X. Li, Y. Yang, J. Deng, C. Yan, Y. Chen, X. Ji, and W. Xu, “SafeGen: Mitigating Sexually Explicit Content Generation in Text-to-Image Models,” in Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security (CCS), 2024

[10] Direct Unlearning Optimization for Robust and Safe Text-to-image Models
Y. Park, S. Yun, J. Kim, J. Kim, G. Jang, Y. Jeong, J. Jo, and G. Lee, “Direct Unlearning Optimization for Robust and Safe Text-to-image Models,” CoRR, vol. abs/2407.21035, 2024

[11] Stable Diffusion V2-1
S. AI, “Stable Diffusion V2-1,” https://huggingface.co/stabilityai/stable-diffusion-2-1

[12] Towards Safe Self-distillation of Internet-scale Text-to-image Diffusion Models
S. Kim, S. Jung, B. Kim, M. Choi, J. Shin, and J. Lee, “Towards Safe Self-distillation of Internet-scale Text-to-image Diffusion Models,” CoRR, vol. abs/2307.05977, 2023

[13] Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models
Y. Zhang, X. Chen, J. Jia, Y. Zhang, C. Fan, J. Liu, M. Hong, K. Ding, and S. Liu, “Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models,” CoRR, vol. abs/2405.15234, 2024

[14] NSFW Text Classifier on Hugging Face
M. Li, “NSFW Text Classifier on Hugging Face,” https://huggingface.co/michellejieli/NSFW_text_classifier

[15] Safety Checker
M. V. L. G. LMU, “Safety Checker,” https://huggingface.co/CompVis/stable-diffusion-safety-checker

[16] Universal Prompt Optimizer for Safe Text-to-image Generation
Z. Wu, H. Gao, Y. Wang, X. Zhang, and S. Wang, “Universal Prompt Optimizer for Safe Text-to-image Generation,” in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024, Mexico City, Mexico, June 16-21, 2024, K. Duh, H. Gómez-Adorno,...

[17] GPT Documentation
OpenAI, “GPT Documentation,” https://platform.openai.com/docs/guides/chat/introduction, 2022

[18] DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models
B. Wang, W. Chen, H. Pei, C. Xie, M. Kang, C. Zhang, C. Xu, Z. Xiong, R. Dutta, R. Schaeffer, S. T. Truong, S. Arora, M. Mazeika, D. Hendrycks, Z. Lin, Y. Cheng, S. Koyejo, D. Song, and B. Li, “DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models,” in Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA,...

[19] SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations
C. Meng, Y. He, Y. Song, J. Song, J. Wu, J. Zhu, and S. Ermon, “SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations,” in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022

[20] Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-image Models
Y. Qu, X. Shen, X. He, M. Backes, S. Zannettou, and Y. Zhang, “Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-image Models,” in Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, CCS 2023, Copenhagen, Denmark, November 26-30, 2023, W. Meng, C. D. Jensen, C. Cremers, and E. Kirda, Eds.

[21] Towards Understanding Unsafe Video Generation
Y. Pang, A. Xiong, Y. Zhang, and T. Wang, “Towards Understanding Unsafe Video Generation,” CoRR, vol. abs/2407.12581, 2024

[22] Latent Guard: a Safety Framework for Text-to-image Generation
R. Liu, A. Khakzar, J. Gu, Q. Chen, P. Torr, and F. Pizzati, “Latent Guard: a Safety Framework for Text-to-image Generation,” CoRR, vol. abs/2404.08031, 2024

[23] Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models
P. Schramowski, M. Brack, B. Deiseroth, and K. Kersting, “Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023

[24] Denoising Diffusion Probabilistic Models
J. Ho, A. Jain, and P. Abbeel, “Denoising Diffusion Probabilistic Models,” in Advances in Neural Information Processing Systems (NeurIPS), December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds.

[25] High-resolution Image Synthesis with Latent Diffusion Models
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution Image Synthesis with Latent Diffusion Models,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022

[26] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, J. Burstein, C. Doran, and T. Solorio, Eds., 2019

[27] LAION-5B: an Open Large-scale Dataset for Training Next Generation Image-text Models
C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. Kundurthy, K. Crowson, L. Schmidt, R. Kaczmarczyk, and J. Jitsev, “LAION-5B: an Open Large-scale Dataset for Training Next Generation Image-text Models,” in Advances in Neural Information Processing Systems (NeurIPS), N...

[28] Diffusion Lens: Interpreting Text Encoders in Text-to-image Pipelines
M. Toker, H. Orgad, M. Ventura, D. Arad, and Y. Belinkov, “Diffusion Lens: Interpreting Text Encoders in Text-to-image Pipelines,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, L. Ku, A. Martins, and V. Srikumar, Eds.

[29] The Prompt Report: A Systematic Survey of Prompt Engineering Techniques
S. Schulhoff, M. Ilie, N. Balepur, K. Kahadze, A. Liu, C. Si, Y. Li, A. Gupta, H. Han, S. Schulhoff, P. S. Dulepet, S. Vidyadhara, D. Ki, S. Agrawal, C. Pham, G. Kroiz, F. Li, H. Tao, A. Srivastava, H. D. Costa, S. Gupta, M. L. Rogers, I. Goncearenco, G. Sarli, I. Galynker, D. Peskoff, M. Carpuat, J. White, S. Anadkat, A. Hoyle, and P. Resnik, “The promp...

[30] Safety system messages in llm
M. Azure, “Safety system messages in llm,” 2024, accessed: 2025-03-08. [Online]. Available: https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/system-message?tabs=top-techniques

[31] On Prompt-driven Safeguarding for Large Language Models
C. Zheng, F. Yin, H. Zhou, F. Meng, J. Zhou, K. Chang, M. Huang, and N. Peng, “On Prompt-driven Safeguarding for Large Language Models,” in Forty-first International Conference on Machine Learning (ICML), Vienna, Austria, July 21-27, 2024

[32] The Power of Scale for Parameter-efficient Prompt Tuning
B. Lester, R. Al-Rfou, and N. Constant, “The Power of Scale for Parameter-efficient Prompt Tuning,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds.

[33] Prefix-Tuning: Optimizing Continuous Prompts for Generation
X. L. Li and P. Liang, “Prefix-Tuning: Optimizing Continuous Prompts for Generation,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, C. Zong, F. Xia, W. Li, and R....

[34] NSFW Data Scraper
A. Kim, “NSFW Data Scraper,” https://github.com/alex000kim/nsfw_data_scraper

[35] GPT-4o Mini: Advancing Cost-efficient Intelligence
OpenAI, “GPT-4o Mini: Advancing Cost-efficient Intelligence,” https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/

[36] Scholar gpt
“Scholar gpt,” https://chatgpt.com/g/g-kZ0eYXlJe-scholar-gpt

[37] GPT-4 Technical Report
J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “GPT-4 Technical Report,” arXiv preprint arXiv:2303.08774, 2023

[38] Microsoft COCO: Common Objects in Context
T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár, “Microsoft coco: Common objects in context,” 2015. [Online]. Available: https://arxiv.org/abs/1405.0312

[39] Inappropriate Image Prompts (I2P)
A. I. M. L. L. at TU Darmstadt, “Inappropriate Image Prompts (I2P),” https://huggingface.co/datasets/AIML-TUDA/i2p

[40] SneakyPrompt: Jailbreaking Text-to-image Generative Models
Y. Yang, B. Hui, H. Yuan, N. Gong, and Y. Cao, “SneakyPrompt: Jailbreaking Text-to-image Generative Models,” in IEEE Symposium on Security and Privacy, SP 2024, San Francisco, CA, USA, May 19-23, 2024

[41] MMA-Diffusion: MultiModal Attack on Diffusion Models
Y. Yang, R. Gao, X. Wang, T.-Y. Ho, N. Xu, and Q. Xu, “MMA-Diffusion: MultiModal Attack on Diffusion Models,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024

[42] Learning Transferable Visual Models From Natural Language Supervision
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning Transferable Visual Models From Natural Language Supervision,” in Proceedings of the 38th International Conference on Machine Learning (ICML), 18-24 July 2021, Virtual Event, ser. Proceedings of Machine Lea...

[43] The Unreasonable Effectiveness of Deep Features as a Perceptual Metric
R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The Unreasonable Effectiveness of Deep Features as a Perceptual Metric,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018

[44] SafeGen-Pretrained-Weights
X. Li, Y. Yang, J. Deng, et al., “SafeGen-Pretrained-Weights,” https://huggingface.co/LetterJohn/SafeGen-Pretrained-Weights, 2024

[45] Diffusers: State-of-the-art diffusion models
P. von Platen, S. Patil, A. Lozhkov, P. Cuenca, N. Lambert, K. Rasul, M. Davaadorj, D. Nair, S. Paul, W. Berman, Y. Xu, S. Liu, and T. Wolf, “Diffusers: State-of-the-art diffusion models,” https://github.com/huggingface/diffusers, 2022

[46] Safetydpo: Scalable safety alignment for text-to-image generation
R. Liu, C. I. Chieh, J. Gu, J. Zhang, R. Pi, Q. Chen, P. Torr, A. Khakzar, and F. Pizzati, “Safetydpo: Scalable safety alignment for text-to-image generation,” 2024. [Online]. Available: https://arxiv.org/abs/2412.10493

[47] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” 2023. [Online]. Available: https://arxiv.org/abs/1910.10683

[48] Unified concept editing in diffusion models
“Unified concept editing in diffusion models,” https://github.com/rohitgandikota/unified-concept-editing

[49] Safe Stable Diffusion
A. I. M. L. L. at TU Darmstadt, “Safe Stable Diffusion,” https://huggingface.co/AIML-TUDA/stable-diffusion-safe

[50] Universal prompt optimizer for safe text-to-image generation
“Universal prompt optimizer for safe text-to-image generation,” https://github.com/Wu-Zongyu/POSI

[51] SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach, “SDXL: Improving Latent Diffusion Models for High-resolution Image Synthesis,” arXiv, vol. abs/2307.01952, 2023

[52]

DeepFloyd IF,

D. Lab, “DeepFloyd IF,” https://github.com/deep-floyd/IF. 1 APPENDIX A. Additional Experiment Setup

work page
Test Benchmark: We create a comprehensive test benchmark using three representative datasets, incorporating diverse prompts from four NSFW categories and benign content:
• I2P: Inappropriate Image Prompts [39] consist of manually tailored NSFW text prompts on lexica.art, from which we select violent, political, and disturbing prompts, excluding sexually ...
• NSFW-200: To compensate for the shortcomings of the I2P dataset in pornographic data, we use the NSFW dataset from [40] for the sexual category.
Evaluation Metrics: Additional details of the four metrics used for evaluation are as follows:
• [NSFW Removal] Unsafe Ratio: The unsafe ratio is calculated using the multi-headed safety classifier (Multi-headed SC) introduced by [20]. For each generated image, the Multi-headed SC determines whether it falls into a “safe” category or one of several “unsafe ...
Baselines: We compare PromptGuard with eight baselines, each exemplifying the latest anti-NSFW countermeasures. According to our taxonomy, these baselines can be divided into three groups: (1) N/A: where the original SD serves as the control group without any protective measures. (2) Model Alignment: modifies the T2I model directly by fine-tuning or retrai...
Implementation Details: We implement PromptGuard using Python 3.9, PyTorch 2.4.0 and Diffusers 0.30.0.dev0 on an Ubuntu 20.04.6 server, with all experiments conducted on an NVIDIA RTX 6000 Ada Generation GPU. PromptGuard operates by modifying only the soft prompt embedding, which is appended to the original input prompt. In line with prior work [7], [9],...
Impact of λ Across NSFW Categories: Similar to the results and analysis in V-E1, increasing the value of λ encourages P∗ to lose its ability to generate unsafe images during latent denoising. Figure 6 illustrates the variations in images generated by the model with embeddings trained using different values of λ.
NSFW Content Moderation: Figure 7 illustrates PromptGuard’s effectiveness in moderating NSFW content generation across various unsafe categories while preserving its helpfulness.

Benign Preservation: Figure 8 highlights PromptGuard’s ability to faithfully generate images from benign input prompts, outperforming other baselines.
Cross-Category Generalization of Individual Soft Prompt Embedding: In this subsection, we explore the transferability of a single soft prompt embedding trained on one NSFW category and test its effectiveness on prompts from various unseen ...

[Fig. 6. Variation in images ge... Columns: λ = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7. Rows: Sexually Explicit, Violent, Political, Disturbing.]
Exploration on Number of Benign Categories: Our initial six categories were selected based on the COCO dataset [38]. To further investigate the impact of benign prompt diversity, we introduce two additional categories: Technologies ...

[Fig. ... Rows: Sexually Explicit, Violent, Political, Disturbing. Methods: Ours, SDv1.4, SLD Strong, SLD Max, POSI, SDv2.1, SafeGen, UCE.]
Transferring our framework to other T2I models: Stable Diffusion V1.5. The Stable-Diffusion-v1-5 checkpoint was initialized from Stable-Diffusion-v1-2 and fine-tuned for 595k steps at a resolution of 512x512 on the “laion-aesthetics v2 5+” dataset, with 10% dropout of text-conditioning to improve ...

[Figure: benign categories Animals, Food, Human beings, Landscapes, Transport Vehicles. Methods: Ours, SDv1.4...]

[1] M. V. L. G. LMU, “Stable Diffusion V1-4,” https://huggingface.co/CompVis/stable-diffusion-v1-4

[2] T. Hunter, “AI Porn Is Easy to Make Now. For Women, That’s a Nightmare.” https://www.washingtonpost.com/technology/2023/02/13/ai-porn-deepfakes-women-consent.

[TABLE VIII. Performance of PromptGuard under adversarial attacks compared with eight baselines. Header: Type (None, Model Align., Content Moderation); Adversarial Algorithm; SDv1.4, SDv2.1, UCE, SafeGen, Safet...]

[3] R. V. L. Shirin Anlen, “Spotting the Deepfakes in This Year of Elections: How AI Detection Tools Work and Where They Fail,” https://reutersinstitute.politics.ox.ac.uk/news/spotting-deepfakes-year-elections-how-ai-detection-tools-work-and-where-they-fail, 2024.

[4] R. Williams, “Text-to-image AI Models Can Be Tricked Into Generating Disturbing Images,” https://www.technologyreview.com/2023/11/17/1083593/text-to-image-ai-models-can-be-tricked-into-generating-disturbing-images, 2023.

[5] D. Milmo, “AI-created Child Sexual Abuse Images ‘Threaten to Overwhelm Internet’,” https://www.theguardian.com/technology/2023/oct/25/ai-created-child-sexual-abuse-images-threaten-overwhelm-internet

[6] A. Owen, “2024: The Election Year of Deepfakes, Doubts and Disinformation?” https://onfido.com/blog/deepfakes-and-disinformation/

[7] R. Gandikota, J. Materzynska, J. Fiotto-Kaufman, and D. Bau, “Erasing Concepts from Diffusion Models,” in IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023.

[8] R. Gandikota, H. Orgad, Y. Belinkov, J. Materzynska, and D. Bau, “Unified Concept Editing in Diffusion Models,” in IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2024, Waikoloa, HI, USA, January 3-8, 2024.

[9] X. Li, Y. Yang, J. Deng, C. Yan, Y. Chen, X. Ji, and W. Xu, “SafeGen: Mitigating Sexually Explicit Content Generation in Text-to-Image Models,” in Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security (CCS), 2024.

[10] Y. Park, S. Yun, J. Kim, J. Kim, G. Jang, Y. Jeong, J. Jo, and G. Lee, “Direct Unlearning Optimization for Robust and Safe Text-to-image Models,” CoRR, vol. abs/2407.21035, 2024.

[11] S. AI, “Stable Diffusion V2-1,” https://huggingface.co/stabilityai/stable-diffusion-2-1

[12] S. Kim, S. Jung, B. Kim, M. Choi, J. Shin, and J. Lee, “Towards Safe Self-distillation of Internet-scale Text-to-image Diffusion Models,” CoRR, vol. abs/2307.05977, 2023.

[13] Y. Zhang, X. Chen, J. Jia, Y. Zhang, C. Fan, J. Liu, M. Hong, K. Ding, and S. Liu, “Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models,” CoRR, vol. abs/2405.15234, 2024.

[14] M. Li, “NSFW Text Classifier on Hugging Face,” https://huggingface.co/michellejieli/NSFW_text_classifier

[15] M. V. L. G. LMU, “Safety Checker,” https://huggingface.co/CompVis/stable-diffusion-safety-checker

[16] Z. Wu, H. Gao, Y. Wang, X. Zhang, and S. Wang, “Universal Prompt Optimizer for Safe Text-to-image Generation,” in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024, Mexico City, Mexico, June 16-21, 2024, K. Duh, H. Gómez-Adorno,...

[17] OpenAI, “GPT Documentation,” https://platform.openai.com/docs/guides/chat/introduction, 2022.

[18] B. Wang, W. Chen, H. Pei, C. Xie, M. Kang, C. Zhang, C. Xu, Z. Xiong, R. Dutta, R. Schaeffer, S. T. Truong, S. Arora, M. Mazeika, D. Hendrycks, Z. Lin, Y. Cheng, S. Koyejo, D. Song, and B. Li, “DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models,” in Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA,...

[19] C. Meng, Y. He, Y. Song, J. Song, J. Wu, J. Zhu, and S. Ermon, “SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations,” in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022.

[20] Y. Qu, X. Shen, X. He, M. Backes, S. Zannettou, and Y. Zhang, “Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-image Models,” in Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, CCS 2023, Copenhagen, Denmark, November 26-30, 2023, W. Meng, C. D. Jensen, C. Cremers, and E. Kirda, Eds.

[21] Y. Pang, A. Xiong, Y. Zhang, and T. Wang, “Towards Understanding Unsafe Video Generation,” CoRR, vol. abs/2407.12581, 2024.

[22] R. Liu, A. Khakzar, J. Gu, Q. Chen, P. Torr, and F. Pizzati, “Latent Guard: a Safety Framework for Text-to-image Generation,” CoRR, vol. abs/2404.08031, 2024.

[23] P. Schramowski, M. Brack, B. Deiseroth, and K. Kersting, “Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023.

[24] J. Ho, A. Jain, and P. Abbeel, “Denoising Diffusion Probabilistic Models,” in Advances in Neural Information Processing Systems (NeurIPS), December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds.

[25] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution Image Synthesis with Latent Diffusion Models,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022.

[26] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, J. Burstein, C. Doran, and T. Solorio, Eds., 2019.

[27] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. Kundurthy, K. Crowson, L. Schmidt, R. Kaczmarczyk, and J. Jitsev, “LAION-5B: an Open Large-scale Dataset for Training Next Generation Image-text Models,” in Advances in Neural Information Processing Systems (NeurIPS), N...

[28] M. Toker, H. Orgad, M. Ventura, D. Arad, and Y. Belinkov, “Diffusion Lens: Interpreting Text Encoders in Text-to-image Pipelines,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, L. Ku, A. Martins, and V. Srikumar, Eds.

[29] S. Schulhoff, M. Ilie, N. Balepur, K. Kahadze, A. Liu, C. Si, Y. Li, A. Gupta, H. Han, S. Schulhoff, P. S. Dulepet, S. Vidyadhara, D. Ki, S. Agrawal, C. Pham, G. Kroiz, F. Li, H. Tao, A. Srivastava, H. D. Costa, S. Gupta, M. L. Rogers, I. Goncearenco, G. Sarli, I. Galynker, D. Peskoff, M. Carpuat, J. White, S. Anadkat, A. Hoyle, and P. Resnik, “The promp...

[30] M. Azure, “Safety system messages in LLM,” 2024, accessed: 2025-03-08. [Online]. Available: https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/system-message?tabs=top-techniques

[31] C. Zheng, F. Yin, H. Zhou, F. Meng, J. Zhou, K. Chang, M. Huang, and N. Peng, “On Prompt-driven Safeguarding for Large Language Models,” in Forty-first International Conference on Machine Learning (ICML), Vienna, Austria, July 21-27, 2024.

[32] B. Lester, R. Al-Rfou, and N. Constant, “The Power of Scale for Parameter-efficient Prompt Tuning,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds.

[33] X. L. Li and P. Liang, “Prefix-Tuning: Optimizing Continuous Prompts for Generation,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, C. Zong, F. Xia, W. Li, and R....

[34] A. Kim, “NSFW Data Scraper,” https://github.com/alex000kim/nsfw_data_scraper

[35] OpenAI, “GPT-4o Mini: Advancing Cost-efficient Intelligence,” https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/

[36] “Scholar GPT,” https://chatgpt.com/g/g-kZ0eYXlJe-scholar-gpt

[37] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “GPT-4 Technical Report,” arXiv preprint arXiv:2303.08774, 2023.

[38] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár, “Microsoft COCO: Common Objects in Context,” 2015. [Online]. Available: https://arxiv.org/abs/1405.0312

[39] A. I. M. L. L. at TU Darmstadt, “Inappropriate Image Prompts (I2P),” https://huggingface.co/datasets/AIML-TUDA/i2p

[40] Y. Yang, B. Hui, H. Yuan, N. Gong, and Y. Cao, “SneakyPrompt: Jailbreaking Text-to-image Generative Models,” in IEEE Symposium on Security and Privacy, SP 2024, San Francisco, CA, USA, May 19-23, 2024.

[41] Y. Yang, R. Gao, X. Wang, T.-Y. Ho, N. Xu, and Q. Xu, “MMA-Diffusion: MultiModal Attack on Diffusion Models,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[42] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning Transferable Visual Models From Natural Language Supervision,” in Proceedings of the 38th International Conference on Machine Learning (ICML), 18-24 July 2021, Virtual Event, ser. Proceedings of Machine Lea...

[43] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The Unreasonable Effectiveness of Deep Features as a Perceptual Metric,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018.
