Promptguard: Soft prompt-guided unsafe content moderation for text-to- image models.CoRR, abs/2501.03544

Yuan, L · 2025 · cs.CV · arXiv 2501.03544

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

open full Pith review browse 2 citing papers arXiv PDF

abstract

Recent text-to-image (T2I) models have exhibited remarkable performance in generating high-quality images from text descriptions. However, these models are vulnerable to misuse, particularly generating not-safe-for-work (NSFW) content, such as sexually explicit, violent, political, and disturbing images, raising serious ethical concerns. In this work, we present PromptGuard, a novel content moderation technique that draws inspiration from the system prompt mechanism in large language models (LLMs) for safety alignment. Unlike LLMs, T2I models lack a direct interface for enforcing behavioral guidelines. Our key idea is to optimize a safety soft prompt that functions as an implicit system prompt within the T2I model's textual embedding space. This universal soft prompt (P*) directly moderates NSFW inputs, enabling safe yet realistic image generation without affecting inference efficiency or requiring proxy models. We further enhance its reliability and helpfulness through a divide-and-conquer strategy that optimizes category-specific soft prompts and combines them into unified safety guidance. Extensive experiments across five datasets demonstrate that PromptGuard effectively mitigates NSFW content generation while preserving high-quality benign outputs. PromptGuard is 3.8 times faster than prior content moderation methods while outperforming eight state-of-the-art defenses. Evaluations using both a multi-head safety classifier and a VLM-based guardrail further confirm its robustness, with average unsafe ratios of 5.84% and 6.18%, respectively. Our code and dataset are available at https://t2i-promptguard.github.io/.

representative citing papers

SPOT: Selective Prompt Projection via Total Variation for Inference-Only Safe Text-to-Image Generation

cs.AI · 2026-01-31 · unverdicted · novelty 6.0

SPOT projects prompts to a tau-safe set via total variation to cut inappropriate content 14-44% relative to baselines while preserving benign prompt behavior in frozen T2I models.

Dynamic Eraser for Guided Concept Erasure in Diffusion Models

cs.CV · 2026-04-13 · unverdicted · novelty 5.0

DSS is a lightweight inference-time framework that erases concepts in diffusion models at 91% average rate while preserving image fidelity, outperforming prior methods.

citing papers explorer

Showing 2 of 2 citing papers.

SPOT: Selective Prompt Projection via Total Variation for Inference-Only Safe Text-to-Image Generation cs.AI · 2026-01-31 · unverdicted · none · ref 14 · internal anchor
SPOT projects prompts to a tau-safe set via total variation to cut inappropriate content 14-44% relative to baselines while preserving benign prompt behavior in frozen T2I models.
Dynamic Eraser for Guided Concept Erasure in Diffusion Models cs.CV · 2026-04-13 · unverdicted · none · ref 35 · internal anchor
DSS is a lightweight inference-time framework that erases concepts in diffusion models at 91% average rate while preserving image fidelity, outperforming prior methods.

Promptguard: Soft prompt-guided unsafe content moderation for text-to- image models.CoRR, abs/2501.03544

fields

years

verdicts

representative citing papers

citing papers explorer