arxiv: 2505.18979 · v2 · pith:75LRLKFFnew · submitted 2025-05-25 · 💻 cs.LG

Dynamic Optimization and Safety Indicator Injection for Jailbreaking Text-to-Image Models with Multimodal Safety Filters

classification 💻 cs.LG
keywords: filters, safety, defenses, dynamic, injection, multimodal, optimization, semantic

Text-to-image (T2I) models can generate not-safe-for-work (NSFW) content, motivating multi-stage safety pipelines with both text and image filters. Newer LLM-based filters detect latent intent beyond keywords, making token-level perturbation attacks unreliable. Our evaluation further shows that existing jailbreak methods exhibit a sharp trade-off between filter evasion and semantic fidelity, and require an excessive number of queries to succeed. We introduce OptJail, an automated jailbreak framework that combines dynamic prompt optimization with multimodal feedback. It consists of two key components: (i) Dynamic Optimization, an iterative process that leverages text-filter feedback and semantic consistency to rewrite prompts into adversarial variants; and (ii) Adaptive Safety Indicator Injection, which formulates the injection of benign visual cues as a reinforcement learning problem to bypass image-level filters. OptJail achieves state-of-the-art performance, increasing the ShieldLM-7B bypass rate from 8.9% (SneakyPrompt) to 99.0% while improving the CLIP score from 0.2637 to 0.2762. Moreover, it generalizes to unseen filters and successfully jailbreaks DALL·E 3 in our evaluation. Mechanistic analysis reveals why these defenses fail: optimized prompts are projected into the "safe" region of the filter's representation space yet remain nearly stationary in the generative model's semantic space, and injected safety indicators redirect image detectors' attention away from NSFW content toward benign visual cues. This study reveals systemic vulnerabilities in current multimodal defenses and motivates stronger adaptive defenses.
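
The abstract reports semantic fidelity as a CLIP score but does not define it. A minimal sketch of how such a score is commonly computed follows, assuming it is the cosine similarity between a text embedding and an image embedding. The function name and the toy vectors are hypothetical and are not taken from the paper; real embeddings would come from a CLIP encoder.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_u * norm_v)


# Toy stand-ins for a prompt embedding and an image embedding (assumed values).
text_embedding = [0.2, 0.1, 0.7]
image_embedding = [0.1, 0.3, 0.6]

score = cosine_similarity(text_embedding, image_embedding)
print(f"similarity: {score:.4f}")
```

Under this assumption, a higher score means the generated image stays closer to the meaning of the original prompt, which is the sense in which the abstract contrasts 0.2637 with 0.2762.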


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Whispers in the Noise: Surrogate-Guided Concept Awakening via a Multi-Agent Framework

    cs.AI 2026-05 unverdicted novelty 6.0

    ConceptAgent is a black-box multi-agent system that awakens erased concepts in diffusion models by initializing denoising trajectories from surrogate-guided noisy states.