When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models

Alex Jinpeng Wang; Fangming Liu; Haochen Han; Jiacheng Hou; Ruochong Jin; Wai Kin Victor Chan; Yining Sun

arxiv: 2602.10179 · v2 · pith:2XK55FDCnew · submitted 2026-02-10 · 💻 cs.CV · cs.AI

When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models

Jiacheng Hou , Yining Sun , Ruochong Jin , Haochen Han , Fangming Liu , Wai Kin Victor Chan , Alex Jinpeng Wang This is my paper

classification 💻 cs.CV cs.AI

keywords modelseditingimageattackvisualjailbreaklargebecomes

0 comments

read the original abstract

Recent advances in large image editing models have shifted the paradigm from text-driven instructions to vision-prompt editing, where user intent is inferred directly from visual inputs such as marks, arrows, and visual-text prompts. While this paradigm greatly expands usability, it also introduces a critical and underexplored safety risk: the attack surface itself becomes visual. In this work, we propose Vision-Centric Jailbreak Attack (VJA), the first visual-to-visual jailbreak attack that conveys malicious instructions purely through visual inputs. To systematically study this emerging threat, we introduce IESBench, a safety-oriented benchmark for image editing models. Extensive experiments on IESBench demonstrate that VJA effectively compromises state-of-the-art commercial models, achieving attack success rates of up to 80.9% on Nano Banana Pro and 70.1% on GPT-Image-1.5. To mitigate this vulnerability, we propose a training-free defense based on introspective multimodal reasoning, which substantially improves the safety of poorly aligned models to a level comparable with commercial systems, without auxiliary guard models and with negligible computational overhead. Our findings expose new vulnerabilities, provide both a benchmark and practical defense to advance safe and trustworthy modern image editing systems. Warning: This paper contains offensive images created by large image editing models.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Generate "Normal", Edit Poisoned: Branding Injection via Hint Embedding in Image Editing
cs.CR 2026-05 unverdicted novelty 7.0

Invisible hints such as logos embedded in images are re-rendered by diffusion models during text-guided editing, enabling phishing and model-poisoning attacks with average success rates of 44.4% and 32.2%.