VPA-Guard: Defending and Benchmarking Image-to-Video Generation Against Visual Prompt Attacks
Pith reviewed 2026-06-25 21:08 UTC · model grok-4.3
The pith
Image-to-video models interpret visual cues in input images as instructions that can trigger harmful video content.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that visual prompts in images act as executable instructions for image-to-video models, enabling attacks that unfold into harmful actions. VVA-Bench provides the first categorized collection of such attacks and shows that leading systems reach attack success rates up to 100 percent. VPA-Guard counters this by retrieving relevant examples and using few-shot reasoning to detect latent harmful intent, lowering average attack success by 44.2 percent and harmfulness scores by 73.4 percent while keeping utility for legitimate edits intact.
What carries the argument
VPA-Guard, a retrieval-augmented self-evolving defense that applies few-shot reasoning on retrieved examples to identify latent malicious intents in visual prompts.
If this is right
- Standardized testing with VVA-Bench can quantify safety across additional image-to-video systems.
- Retrieval-augmented reasoning of the type used in VPA-Guard can be applied at generation time without removing the controllability that visual prompts provide.
- Models equipped with the defense maintain their ability to follow user-directed edits that do not contain hidden harmful instructions.
- The measured reductions in attack success and harmfulness scores hold across the models evaluated in the benchmark.
Where Pith is reading between the lines
- Similar retrieval-based detection could be tested on text-to-video or audio-to-video pipelines that also accept visual conditioning.
- Attack categories may shift over time, requiring periodic expansion of the benchmark to keep the defense effective.
- Interface designers could expose the guard as an optional layer so users retain full control when they want it.
- The same few-shot pattern might help surface other subtle misalignments between intended and generated motion in controllable generators.
Load-bearing premise
The attacks collected and categorized in VVA-Bench stand in for the real-world visual prompt threats that models will encounter, and few-shot reasoning on retrieved examples can reliably separate malicious from benign intents.
What would settle it
A new collection of visual prompt attacks drawn from outside the VVA-Bench categories that still achieves high success rates against models protected by VPA-Guard, or a test showing that the defense blocks a large fraction of ordinary legitimate image edits.
Figures
read the original abstract
Recent advancements in Image-to-Video (I2V) generation have transformed input images from simple appearance references into interactive control interfaces where visual cues such as arrows, sketches, and emojis orchestrate complex video dynamics with unprecedented controllability. However, these seemingly innocuous static cues can be interpreted by models as executable temporal instructions, unfolding into harmful actions in the generated videos. Despite the severity of this threat, existing safety benchmarks remain predominantly focused on text-based and content-only image-based jailbreaks, leaving implicit visual prompt attacks insufficiently explored. To bridge this gap, we present VVA-Bench, the first systematic benchmark for evaluating video generation safety under categorized vision-centric prompt attacks. Extensive experiments on VVA-Bench demonstrate that state-of-the-art models are highly susceptible to such attacks, with Attack Success Rates (ASR) reaching 100.0\% on Wan 2.7 and 74.8\% on Veo 3.1. To mitigate these risks, we propose VPA-Guard, a retrieval-augmented and self-evolving defense framework. By leveraging few-shot reasoning to identify latent malicious intents, our method reduces the attack ASR by 44.2\% and the harmfulness score by 73.4\% on average, while maintaining the model's utility for legitimate user edits. Our work provides both a rigorous benchmark and an effective defense strategy to advance safe and socially responsible multimodal generation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces VVA-Bench as the first benchmark for categorized vision-centric prompt attacks on image-to-video (I2V) generation models and proposes VPA-Guard, a retrieval-augmented self-evolving defense using few-shot reasoning to detect latent malicious intents. It reports that state-of-the-art I2V models are highly susceptible, with ASR reaching 100.0% on Wan 2.7 and 74.8% on Veo 3.1, and claims that VPA-Guard reduces average ASR by 44.2% and harmfulness score by 73.4% while preserving utility on legitimate edits.
Significance. If the benchmark construction is representative and the defense mechanism is validated, the work fills a gap in safety evaluation for multimodal generative models by shifting focus from text-based or content-only image jailbreaks to implicit visual prompt attacks that models interpret as temporal instructions. The combination of a new benchmark and a practical defense strategy would be a meaningful contribution to responsible development of I2V systems.
major comments (2)
- [Abstract] Abstract: the central quantitative claims (ASR values, 44.2% reduction, 73.4% harmfulness drop) are stated without any description of the experimental protocol, VVA-Bench construction method, attack taxonomy validation, or metrics for benign-edit utility, rendering the support for the susceptibility and mitigation results impossible to assess.
- [VPA-Guard framework] VPA-Guard description: the retrieval index construction, self-evolving mechanism, and few-shot reasoning procedure are not specified, nor are any ablation studies or benign-prompt performance metrics provided; these omissions directly undermine evaluation of the load-bearing assumption that the defense reliably flags malicious intents without degrading legitimate edits.
minor comments (1)
- [Abstract] The abstract would benefit from stating the number of attack categories and total prompts in VVA-Bench to give immediate context for the reported ASR figures.
Simulated Author's Rebuttal
We thank the referee for the constructive comments highlighting areas where additional clarity is needed. We will revise the manuscript to incorporate more detail in the abstract and expand the VPA-Guard description with the requested specifications, ablations, and metrics.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central quantitative claims (ASR values, 44.2% reduction, 73.4% harmfulness drop) are stated without any description of the experimental protocol, VVA-Bench construction method, attack taxonomy validation, or metrics for benign-edit utility, rendering the support for the susceptibility and mitigation results impossible to assess.
Authors: We agree that the abstract is a high-level summary and does not detail the experimental protocol, VVA-Bench construction, taxonomy validation, or benign-edit utility metrics. These elements are described in the main body (Sections 3 and 4). To address the concern, we will revise the abstract to briefly reference the benchmark construction approach, evaluation protocol, and utility metrics while remaining within length limits. revision: yes
-
Referee: [VPA-Guard framework] VPA-Guard description: the retrieval index construction, self-evolving mechanism, and few-shot reasoning procedure are not specified, nor are any ablation studies or benign-prompt performance metrics provided; these omissions directly undermine evaluation of the load-bearing assumption that the defense reliably flags malicious intents without degrading legitimate edits.
Authors: We acknowledge that the VPA-Guard section currently lacks explicit details on retrieval index construction, the self-evolving mechanism, and the few-shot reasoning procedure, and does not include ablation studies or benign-prompt metrics. In the revision we will expand the framework description to specify these components, add ablation studies, and report benign-prompt performance results to directly validate that malicious intent detection does not degrade legitimate edits. revision: yes
Circularity Check
No circularity in empirical benchmark and defense evaluation
full rationale
The paper presents VVA-Bench as a new benchmark and VPA-Guard as a retrieval-augmented few-shot defense, reporting empirical ASR and harmfulness metrics on external models (Wan 2.7, Veo 3.1). No equations, parameter fits, or derivation steps exist that could reduce to self-defined quantities or self-citations. All claims rest on external experimental outcomes rather than internal construction, satisfying the self-contained empirical case with score 0.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Visual cues such as arrows, sketches, and emojis can function as executable temporal instructions in I2V models
Reference graph
Works this paper leans on
-
[1]
2024 , eprint =
Image Hijacks: Adversarial Images can Control Generative Models at Runtime , author =. 2024 , eprint =
2024
-
[2]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
Instructpix2pix: Learning to follow image editing instructions , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[3]
Advances in Neural Information Processing Systems , volume=
Jailbreakbench: An open robustness benchmark for jailbreaking large language models , author=. Advances in Neural Information Processing Systems , volume=
-
[4]
2026 , eprint =
Visually Prompted Benchmarks Are Surprisingly Fragile , author =. 2026 , eprint =
2026
-
[5]
2024 , url =
Guiding Instruction-based Image Editing via Multimodal Large Language Models , author =. 2024 , url =
2024
-
[6]
2022 , eprint =
Prompt-to-Prompt Image Editing with Cross Attention Control , author =. 2022 , eprint =
2022
-
[7]
ICLR 2025 Workshop on Building Trust in Language Models and Applications , year =
VideoJail: Exploiting Video-Modality Vulnerabilities for Jailbreak Attacks on Multimodal Large Language Models , author =. ICLR 2025 Workshop on Building Trust in Language Models and Applications , year =
2025
-
[8]
2017 , doi =
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , author =. 2017 , doi =
2017
-
[9]
2025 , eprint =
Formalizing and Benchmarking Prompt Injection Attacks and Defenses , author =. 2025 , eprint =
2025
-
[10]
2024 , eprint =
MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models , author =. 2024 , eprint =
2024
-
[11]
2024 , eprint =
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal , author =. 2024 , eprint =
2024
-
[12]
2022 , eprint =
SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations , author =. 2022 , eprint =
2022
-
[13]
2024 , eprint =
T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models , author =. 2024 , eprint =
2024
-
[14]
2024 , eprint =
Towards Understanding Unsafe Video Generation , author =. 2024 , eprint =
2024
-
[15]
2022 , eprint =
Ignore Previous Prompt: Attack Techniques For Language Models , author =. 2022 , eprint =
2022
-
[16]
2024 , doi =
Visual Adversarial Examples Jailbreak Aligned Large Language Models , author =. 2024 , doi =
2024
-
[17]
2025 , eprint =
Qwen-Image Technical Report , author =. 2025 , eprint =
2025
-
[18]
2025 , eprint=
RunawayEvil: Jailbreaking the Image-to-Video Generative Models , author=. 2025 , eprint=
2025
-
[19]
2026 , eprint =
When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models , author =. 2026 , eprint =
2026
-
[20]
2025 , eprint =
VP-Bench: A Comprehensive Benchmark for Visual Prompting in Multimodal Large Language Models , author =. 2025 , eprint =
2025
-
[21]
2026 , eprint=
VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models , author=. 2026 , eprint=
2026
-
[22]
2016 , doi =
Visual7W: Grounded Question Answering in Images , author =. 2016 , doi =
2016
-
[23]
2024 , eprint=
Perception Tokens Enhance Visual Reasoning in Multimodal Language Models , author=. 2024 , eprint=
2024
-
[24]
2023 , eprint=
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets , author=. 2023 , eprint=
2023
-
[25]
2024 , eprint=
MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark , author=. 2024 , eprint=
2024
-
[26]
2024 , eprint=
Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations , author=. 2024 , eprint=
2024
-
[27]
2025 , eprint=
In-Video Instructions: Visual Signals as Generative Control , author=. 2025 , eprint=
2025
-
[28]
2024 , eprint=
BLINK: Multimodal Large Language Models Can See but Not Perceive , author=. 2024 , eprint=
2024
-
[29]
2025 , eprint=
FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts , author=. 2025 , eprint=
2025
-
[30]
Findings of the Association for Computational Linguistics: ACL 2024 , pages=
Visual hallucinations of multi-modal large language models , author=. Findings of the Association for Computational Linguistics: ACL 2024 , pages=
2024
-
[31]
2025 , url=
Kling v2.5 Turbo: Video Generation Model , author=. 2025 , url=
2025
-
[32]
2025 , eprint=
Jailbreaking on Text-to-Video Models via Scene Splitting Strategy , author=. 2025 , eprint=
2025
-
[33]
2025 , eprint=
MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance , author=. 2025 , eprint=
2025
-
[34]
2025 , eprint=
T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models , author=. 2025 , eprint=
2025
-
[35]
2024 , eprint=
VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning , author=. 2024 , eprint=
2024
-
[36]
2025 , eprint=
T2V-OptJail: Discrete Prompt Optimization for Text-to-Video Jailbreak Attacks , author=. 2025 , eprint=
2025
-
[37]
2025 , eprint=
ConceptGuard: Proactive Safety in Text-and-Image-to-Video Generation through Multimodal Risk Detection , author=. 2025 , eprint=
2025
-
[38]
2026 , eprint=
Controllable Video Generation: A Survey , author=. 2026 , eprint=
2026
-
[39]
IEEE Global Communications Conference (GLOBECOM) , pages=
PromptNeedling: Jailbreaking Text-to-Video Generative Models , author=. IEEE Global Communications Conference (GLOBECOM) , pages=
-
[40]
2025 , eprint=
Modular-Cam: Modular Dynamic Camera-view Video Generation with LLM , author=. 2025 , eprint=
2025
-
[41]
2025 , url=
PixVerse: AI Video Generator , author=. 2025 , url=
2025
-
[42]
Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security , year=
UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images , author=. Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security , year=. 2405.03486 , archivePrefix=
arXiv 2025
-
[43]
2025 , eprint=
Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model , author=. 2025 , eprint=
2025
-
[44]
2025 , url=
Veo 3.1 , author=. 2025 , url=
2025
-
[45]
2025 , eprint=
Wan: Open and Advanced Large-Scale Video Generative Models , author=. 2025 , eprint=
2025
-
[46]
2026 , month = apr, day =
2026
-
[47]
Advances in Neural Information Processing Systems , year=
VideoComposer: Compositional Video Synthesis with Motion Controllability , author=. Advances in Neural Information Processing Systems , year=. 2306.02018 , archivePrefix=
-
[48]
2025 , eprint=
Video models are zero-shot learners and reasoners , author=. 2025 , eprint=
2025
-
[49]
2025 , eprint=
NSFW-Classifier Guided Prompt Sanitization for Safe Text-to-Image Generation , author=. 2025 , eprint=
2025
-
[50]
2025 , eprint=
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer , author=. 2025 , eprint=
2025
-
[51]
2025 , eprint=
Distraction is All You Need for Multimodal Large Language Model Jailbreaking , author=. 2025 , eprint=
2025
-
[52]
2024 , eprint=
Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt , author=. 2024 , eprint=
2024
-
[53]
2026 , eprint=
SPARK: Jailbreaking T2V Models by Synergistically Prompting Auditory and Recontextualized Knowledge , author=. 2026 , eprint=
2026
-
[54]
2025 , eprint=
SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation , author=. 2025 , eprint=
2025
-
[55]
2023 , eprint=
I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models , author=. 2023 , eprint=
2023
-
[56]
2025 , month = dec, howpublished =
Kling AI Launches Video 2.6 Model with ``Simultaneous Audio-Visual Generation'' Capability, Redefining AI Video Creation Workflow , author =. 2025 , month = dec, howpublished =
2025
-
[57]
2025 , month = oct, howpublished =
MiniMax Hailuo 2.3: A New Level of Complex Video Performance and Media Agent , author =. 2025 , month = oct, howpublished =
2025
-
[58]
2026 , month = jan, howpublished =
Veo 3.1 Ingredients to Video: More Consistency, Creativity and Control , author =. 2026 , month = jan, howpublished =
2026
-
[59]
arXiv preprint arXiv:2503.00948 , year=
Extrapolating and Decoupling Image-to-Video Generation Models: Motion Modeling is Easier Than You Think , author=. arXiv preprint arXiv:2503.00948 , year=
-
[60]
2024 , eprint=
ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation , author=. 2024 , eprint=
2024
-
[61]
2025 , eprint=
Visual Prompting for One-shot Controllable Video Editing without Inversion , author=. 2025 , eprint=
2025
-
[62]
Advances in Neural Information Processing Systems , volume=
Video Diffusion Models are Training-free Motion Interpreter and Controller , author=. Advances in Neural Information Processing Systems , volume=
-
[63]
arXiv preprint arXiv:2312.02216 , year=
DragVideo: Interactive Drag-style Video Editing , author=. arXiv preprint arXiv:2312.02216 , year=
-
[64]
ICML , year=
Stair: Improving safety alignment with introspective reasoning , author=. ICML , year=
-
[65]
ICML , year=
Weak-to-strong jailbreaking on large language models , author=. ICML , year=
-
[66]
arXiv preprint arXiv:2506.07891 , year=
Video Unlearning via Low-Rank Refusal Vector , author=. arXiv preprint arXiv:2506.07891 , year=
-
[67]
2026 , eprint=
Vid-Freeze: Protecting Images from Malicious Image-to-Video Generation via Temporal Freezing , author=. 2026 , eprint=
2026
-
[68]
arXiv preprint arXiv:2402.13126 , year=
VGMShield: Mitigating Misuse of Video Generative Models , author=. arXiv preprint arXiv:2402.13126 , year=
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.