pith. sign in

arxiv: 2606.25592 · v1 · pith:72GTMEQOnew · submitted 2026-06-24 · 💻 cs.CV

VPA-Guard: Defending and Benchmarking Image-to-Video Generation Against Visual Prompt Attacks

Pith reviewed 2026-06-25 21:08 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual prompt attacksimage-to-video generationsafety benchmarkdefense frameworkmultimodal jailbreaksVVA-BenchVPA-Guardretrieval-augmented defense
0
0 comments X

The pith

Image-to-video models interpret visual cues in input images as instructions that can trigger harmful video content.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that static elements such as arrows, sketches, and emojis in input images can function as temporal commands that cause image-to-video generators to produce unwanted or harmful sequences. It builds VVA-Bench to categorize these vision-centric attacks and runs tests that expose high vulnerability across current models. The authors then introduce VPA-Guard, which retrieves similar past examples and applies few-shot reasoning to spot hidden malicious intent before generation occurs. This combination matters because it shows a previously unmeasured safety gap in controllable video synthesis and supplies both measurement tools and a practical countermeasure that preserves normal editing performance.

Core claim

The central claim is that visual prompts in images act as executable instructions for image-to-video models, enabling attacks that unfold into harmful actions. VVA-Bench provides the first categorized collection of such attacks and shows that leading systems reach attack success rates up to 100 percent. VPA-Guard counters this by retrieving relevant examples and using few-shot reasoning to detect latent harmful intent, lowering average attack success by 44.2 percent and harmfulness scores by 73.4 percent while keeping utility for legitimate edits intact.

What carries the argument

VPA-Guard, a retrieval-augmented self-evolving defense that applies few-shot reasoning on retrieved examples to identify latent malicious intents in visual prompts.

If this is right

  • Standardized testing with VVA-Bench can quantify safety across additional image-to-video systems.
  • Retrieval-augmented reasoning of the type used in VPA-Guard can be applied at generation time without removing the controllability that visual prompts provide.
  • Models equipped with the defense maintain their ability to follow user-directed edits that do not contain hidden harmful instructions.
  • The measured reductions in attack success and harmfulness scores hold across the models evaluated in the benchmark.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar retrieval-based detection could be tested on text-to-video or audio-to-video pipelines that also accept visual conditioning.
  • Attack categories may shift over time, requiring periodic expansion of the benchmark to keep the defense effective.
  • Interface designers could expose the guard as an optional layer so users retain full control when they want it.
  • The same few-shot pattern might help surface other subtle misalignments between intended and generated motion in controllable generators.

Load-bearing premise

The attacks collected and categorized in VVA-Bench stand in for the real-world visual prompt threats that models will encounter, and few-shot reasoning on retrieved examples can reliably separate malicious from benign intents.

What would settle it

A new collection of visual prompt attacks drawn from outside the VVA-Bench categories that still achieves high success rates against models protected by VPA-Guard, or a test showing that the defense blocks a large fraction of ordinary legitimate image edits.

Figures

Figures reproduced from arXiv: 2606.25592 by Alex Jinpeng Wang, Danyang Zhang, Fangming Liu, Haochen Han, Haoyu Kang, Heng Zhang, Jiajun Wu, Wai Kin Victor Chan, Yining Sun, Zhenjun Zhao.

Figure 1
Figure 1. Figure 1: Overview and statistics of VVA-Bench. Left panels showcase representative examples of five VP mechanisms and 2frames transitions. Right panels provide quantitative distributions of attack forms, subtypes, and seven major safety categories, highlighting the balanced and multi-dimensional nature of our dataset. affect the safety of video generation across different models. To address this gap, we introduce V… view at source ↗
Figure 2
Figure 2. Figure 2: Comparison between traditional text-prompt attack and visual-prompt attack. Top: Samples of attack scheme com￾parison. Bottom: Performance comparison. Visual prompt attack poses a higher threat to video models. An image-to-video (I2V) generator M(I, T) → V generates video V with input image I and input text T. Unlike the tradi￾tional jailbreak attack methods inject malicious information into the text promp… view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the proposed defense pipeline. 1. We select and store 42 representative attack samples with explicit rationales to prerequisite knowledge before safe generation. 2. The input image and text are encoded together and then used to extract their attack intent and retrieve similar attack patterns. 3. The retrieved samples and their rationales will be used as few-shot prompt to ensure safer generatio… view at source ↗
Figure 4
Figure 4. Figure 4: Overview of the proposed VVA-Bench pipeline. The workflow illustrates benchmark construction, the two attack forms, and the four-stage rating protocol utilized for safety evaluation. Structure Attack Construction: The clean sources are then transformed into unsafe attack inputs through the addition of either a VP layer or a 2frames temporal structure. For VP, the perturbation assumes the form of arrows, em… view at source ↗
Figure 5
Figure 5. Figure 5: HV Distribution across models. Here HV = 0 stands for the sample is rejected by the safety mechanism. B. Additional Results B.1. Sensitivity Analysis We rewrite the prompt in a small subset of our VVA-Bench with two-level expressions including direct explicit expression and indirect referential expression, then test on five different I2V models. Detailed prompts are shown in Section D.4.3. The results in … view at source ↗
Figure 6
Figure 6. Figure 6: Showcase of the inputs and outputs of all 6 attack formats in our constructed VVA-Bench 22 [PITH_FULL_IMAGE:figures/full_fig_p022_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Showcase of failure cases in VVA-Bench. 23 [PITH_FULL_IMAGE:figures/full_fig_p023_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Interactive Interface provided for Human Annotators 30 [PITH_FULL_IMAGE:figures/full_fig_p030_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Interactive Interface provided for Human Judges 31 [PITH_FULL_IMAGE:figures/full_fig_p031_9.png] view at source ↗
read the original abstract

Recent advancements in Image-to-Video (I2V) generation have transformed input images from simple appearance references into interactive control interfaces where visual cues such as arrows, sketches, and emojis orchestrate complex video dynamics with unprecedented controllability. However, these seemingly innocuous static cues can be interpreted by models as executable temporal instructions, unfolding into harmful actions in the generated videos. Despite the severity of this threat, existing safety benchmarks remain predominantly focused on text-based and content-only image-based jailbreaks, leaving implicit visual prompt attacks insufficiently explored. To bridge this gap, we present VVA-Bench, the first systematic benchmark for evaluating video generation safety under categorized vision-centric prompt attacks. Extensive experiments on VVA-Bench demonstrate that state-of-the-art models are highly susceptible to such attacks, with Attack Success Rates (ASR) reaching 100.0\% on Wan 2.7 and 74.8\% on Veo 3.1. To mitigate these risks, we propose VPA-Guard, a retrieval-augmented and self-evolving defense framework. By leveraging few-shot reasoning to identify latent malicious intents, our method reduces the attack ASR by 44.2\% and the harmfulness score by 73.4\% on average, while maintaining the model's utility for legitimate user edits. Our work provides both a rigorous benchmark and an effective defense strategy to advance safe and socially responsible multimodal generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces VVA-Bench as the first benchmark for categorized vision-centric prompt attacks on image-to-video (I2V) generation models and proposes VPA-Guard, a retrieval-augmented self-evolving defense using few-shot reasoning to detect latent malicious intents. It reports that state-of-the-art I2V models are highly susceptible, with ASR reaching 100.0% on Wan 2.7 and 74.8% on Veo 3.1, and claims that VPA-Guard reduces average ASR by 44.2% and harmfulness score by 73.4% while preserving utility on legitimate edits.

Significance. If the benchmark construction is representative and the defense mechanism is validated, the work fills a gap in safety evaluation for multimodal generative models by shifting focus from text-based or content-only image jailbreaks to implicit visual prompt attacks that models interpret as temporal instructions. The combination of a new benchmark and a practical defense strategy would be a meaningful contribution to responsible development of I2V systems.

major comments (2)
  1. [Abstract] Abstract: the central quantitative claims (ASR values, 44.2% reduction, 73.4% harmfulness drop) are stated without any description of the experimental protocol, VVA-Bench construction method, attack taxonomy validation, or metrics for benign-edit utility, rendering the support for the susceptibility and mitigation results impossible to assess.
  2. [VPA-Guard framework] VPA-Guard description: the retrieval index construction, self-evolving mechanism, and few-shot reasoning procedure are not specified, nor are any ablation studies or benign-prompt performance metrics provided; these omissions directly undermine evaluation of the load-bearing assumption that the defense reliably flags malicious intents without degrading legitimate edits.
minor comments (1)
  1. [Abstract] The abstract would benefit from stating the number of attack categories and total prompts in VVA-Bench to give immediate context for the reported ASR figures.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting areas where additional clarity is needed. We will revise the manuscript to incorporate more detail in the abstract and expand the VPA-Guard description with the requested specifications, ablations, and metrics.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central quantitative claims (ASR values, 44.2% reduction, 73.4% harmfulness drop) are stated without any description of the experimental protocol, VVA-Bench construction method, attack taxonomy validation, or metrics for benign-edit utility, rendering the support for the susceptibility and mitigation results impossible to assess.

    Authors: We agree that the abstract is a high-level summary and does not detail the experimental protocol, VVA-Bench construction, taxonomy validation, or benign-edit utility metrics. These elements are described in the main body (Sections 3 and 4). To address the concern, we will revise the abstract to briefly reference the benchmark construction approach, evaluation protocol, and utility metrics while remaining within length limits. revision: yes

  2. Referee: [VPA-Guard framework] VPA-Guard description: the retrieval index construction, self-evolving mechanism, and few-shot reasoning procedure are not specified, nor are any ablation studies or benign-prompt performance metrics provided; these omissions directly undermine evaluation of the load-bearing assumption that the defense reliably flags malicious intents without degrading legitimate edits.

    Authors: We acknowledge that the VPA-Guard section currently lacks explicit details on retrieval index construction, the self-evolving mechanism, and the few-shot reasoning procedure, and does not include ablation studies or benign-prompt metrics. In the revision we will expand the framework description to specify these components, add ablation studies, and report benign-prompt performance results to directly validate that malicious intent detection does not degrade legitimate edits. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical benchmark and defense evaluation

full rationale

The paper presents VVA-Bench as a new benchmark and VPA-Guard as a retrieval-augmented few-shot defense, reporting empirical ASR and harmfulness metrics on external models (Wan 2.7, Veo 3.1). No equations, parameter fits, or derivation steps exist that could reduce to self-defined quantities or self-citations. All claims rest on external experimental outcomes rather than internal construction, satisfying the self-contained empirical case with score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Work rests on standard domain assumptions in AI safety evaluation; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Visual cues such as arrows, sketches, and emojis can function as executable temporal instructions in I2V models
    This premise underpins the threat model and the need for VVA-Bench.

pith-pipeline@v0.9.1-grok · 5817 in / 1238 out tokens · 26671 ms · 2026-06-25T21:08:17.779657+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

68 extracted references

  1. [1]

    2024 , eprint =

    Image Hijacks: Adversarial Images can Control Generative Models at Runtime , author =. 2024 , eprint =

  2. [2]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Instructpix2pix: Learning to follow image editing instructions , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  3. [3]

    Advances in Neural Information Processing Systems , volume=

    Jailbreakbench: An open robustness benchmark for jailbreaking large language models , author=. Advances in Neural Information Processing Systems , volume=

  4. [4]

    2026 , eprint =

    Visually Prompted Benchmarks Are Surprisingly Fragile , author =. 2026 , eprint =

  5. [5]

    2024 , url =

    Guiding Instruction-based Image Editing via Multimodal Large Language Models , author =. 2024 , url =

  6. [6]

    2022 , eprint =

    Prompt-to-Prompt Image Editing with Cross Attention Control , author =. 2022 , eprint =

  7. [7]

    ICLR 2025 Workshop on Building Trust in Language Models and Applications , year =

    VideoJail: Exploiting Video-Modality Vulnerabilities for Jailbreak Attacks on Multimodal Large Language Models , author =. ICLR 2025 Workshop on Building Trust in Language Models and Applications , year =

  8. [8]

    2017 , doi =

    Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , author =. 2017 , doi =

  9. [9]

    2025 , eprint =

    Formalizing and Benchmarking Prompt Injection Attacks and Defenses , author =. 2025 , eprint =

  10. [10]

    2024 , eprint =

    MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models , author =. 2024 , eprint =

  11. [11]

    2024 , eprint =

    HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal , author =. 2024 , eprint =

  12. [12]

    2022 , eprint =

    SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations , author =. 2022 , eprint =

  13. [13]

    2024 , eprint =

    T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models , author =. 2024 , eprint =

  14. [14]

    2024 , eprint =

    Towards Understanding Unsafe Video Generation , author =. 2024 , eprint =

  15. [15]

    2022 , eprint =

    Ignore Previous Prompt: Attack Techniques For Language Models , author =. 2022 , eprint =

  16. [16]

    2024 , doi =

    Visual Adversarial Examples Jailbreak Aligned Large Language Models , author =. 2024 , doi =

  17. [17]

    2025 , eprint =

    Qwen-Image Technical Report , author =. 2025 , eprint =

  18. [18]

    2025 , eprint=

    RunawayEvil: Jailbreaking the Image-to-Video Generative Models , author=. 2025 , eprint=

  19. [19]

    2026 , eprint =

    When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models , author =. 2026 , eprint =

  20. [20]

    2025 , eprint =

    VP-Bench: A Comprehensive Benchmark for Visual Prompting in Multimodal Large Language Models , author =. 2025 , eprint =

  21. [21]

    2026 , eprint=

    VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models , author=. 2026 , eprint=

  22. [22]

    2016 , doi =

    Visual7W: Grounded Question Answering in Images , author =. 2016 , doi =

  23. [23]

    2024 , eprint=

    Perception Tokens Enhance Visual Reasoning in Multimodal Language Models , author=. 2024 , eprint=

  24. [24]

    2023 , eprint=

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets , author=. 2023 , eprint=

  25. [25]

    2024 , eprint=

    MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark , author=. 2024 , eprint=

  26. [26]

    2024 , eprint=

    Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations , author=. 2024 , eprint=

  27. [27]

    2025 , eprint=

    In-Video Instructions: Visual Signals as Generative Control , author=. 2025 , eprint=

  28. [28]

    2024 , eprint=

    BLINK: Multimodal Large Language Models Can See but Not Perceive , author=. 2024 , eprint=

  29. [29]

    2025 , eprint=

    FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts , author=. 2025 , eprint=

  30. [30]

    Findings of the Association for Computational Linguistics: ACL 2024 , pages=

    Visual hallucinations of multi-modal large language models , author=. Findings of the Association for Computational Linguistics: ACL 2024 , pages=

  31. [31]

    2025 , url=

    Kling v2.5 Turbo: Video Generation Model , author=. 2025 , url=

  32. [32]

    2025 , eprint=

    Jailbreaking on Text-to-Video Models via Scene Splitting Strategy , author=. 2025 , eprint=

  33. [33]

    2025 , eprint=

    MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance , author=. 2025 , eprint=

  34. [34]

    2025 , eprint=

    T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models , author=. 2025 , eprint=

  35. [35]

    2024 , eprint=

    VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning , author=. 2024 , eprint=

  36. [36]

    2025 , eprint=

    T2V-OptJail: Discrete Prompt Optimization for Text-to-Video Jailbreak Attacks , author=. 2025 , eprint=

  37. [37]

    2025 , eprint=

    ConceptGuard: Proactive Safety in Text-and-Image-to-Video Generation through Multimodal Risk Detection , author=. 2025 , eprint=

  38. [38]

    2026 , eprint=

    Controllable Video Generation: A Survey , author=. 2026 , eprint=

  39. [39]

    IEEE Global Communications Conference (GLOBECOM) , pages=

    PromptNeedling: Jailbreaking Text-to-Video Generative Models , author=. IEEE Global Communications Conference (GLOBECOM) , pages=

  40. [40]

    2025 , eprint=

    Modular-Cam: Modular Dynamic Camera-view Video Generation with LLM , author=. 2025 , eprint=

  41. [41]

    2025 , url=

    PixVerse: AI Video Generator , author=. 2025 , url=

  42. [42]

    Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security , year=

    UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images , author=. Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security , year=. 2405.03486 , archivePrefix=

  43. [43]

    2025 , eprint=

    Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model , author=. 2025 , eprint=

  44. [44]

    2025 , url=

    Veo 3.1 , author=. 2025 , url=

  45. [45]

    2025 , eprint=

    Wan: Open and Advanced Large-Scale Video Generative Models , author=. 2025 , eprint=

  46. [46]

    2026 , month = apr, day =

  47. [47]

    Advances in Neural Information Processing Systems , year=

    VideoComposer: Compositional Video Synthesis with Motion Controllability , author=. Advances in Neural Information Processing Systems , year=. 2306.02018 , archivePrefix=

  48. [48]

    2025 , eprint=

    Video models are zero-shot learners and reasoners , author=. 2025 , eprint=

  49. [49]

    2025 , eprint=

    NSFW-Classifier Guided Prompt Sanitization for Safe Text-to-Image Generation , author=. 2025 , eprint=

  50. [50]

    2025 , eprint=

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer , author=. 2025 , eprint=

  51. [51]

    2025 , eprint=

    Distraction is All You Need for Multimodal Large Language Model Jailbreaking , author=. 2025 , eprint=

  52. [52]

    2024 , eprint=

    Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt , author=. 2024 , eprint=

  53. [53]

    2026 , eprint=

    SPARK: Jailbreaking T2V Models by Synergistically Prompting Auditory and Recontextualized Knowledge , author=. 2026 , eprint=

  54. [54]

    2025 , eprint=

    SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation , author=. 2025 , eprint=

  55. [55]

    2023 , eprint=

    I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models , author=. 2023 , eprint=

  56. [56]

    2025 , month = dec, howpublished =

    Kling AI Launches Video 2.6 Model with ``Simultaneous Audio-Visual Generation'' Capability, Redefining AI Video Creation Workflow , author =. 2025 , month = dec, howpublished =

  57. [57]

    2025 , month = oct, howpublished =

    MiniMax Hailuo 2.3: A New Level of Complex Video Performance and Media Agent , author =. 2025 , month = oct, howpublished =

  58. [58]

    2026 , month = jan, howpublished =

    Veo 3.1 Ingredients to Video: More Consistency, Creativity and Control , author =. 2026 , month = jan, howpublished =

  59. [59]

    arXiv preprint arXiv:2503.00948 , year=

    Extrapolating and Decoupling Image-to-Video Generation Models: Motion Modeling is Easier Than You Think , author=. arXiv preprint arXiv:2503.00948 , year=

  60. [60]

    2024 , eprint=

    ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation , author=. 2024 , eprint=

  61. [61]

    2025 , eprint=

    Visual Prompting for One-shot Controllable Video Editing without Inversion , author=. 2025 , eprint=

  62. [62]

    Advances in Neural Information Processing Systems , volume=

    Video Diffusion Models are Training-free Motion Interpreter and Controller , author=. Advances in Neural Information Processing Systems , volume=

  63. [63]

    arXiv preprint arXiv:2312.02216 , year=

    DragVideo: Interactive Drag-style Video Editing , author=. arXiv preprint arXiv:2312.02216 , year=

  64. [64]

    ICML , year=

    Stair: Improving safety alignment with introspective reasoning , author=. ICML , year=

  65. [65]

    ICML , year=

    Weak-to-strong jailbreaking on large language models , author=. ICML , year=

  66. [66]

    arXiv preprint arXiv:2506.07891 , year=

    Video Unlearning via Low-Rank Refusal Vector , author=. arXiv preprint arXiv:2506.07891 , year=

  67. [67]

    2026 , eprint=

    Vid-Freeze: Protecting Images from Malicious Image-to-Video Generation via Temporal Freezing , author=. 2026 , eprint=

  68. [68]

    arXiv preprint arXiv:2402.13126 , year=

    VGMShield: Mitigating Misuse of Video Generative Models , author=. arXiv preprint arXiv:2402.13126 , year=