pith. sign in

arxiv: 2605.16090 · v1 · pith:CODUHPBNnew · submitted 2026-05-15 · 💻 cs.CR · cs.CV

A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation

Pith reviewed 2026-05-20 17:16 UTC · model grok-4.3

classification 💻 cs.CR cs.CV
keywords cross-modal prompt injectionlarge vision-language modelsimage perturbationadversarial attackmultimodal securityhidden state optimizationprompt injection
0
0 comments X

The pith

Perturbing only an image can steer how large vision-language models interpret both pictures and text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CrossMPI, an attack that achieves cross-modal prompt injection by changing the image input alone. It shifts the optimization target from the visual embedding space to the model's much larger hidden state space, where multimodal information integrates. The method selects middle layers for the perturbation because they prove critical for this integration, and applies a distance-decremental budget to limit pixel changes away from semantic regions. A sympathetic reader would care because prior attacks either stay within one modality or fail to transfer effects across modalities, leaving a gap in understanding the full attack surface of deployed LVLMs.

Core claim

CrossMPI steers the model's interpretation of both textual and visual inputs via image-only prompt injection by moving optimization to the hidden state space, selecting middle layers as the optimal location for multimodal integration, and using a distance-decremental perturbation budget assignment to constrain the image space.

What carries the argument

Hidden-state optimization at middle layers with distance-decremental budget assignment, which moves the attack focus to the larger parameter space responsible for fusing image and text information.

If this is right

  • The attack succeeds across multiple LVLMs and datasets with higher rates than single-modality baselines.
  • Middle layers outperform last layers for this type of hidden-state perturbation in LVLMs.
  • The budget strategy constrains the optimization space while preserving attack performance.
  • Image-only changes can produce effects on textual interpretation that were previously thought to require text input changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Defenses could focus monitoring or regularization on middle-layer hidden states rather than input embeddings alone.
  • The layer preference suggests multimodal fusion occurs primarily in the middle of these transformer stacks.
  • Similar image-only cross-modal techniques might extend to other paired modalities such as audio-text models.
  • Security audits of LVLMs should test for transfer of perturbations from one modality to another.

Load-bearing premise

That middle layers are the best location for perturbation and that the distance-decremental budget will keep image changes small enough to remain effective while achieving cross-modal steering.

What would settle it

An experiment that applies the same image perturbation but targets the final layers instead of middle layers and measures whether cross-modal prompt injection success rate drops sharply.

Figures

Figures reproduced from arXiv: 2605.16090 by Guancheng Wang, Hao Yang, Jianfeng Ma, Yang Liu, Yilong Yang, Zhuo Ma.

Figure 1
Figure 1. Figure 1: Comparison of prompt injection paradigms in [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Layer-wise hidden state classification accuracy un [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of our attack. CrossMPI consists of three parts: (1) fusion-critical layer selection, which localizes the layers where visual evidence and textual intent are integrated; (2) perturbation budget assignment, which uses Grad-ECLIP saliency and distance-decremental weighting to construct a perturbation budget mask for budget reallocation; (3) Cross-Modal Perturbation Optimization, which jointly optimi… view at source ↗
Figure 4
Figure 4. Figure 4: Heatmap of CrossMPI across LVLMs. Perturbations are optimized on MiniGPT4-llama2, MiniGPT4-vicuna, Instruct￾BLIP, and Qwen2-VL, and then evaluated on target models. the weakest overall, achieving only 4.41% average ASR. Although it embeds malicious textual instructions into the image and uses prompts that encourage the model to read image text, this strat￾egy does not reliably trigger the desired cross-mod… view at source ↗
Figure 5
Figure 5. Figure 5: Ablation of perturbation budget and optimized [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: Layer-wise hidden state classification accuracy un [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Layer-wise hidden state classification accuracy un [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Visual comparison of attacked images optimized by different attacks. [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Visualization of perturbation budget masks under different values of [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗
read the original abstract

Large vision-language models (LVLMs) have emerged as a powerful paradigm for multimodal intelligence, but their growing deployment also expands the attack surface of prompt injection. Despite this growing concern, existing attacks still suffer from a critical limitation: the injected prompt for one modality only steers the model's interpretation of that singular input. Alternatively, these attacks remain multimodal but fail to achieve cross-modal prompt perturbation. To bridge this gap, we introduce a novel cross-modal prompt injection attack CrossMPI, which can steer the model's interpretation of both textual and visual inputs via image-only prompt injection. Our design is underpinned by the following key breakthroughs. First, we turn the focus of the injected prompt perturbation optimization from the visual embedding space (typically with only $10^5$ parameters) to the model hidden state space (for multimodal information integration and with $10^7$ parameters). Then, two strategies are adopted to mitigate the optimization challenges posed by the larger parameter space. To constrain the optimized model parameter space, we introduce a layer selection strategy that identifies the layers most critical to multimodal integration. Interestingly, deviating from the past experience, our analysis reveals that the optimal layers for LVLM prompt perturbation reside in the middle of the model rather than the last. To constrain the image perturbation space, we propose a new distance-decremental perturbation budget assignment strategy that allocates budgets decrementally as the pixel distance to semantic-critical regions increases. Extensive experiments across multiple LVLMs and datasets show that our method significantly outperforms baseline approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CrossMPI, a cross-modal prompt injection attack on large vision-language models (LVLMs) that uses image-only perturbations to steer the model's interpretation of both visual and textual inputs. It shifts optimization focus from visual embedding space to hidden state space at selected middle layers (identified via analysis as optimal for multimodal integration, contrary to prior last-layer focus), and applies a distance-decremental perturbation budget to constrain the image space while targeting semantic regions. Extensive experiments across multiple LVLMs and datasets are reported to show significant outperformance over baselines.

Significance. If the empirical results hold with supporting ablations and metrics, the work would be significant for demonstrating a new attack surface in LVLMs where single-modality (image) perturbations can achieve cross-modal effects on text interpretation. The layer selection insight and budget strategy provide concrete techniques that could guide both attacks and defenses in multimodal security. Credit is due for the broad experimental scope across models and datasets, which strengthens the practical relevance of the findings.

major comments (2)
  1. [Analysis of optimal layers] Analysis section on layer selection: the claim that middle layers are optimal for multimodal integration and yield superior cross-modal steering is load-bearing for the novelty, yet the manuscript provides no ablation comparing middle-layer hidden-state targeting against last-layer targeting specifically on textual interpretation metrics (e.g., success rate in altering responses to the text prompt). Without this, it remains unclear whether the cross-modal effect exceeds what simpler last-layer attacks achieve.
  2. [Perturbation budget strategy] Section describing the distance-decremental budget assignment: the strategy is presented as constraining the perturbation space while preserving attack success, but no quantitative comparison to uniform budget allocation or ablation on textual vs. visual task performance is shown. This assumption is central to feasibility of the larger hidden-state optimization and requires explicit validation to support the cross-modal claim.
minor comments (2)
  1. [Abstract] The abstract summarizes outperformance but omits any numerical results, baseline names, or statistical details; including at least one key metric (e.g., attack success rate improvement) would improve clarity for readers.
  2. [Method] Notation for the hidden-state optimization objective and the exact definition of the distance-decremental schedule could be formalized with an equation to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate additional ablations that directly respond to the concerns raised.

read point-by-point responses
  1. Referee: [Analysis of optimal layers] Analysis section on layer selection: the claim that middle layers are optimal for multimodal integration and yield superior cross-modal steering is load-bearing for the novelty, yet the manuscript provides no ablation comparing middle-layer hidden-state targeting against last-layer targeting specifically on textual interpretation metrics (e.g., success rate in altering responses to the text prompt). Without this, it remains unclear whether the cross-modal effect exceeds what simpler last-layer attacks achieve.

    Authors: We acknowledge that our layer selection analysis focused on the impact of different layers on multimodal integration but did not include a direct ablation of attack success rates on textual interpretation metrics when comparing middle-layer versus last-layer hidden-state targeting. To strengthen the evidence for the cross-modal advantage, we will add this specific ablation in the revised manuscript, reporting success rates for altering text-prompt responses under both targeting strategies. revision: yes

  2. Referee: [Perturbation budget strategy] Section describing the distance-decremental budget assignment: the strategy is presented as constraining the perturbation space while preserving attack success, but no quantitative comparison to uniform budget allocation or ablation on textual vs. visual task performance is shown. This assumption is central to feasibility of the larger hidden-state optimization and requires explicit validation to support the cross-modal claim.

    Authors: We agree that explicit validation of the distance-decremental budget is needed. We will add quantitative comparisons of attack performance under the distance-decremental strategy versus uniform budget allocation, including separate metrics for visual and textual task success. These results will be included in the revised version to demonstrate how the strategy supports effective optimization in the hidden-state space. revision: yes

Circularity Check

0 steps flagged

Empirical layer selection and budget strategy are data-driven with no definitional reduction

full rationale

The paper's core contributions rest on empirical analysis to select middle layers for hidden-state perturbation and a distance-decremental budget for image constraints. These choices are presented as outcomes of their own investigation rather than inputs that are redefined as predictions. No equations, fitted parameters, or self-citations are shown to force the claimed cross-modal success by construction. Experiments on multiple LVLMs and datasets provide external validation, keeping the derivation self-contained against benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 1 invented entities

The method rests on empirical identification of middle layers and a heuristic budget allocation rule rather than closed-form derivation; no new physical entities are postulated.

free parameters (2)
  • Layer selection rule
    Identifies middle layers as critical for multimodal integration based on internal analysis.
  • Distance-decremental budget schedule
    Allocates perturbation magnitude according to pixel distance from semantic regions.
axioms (1)
  • domain assumption Middle layers are optimal for cross-modal prompt perturbation in LVLMs
    Stated as a finding that deviates from prior experience with last-layer perturbations.
invented entities (1)
  • CrossMPI attack procedure no independent evidence
    purpose: Achieve cross-modal steering via image-only hidden-state optimization
    Newly introduced method combining layer selection and distance-based budgeting.

pith-pipeline@v0.9.0 · 5818 in / 1268 out tokens · 101721 ms · 2026-05-20T17:16:14.959122+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · 6 internal anchors

  1. [1]

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Floren- cia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report.arXiv preprint arXiv:2303.08774 (2023)

  2. [2]

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al

  3. [3]

    Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems35 (2022), 23716–23736

  4. [4]

    Hezekiah J Branch, Jonathan Rodriguez Cefalu, Jeremy McHugh, Leyla Hujer, Aditya Bahl, Daniel del Castillo Iglesias, Ron Heichman, and Ramesh Darwishi

  5. [5]

    Evaluating the susceptibility of pre-trained language models via handcrafted adversarial examples.arXiv preprint arXiv:2209.02128(2022)

  6. [6]

    Xiaowen Cai, Daizong Liu, Xiaoye Qu, Xiang Fang, Jianfeng Dong, Keke Tang, Pan Zhou, Lichao Sun, and Wei Hu. 2025. Towards Building Model/Prompt- Transferable Attackers against Large Vision-Language Models. InThe Thirty- ninth Annual Conference on Neural Information Processing Systems

  7. [7]

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. 2024. Sharegpt4v: Improving large multi-modal models with better captions. InEuropean Conference on Computer Vision. Springer, 370– 387

  8. [8]

    Xiangxiang Chen, Peixin Zhang, Jun Sun, Wenhai Wang, and Jingyi Wang. 2025. Rounding-Guided Backdoor Injection in Deep Learning Model Quantization. arXiv preprint arXiv:2510.09647(2025)

  9. [9]

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. 2023. Instructblip: Towards general-purpose vision-language models with instruction tuning.Advances in neural information processing systems36 (2023), 49250–49267

  10. [10]

    Iuri Frosio and Jan Kautz. 2023. The best defense is a good offense: Adver- sarial augmentation against adversarial attacks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 4067–4076

  11. [11]

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh

  12. [12]

    InProceedings of the IEEE conference on computer vision and pattern recognition

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering. InProceedings of the IEEE conference on computer vision and pattern recognition. 6904–6913

  13. [13]

    Chuan Guo, Mayank Rana, Moustapha Cisse, and Laurens van der Maaten. 2018. Countering Adversarial Images using Input Transformations. InInternational Conference on Learning Representations

  14. [14]

    Rich Harang. 2023. Securing LLM Systems Against Prompt Injection. https: //developer.nvidia.com/blog/securing-llm-systems-against-prompt-injection

  15. [15]

    Wenbo Hu, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, and Zhuowen Tu. 2024. Bliva: A simple multimodal llm for better handling of text-rich visual questions. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 2256–2264

  16. [16]

    Yuke Hu, Zheng Li, Zhihao Liu, Yang Zhang, Zhan Qin, Kui Ren, and Chun Chen

  17. [17]

    Membership inference attacks against vision-language models.arXiv preprint arXiv:2501.18624(2025)

  18. [18]

    Andrey Labunets, Nishit V Pandya, Ashish Hooda, Xiaohan Fu, and Earlence Fernandes. 2025. Fun-tuning: Characterizing the vulnerability of proprietary llms to optimization-based prompt injection attacks via the fine-tuning interface. In2025 IEEE Symposium on Security and Privacy (SP). IEEE, 411–429

  19. [19]

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning. PMLR, 19730–19742

  20. [20]

    Yanjie Li, Yiming Cao, Dong Wang, and Bin Xiao. 2025. AgentTypo: Adaptive Typographic Prompt Injection Attacks against Black-box Multimodal Agents. arXiv preprint arXiv:2510.04257(2025)

  21. [21]

    Zhan Li, Yongtao Wu, Yihang Chen, Francesco Tonin, Elias Abad Rocamora, and Volkan Cevher. 2024. Membership inference attacks against large vision- language models.Advances in Neural Information Processing Systems37 (2024), 98645–98674

  22. [22]

    Zhichao Li, Hongshan Yang, Zhibo Wang, Huiyu Xu, Junhong Lai, Yaopeng Wang, Kui Ren, and Chun Chen. [n. d.]. On Evaluating the Robustness of Large Vision- Language Models via Untargeted Modality Alignment Breaking Adversarial Attack. ([n. d.])

  23. [23]

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. InEuropean conference on computer vision. Springer, 740–755

  24. [24]

    Daizong Liu, Mingyu Yang, Xiaoye Qu, Pan Zhou, Xiang Fang, Keke Tang, Yao Wan, and Lichao Sun. 2024. Pandora’s box: Towards building universal attackers against real-world large vision-language models.Advances in Neural Information Processing Systems37 (2024), 52127–52158

  25. [25]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual in- struction tuning.Advances in neural information processing systems36 (2023), 34892–34916

  26. [26]

    Xiaogeng Liu, Zhiyuan Yu, Yizhe Zhang, Ning Zhang, and Chaowei Xiao. 2024. Automatic and universal prompt injection attacks against large language models. arXiv preprint arXiv:2403.04957(2024)

  27. [27]

    Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. 2024. Formalizing and benchmarking prompt injection attacks and defenses. In33rd USENIX Security Symposium (USENIX Security 24). 1831–1847

  28. [28]

    Zhaoyi Liu and Huan Zhang. 2025. Stealthy Backdoor Attack in Self-Supervised Learning Vision Encoders for Large Vision Language Models. InProceedings of the Computer Vision and Pattern Recognition Conference. 25060–25070

  29. [29]

    Haochen Luo, Jindong Gu, Fengyuan Liu, and Philip Torr. 2024. An image is worth 1000 lies: Adversarial transferability across prompts on vision-language models.arXiv preprint arXiv:2403.09766(2024)

  30. [30]

    Peizhuo Lv, Chang Yue, Ruigang Liang, Yunfei Yang, Shengzhi Zhang, Hualong Ma, and Kai Chen. 2023. A data-free backdoor injection approach in neural networks. In32nd USENIX Security Symposium (USENIX Security 23). 2671–2688

  31. [31]

    Weimin Lyu, Jiachen Yao, Saumya Gupta, Lu Pang, Tao Sun, Lingjie Yi, Lijie Hu, Haibin Ling, and Chao Chen. [n. d.]. Backdooring Vision-Language Models with Out-Of-Distribution Data. InThe Thirteenth International Conference on Learning Representations

  32. [32]

    Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, and Fazl Barez

  33. [33]

    InThe Thirteenth International Conference on Learning Representations

    Towards Interpreting Visual Information Processing in Vision-Language Models. InThe Thirteenth International Conference on Learning Representations

  34. [34]

    Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. 2022. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. InInternational Conference on Machine Learning. PMLR, 16784–16804

  35. [35]

    31st Aug 2020

    Nostalgebraist. 31st Aug 2020. Interpreting GPT: The logit lens. https://www.alignmentforum.org/posts/AcKRB8wDpdaN6v6ru/interpreting- gpt-the-logit-lens

  36. [36]

    OWASP. 2023. OWASP Top 10 for Large Language Model Applica- tions. https://owasp.org/www-project-top-10-for-large-language-model- applications/assets/PDF/OWASP-Top-10-for-LLMs-2023-v1_1.pdf

  37. [37]

    Dario Pasquini, Martin Strohmeier, and Carmela Troncoso. 2024. Neural exec: Learning (and learning from) execution triggers for prompt injection attacks. In Proceedings of the 2024 Workshop on Artificial Intelligence and Security. 89–100. Conference’17, July 2017, Washington, DC, USA Hao Yang, Zhuo Ma, Yang Liu, Yilong Yang, Guancheng Wang, and JianFeng Ma

  38. [38]

    Fábio Perez and Ian Ribeiro. 2022. Ignore previous prompt: Attack techniques for language models.arXiv preprint arXiv:2211.09527(2022)

  39. [39]

    Julien Piet, Maha Alrashed, Chawin Sitawarin, Sizhe Chen, Zeming Wei, Elizabeth Sun, Basel Alomair, and David Wagner. 2024. Jatmo: Prompt injection defense by task-specific finetuning. InEuropean Symposium on Research in Computer Security. Springer, 105–124

  40. [40]

    Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, and Prateek Mittal. 2024. Visual adversarial examples jailbreak aligned large language models. InProceedings of the AAAI conference on artificial intelligence, Vol. 38. 21527–21536

  41. [41]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. InInternational conference on machine learning. PmLR, 8748–8763

  42. [42]

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen

  43. [43]

    Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.061251, 2 (2022), 3

  44. [44]

    Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). 3982–3992

  45. [45]

    Rafael Reisenhofer, Sebastian Bosse, Gitta Kutyniok, and Thomas Wiegand. 2018. A Haar wavelet-based perceptual similarity index for image quality assessment. Signal Processing: Image Communication61 (2018), 33–43

  46. [46]

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10684–10695

  47. [47]

    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al

  48. [48]

    Imagenet large scale visual recognition challenge.International journal of computer vision115, 3 (2015), 211–252

  49. [49]

    do anything now

    Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. 2024. "do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. InProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. 1671–1685

  50. [50]

    Jiawen Shi, Zenghui Yuan, Yinuo Liu, Yue Huang, Pan Zhou, Lichao Sun, and Neil Zhenqiang Gong. 2024. Optimization-based prompt injection attack to llm- as-a-judge. InProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. 660–674

  51. [51]

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. 2019. Towards vqa models that can read. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 8317–8326

  52. [52]

    Dawn Song, Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Florian Tramer, Atul Prakash, and Tadayoshi Kohno. 2018. Physical adversarial examples for object detectors. In12th USENIX workshop on offensive technologies (WOOT 18)

  53. [53]

    Jiachen Sun, Changsheng Wang, Jiongxiao Wang, Yiwei Zhang, and Chaowei Xiao. 2024. Safeguarding vision-language models against patched visual prompt injectors.arXiv preprint arXiv:2405.10529(2024)

  54. [54]

    Qwen Team. 2025. Qwen2.5-VL. https://qwenlm.github.io/blog/qwen2.5-vl/

  55. [55]

    Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, et al. [n. d.]. Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game. InThe Twelfth International Conference on Learning Representations

  56. [56]

    Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. 2021. Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems34 (2021), 200–212

  57. [57]

    Haodi Wang, Kai Dong, Zhilei Zhu, Haotong Qin, Aishan Liu, Xiaolin Fang, Jiakai Wang, and Xianglong Liu. 2024. Transferable multimodal attack on vision- language pre-training models. In2024 IEEE Symposium on Security and Privacy (SP). IEEE, 1722–1740

  58. [58]

    Le Wang, Zonghao Ying, Tianyuan Zhang, Siyuan Liang, Shengshan Hu, Mingchuan Zhang, Aishan Liu, and Xianglong Liu. 2025. Manipulating multi- modal agents via cross-modal prompt injection. InProceedings of the 33rd ACM International Conference on Multimedia. 10955–10964

  59. [59]

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. 2024. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution.arXiv preprint arXiv:2409.12191(2024)

  60. [60]

    Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Song XiXuan, et al. 2024. Cogvlm: Visual expert for pretrained language models.Advances in Neural Information Processing Systems 37 (2024), 121475–121499

  61. [61]

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. 2004. Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing13, 4 (2004), 600–612

  62. [62]

    Zhibo Wang, Hengchang Guo, Zhifei Zhang, Wenxin Liu, Zhan Qin, and Kui Ren

  63. [63]

    InProceedings of the IEEE/CVF international conference on computer vision

    Feature importance-aware transferable adversarial attacks. InProceedings of the IEEE/CVF international conference on computer vision. 7639–7648

  64. [64]

    Simon Willison. 2022. Prompt injection attacks against GPT-3. https:// simonwillison.net/2022/Sep/12/prompt-injection/

  65. [65]

    Simon Willison. 2023. Delimiters won’t save you from prompt injection. https: //simonwillison.net/2023/May/11/delimiters-wont-save-you/

  66. [66]

    Chen Henry Wu, Rishi Rajesh Shah, Jing Yu Koh, Russ Salakhutdinov, Daniel Fried, and Aditi Raghunathan. [n. d.]. Dissecting Adversarial Robustness of Multimodal LM Agents. InThe Thirteenth International Conference on Learning Representations

  67. [67]

    Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. 2024. Next-gpt: Any-to-any multimodal llm. InForty-first International Conference on Machine Learning

  68. [68]

    Cihang Xie, Jianyu Wang, Zhishuai Zhang, Zhou Ren, and Alan Yuille. 2018. Mit- igating Adversarial Effects Through Randomization. InInternational Conference on Learning Representations

  69. [69]

    Chejian Xu, Mintong Kang, Jiawei Zhang, Zeyi Liao, Lingbo Mo, Mengqi Yuan, Huan Sun, and Bo Li. 2024. Advweb: Controllable black-box attacks on vlm- powered web agents. (2024)

  70. [70]

    Hai Yan, Haijian Ma, Xiaowen Cai, Daizong Liu, Zenghui Yuan, Xiaoye Qu, Jianfeng Dong, Runwei Guan, Xiang Fang, Hongyang He, et al . 2025. Fit the Distribution: Cross-Image/Prompt Adversarial Attacks on Multimodal Large Language Models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

  71. [71]

    Dingchen Yang, Bowen Cao, Anran Zhang, Weibo Gu, Winston Hu, and Guang Chen. 2025. Beyond intermediate states: Explaining visual redundancy through language.arXiv preprint arXiv:2503.20540(2025)

  72. [72]

    Zuopeng Yang, Jiluan Fan, Anli Yan, Erdun Gao, Xin Lin, Tao Li, Kanghua Mo, and Changyu Dong. 2025. Distraction is all you need for multimodal large language model jailbreaking. InProceedings of the Computer Vision and Pattern Recognition Conference. 9467–9476

  73. [73]

    Zhuoran Yu and Yong Jae Lee. 2025. How Multimodal LLMs Solve Image Tasks: A Lens on Visual Grounding, Task Reasoning, and Answer Decoding. InSecond Conference on Language Modeling

  74. [74]

    Zhiyuan Yu, Xiaogeng Liu, Shunning Liang, Zach Cameron, Chaowei Xiao, and Ning Zhang. 2024. Don’t listen to me: Understanding and exploring jailbreak prompts of large language models. In33rd USENIX Security Symposium (USENIX Security 24). 4675–4692

  75. [75]

    Andrew Yuan, Alina Oprea, and Cheng Tan. 2024. Dropout attacks. In2024 IEEE Symposium on Security and Privacy (SP). IEEE, 1255–1269

  76. [76]

    Duzhen Zhang, Yahan Yu, Jiahua Dong, Chenxing Li, Dan Su, Chenhui Chu, and Dong Yu. 2024. Mm-llms: Recent advances in multimodal large language models. arXiv preprint arXiv:2401.13601(2024)

  77. [77]

    Lin Zhang, Lei Zhang, Xuanqin Mou, and David Zhang. 2011. FSIM: A feature sim- ilarity index for image quality assessment.IEEE transactions on Image Processing 20, 8 (2011), 2378–2386

  78. [78]

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang

  79. [79]

    InProceedings of the IEEE conference on computer vision and pattern recognition

    The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition. 586–595

  80. [80]

    Xinwei Zhang, Li Bai, Tianwei Zhang, Youqian Zhang, Qingqing Ye, Yingnan Zhao, Ruochen Du, and Haibo Hu. 2026. Understanding and Enhancing Encoder- based Adversarial Transferability against Large Vision-Language Models.arXiv preprint arXiv:2602.09431(2026)

Showing first 80 references.