A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation

Guancheng Wang; Hao Yang; Jianfeng Ma; Yang Liu; Yilong Yang; Zhuo Ma

arxiv: 2605.16090 · v1 · pith:CODUHPBNnew · submitted 2026-05-15 · 💻 cs.CR · cs.CV

A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation

Hao Yang , Zhuo Ma , Yang Liu , Yilong Yang , Guancheng Wang , JianFeng Ma This is my paper

Pith reviewed 2026-05-20 17:16 UTC · model grok-4.3

classification 💻 cs.CR cs.CV

keywords cross-modal prompt injectionlarge vision-language modelsimage perturbationadversarial attackmultimodal securityhidden state optimizationprompt injection

0 comments

The pith

Perturbing only an image can steer how large vision-language models interpret both pictures and text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CrossMPI, an attack that achieves cross-modal prompt injection by changing the image input alone. It shifts the optimization target from the visual embedding space to the model's much larger hidden state space, where multimodal information integrates. The method selects middle layers for the perturbation because they prove critical for this integration, and applies a distance-decremental budget to limit pixel changes away from semantic regions. A sympathetic reader would care because prior attacks either stay within one modality or fail to transfer effects across modalities, leaving a gap in understanding the full attack surface of deployed LVLMs.

Core claim

CrossMPI steers the model's interpretation of both textual and visual inputs via image-only prompt injection by moving optimization to the hidden state space, selecting middle layers as the optimal location for multimodal integration, and using a distance-decremental perturbation budget assignment to constrain the image space.

What carries the argument

Hidden-state optimization at middle layers with distance-decremental budget assignment, which moves the attack focus to the larger parameter space responsible for fusing image and text information.

If this is right

The attack succeeds across multiple LVLMs and datasets with higher rates than single-modality baselines.
Middle layers outperform last layers for this type of hidden-state perturbation in LVLMs.
The budget strategy constrains the optimization space while preserving attack performance.
Image-only changes can produce effects on textual interpretation that were previously thought to require text input changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Defenses could focus monitoring or regularization on middle-layer hidden states rather than input embeddings alone.
The layer preference suggests multimodal fusion occurs primarily in the middle of these transformer stacks.
Similar image-only cross-modal techniques might extend to other paired modalities such as audio-text models.
Security audits of LVLMs should test for transfer of perturbations from one modality to another.

Load-bearing premise

That middle layers are the best location for perturbation and that the distance-decremental budget will keep image changes small enough to remain effective while achieving cross-modal steering.

What would settle it

An experiment that applies the same image perturbation but targets the final layers instead of middle layers and measures whether cross-modal prompt injection success rate drops sharply.

Figures

Figures reproduced from arXiv: 2605.16090 by Guancheng Wang, Hao Yang, Jianfeng Ma, Yang Liu, Yilong Yang, Zhuo Ma.

**Figure 2.** Figure 2: Layer-wise hidden state classification accuracy un [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of our attack. CrossMPI consists of three parts: (1) fusion-critical layer selection, which localizes the layers where visual evidence and textual intent are integrated; (2) perturbation budget assignment, which uses Grad-ECLIP saliency and distance-decremental weighting to construct a perturbation budget mask for budget reallocation; (3) Cross-Modal Perturbation Optimization, which jointly optimi… view at source ↗

**Figure 4.** Figure 4: Heatmap of CrossMPI across LVLMs. Perturbations are optimized on MiniGPT4-llama2, MiniGPT4-vicuna, InstructBLIP, and Qwen2-VL, and then evaluated on target models. the weakest overall, achieving only 4.41% average ASR. Although it embeds malicious textual instructions into the image and uses prompts that encourage the model to read image text, this strategy does not reliably trigger the desired cross-mod… view at source ↗

**Figure 5.** Figure 5: Ablation of perturbation budget and optimized [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 7.** Figure 7: Layer-wise hidden state classification accuracy un [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗

**Figure 8.** Figure 8: Layer-wise hidden state classification accuracy un [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

**Figure 9.** Figure 9: Visual comparison of attacked images optimized by different attacks. [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗

**Figure 10.** Figure 10: Visualization of perturbation budget masks under different values of [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗

read the original abstract

Large vision-language models (LVLMs) have emerged as a powerful paradigm for multimodal intelligence, but their growing deployment also expands the attack surface of prompt injection. Despite this growing concern, existing attacks still suffer from a critical limitation: the injected prompt for one modality only steers the model's interpretation of that singular input. Alternatively, these attacks remain multimodal but fail to achieve cross-modal prompt perturbation. To bridge this gap, we introduce a novel cross-modal prompt injection attack CrossMPI, which can steer the model's interpretation of both textual and visual inputs via image-only prompt injection. Our design is underpinned by the following key breakthroughs. First, we turn the focus of the injected prompt perturbation optimization from the visual embedding space (typically with only $10^5$ parameters) to the model hidden state space (for multimodal information integration and with $10^7$ parameters). Then, two strategies are adopted to mitigate the optimization challenges posed by the larger parameter space. To constrain the optimized model parameter space, we introduce a layer selection strategy that identifies the layers most critical to multimodal integration. Interestingly, deviating from the past experience, our analysis reveals that the optimal layers for LVLM prompt perturbation reside in the middle of the model rather than the last. To constrain the image perturbation space, we propose a new distance-decremental perturbation budget assignment strategy that allocates budgets decrementally as the pixel distance to semantic-critical regions increases. Extensive experiments across multiple LVLMs and datasets show that our method significantly outperforms baseline approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shifts prompt-injection optimization to middle-layer hidden states with a distance-decremental budget to get cross-modal steering from image pixels alone, but the abstract gives no numbers or ablations to show this beats simpler last-layer baselines on textual metrics.

read the letter

The main point is that they move the perturbation target from visual embeddings to hidden states at middle layers and add a layer-selection step plus a distance-decremental budget so that image changes can influence how the model reads the accompanying text. That combination is not in the prior single-modality work they cite, and the claim that middle layers are better for multimodal integration is a concrete departure from the usual last-layer focus in adversarial examples for transformers. If the full experiments show measurable changes in text-token behavior that cannot be matched by uniform budgets or final-layer attacks, the idea is worth following up. The paper does a reasonable job laying out why the larger hidden-state space needs these two constraints and why the budget should shrink away from semantic regions. Those design choices are explained plainly and tied to the multimodal integration problem. The soft spots are exactly where the stress-test note flags them. The abstract reports outperformance across LVLMs and datasets but supplies no accuracy deltas, no statistical tests, no ablation on layer choice, and no direct comparison of textual interpretation metrics between middle-layer and last-layer targeting. Without those numbers it is impossible to tell whether the cross-modal effect is real or whether the middle-layer choice mainly helps visual tasks while text steering stays comparable to baselines. The distance-decremental schedule is presented as a solution to the larger search space, yet there is no evidence shown that it is necessary or that a simpler uniform budget fails. This is an empirical optimization paper, so the central claim lives or dies on the unreported details. The work is aimed at people who already follow prompt-injection and multimodal security papers. A reader who wants concrete attack recipes or reproducible code would get limited value until the numbers appear. It is coherent enough on its own terms to deserve a serious referee who can ask for the missing ablations and metric tables rather than a desk reject.

Referee Report

2 major / 2 minor

Summary. The paper introduces CrossMPI, a cross-modal prompt injection attack on large vision-language models (LVLMs) that uses image-only perturbations to steer the model's interpretation of both visual and textual inputs. It shifts optimization focus from visual embedding space to hidden state space at selected middle layers (identified via analysis as optimal for multimodal integration, contrary to prior last-layer focus), and applies a distance-decremental perturbation budget to constrain the image space while targeting semantic regions. Extensive experiments across multiple LVLMs and datasets are reported to show significant outperformance over baselines.

Significance. If the empirical results hold with supporting ablations and metrics, the work would be significant for demonstrating a new attack surface in LVLMs where single-modality (image) perturbations can achieve cross-modal effects on text interpretation. The layer selection insight and budget strategy provide concrete techniques that could guide both attacks and defenses in multimodal security. Credit is due for the broad experimental scope across models and datasets, which strengthens the practical relevance of the findings.

major comments (2)

[Analysis of optimal layers] Analysis section on layer selection: the claim that middle layers are optimal for multimodal integration and yield superior cross-modal steering is load-bearing for the novelty, yet the manuscript provides no ablation comparing middle-layer hidden-state targeting against last-layer targeting specifically on textual interpretation metrics (e.g., success rate in altering responses to the text prompt). Without this, it remains unclear whether the cross-modal effect exceeds what simpler last-layer attacks achieve.
[Perturbation budget strategy] Section describing the distance-decremental budget assignment: the strategy is presented as constraining the perturbation space while preserving attack success, but no quantitative comparison to uniform budget allocation or ablation on textual vs. visual task performance is shown. This assumption is central to feasibility of the larger hidden-state optimization and requires explicit validation to support the cross-modal claim.

minor comments (2)

[Abstract] The abstract summarizes outperformance but omits any numerical results, baseline names, or statistical details; including at least one key metric (e.g., attack success rate improvement) would improve clarity for readers.
[Method] Notation for the hidden-state optimization objective and the exact definition of the distance-decremental schedule could be formalized with an equation to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate additional ablations that directly respond to the concerns raised.

read point-by-point responses

Referee: [Analysis of optimal layers] Analysis section on layer selection: the claim that middle layers are optimal for multimodal integration and yield superior cross-modal steering is load-bearing for the novelty, yet the manuscript provides no ablation comparing middle-layer hidden-state targeting against last-layer targeting specifically on textual interpretation metrics (e.g., success rate in altering responses to the text prompt). Without this, it remains unclear whether the cross-modal effect exceeds what simpler last-layer attacks achieve.

Authors: We acknowledge that our layer selection analysis focused on the impact of different layers on multimodal integration but did not include a direct ablation of attack success rates on textual interpretation metrics when comparing middle-layer versus last-layer hidden-state targeting. To strengthen the evidence for the cross-modal advantage, we will add this specific ablation in the revised manuscript, reporting success rates for altering text-prompt responses under both targeting strategies. revision: yes
Referee: [Perturbation budget strategy] Section describing the distance-decremental budget assignment: the strategy is presented as constraining the perturbation space while preserving attack success, but no quantitative comparison to uniform budget allocation or ablation on textual vs. visual task performance is shown. This assumption is central to feasibility of the larger hidden-state optimization and requires explicit validation to support the cross-modal claim.

Authors: We agree that explicit validation of the distance-decremental budget is needed. We will add quantitative comparisons of attack performance under the distance-decremental strategy versus uniform budget allocation, including separate metrics for visual and textual task success. These results will be included in the revised version to demonstrate how the strategy supports effective optimization in the hidden-state space. revision: yes

Circularity Check

0 steps flagged

Empirical layer selection and budget strategy are data-driven with no definitional reduction

full rationale

The paper's core contributions rest on empirical analysis to select middle layers for hidden-state perturbation and a distance-decremental budget for image constraints. These choices are presented as outcomes of their own investigation rather than inputs that are redefined as predictions. No equations, fitted parameters, or self-citations are shown to force the claimed cross-modal success by construction. Experiments on multiple LVLMs and datasets provide external validation, keeping the derivation self-contained against benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 1 invented entities

The method rests on empirical identification of middle layers and a heuristic budget allocation rule rather than closed-form derivation; no new physical entities are postulated.

free parameters (2)

Layer selection rule
Identifies middle layers as critical for multimodal integration based on internal analysis.
Distance-decremental budget schedule
Allocates perturbation magnitude according to pixel distance from semantic regions.

axioms (1)

domain assumption Middle layers are optimal for cross-modal prompt perturbation in LVLMs
Stated as a finding that deviates from prior experience with last-layer perturbations.

invented entities (1)

CrossMPI attack procedure no independent evidence
purpose: Achieve cross-modal steering via image-only hidden-state optimization
Newly introduced method combining layer selection and distance-based budgeting.

pith-pipeline@v0.9.0 · 5818 in / 1268 out tokens · 101721 ms · 2026-05-20T17:16:14.959122+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we turn the focus of the injected prompt perturbation optimization from the visual embedding space ... to the model hidden state space ... fusion-critical layer selection strategy ... distance-decremental perturbation budget assignment strategy
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

middle layers are where textual intent and visual evidence are actively fused into task-level representations

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · 6 internal anchors

[1]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Floren- cia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report.arXiv preprint arXiv:2303.08774 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al

work page
[3]

Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems35 (2022), 23716–23736

work page 2022
[4]

Hezekiah J Branch, Jonathan Rodriguez Cefalu, Jeremy McHugh, Leyla Hujer, Aditya Bahl, Daniel del Castillo Iglesias, Ron Heichman, and Ramesh Darwishi

work page
[5]

Evaluating the susceptibility of pre-trained language models via handcrafted adversarial examples.arXiv preprint arXiv:2209.02128(2022)

work page arXiv 2022
[6]

Xiaowen Cai, Daizong Liu, Xiaoye Qu, Xiang Fang, Jianfeng Dong, Keke Tang, Pan Zhou, Lichao Sun, and Wei Hu. 2025. Towards Building Model/Prompt- Transferable Attackers against Large Vision-Language Models. InThe Thirty- ninth Annual Conference on Neural Information Processing Systems

work page 2025
[7]

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. 2024. Sharegpt4v: Improving large multi-modal models with better captions. InEuropean Conference on Computer Vision. Springer, 370– 387

work page 2024
[8]

Xiangxiang Chen, Peixin Zhang, Jun Sun, Wenhai Wang, and Jingyi Wang. 2025. Rounding-Guided Backdoor Injection in Deep Learning Model Quantization. arXiv preprint arXiv:2510.09647(2025)

work page arXiv 2025
[9]

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. 2023. Instructblip: Towards general-purpose vision-language models with instruction tuning.Advances in neural information processing systems36 (2023), 49250–49267

work page 2023
[10]

Iuri Frosio and Jan Kautz. 2023. The best defense is a good offense: Adver- sarial augmentation against adversarial attacks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 4067–4076

work page 2023
[11]

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh

work page
[12]

InProceedings of the IEEE conference on computer vision and pattern recognition

Making the v in vqa matter: Elevating the role of image understanding in visual question answering. InProceedings of the IEEE conference on computer vision and pattern recognition. 6904–6913

work page
[13]

Chuan Guo, Mayank Rana, Moustapha Cisse, and Laurens van der Maaten. 2018. Countering Adversarial Images using Input Transformations. InInternational Conference on Learning Representations

work page 2018
[14]

Rich Harang. 2023. Securing LLM Systems Against Prompt Injection. https: //developer.nvidia.com/blog/securing-llm-systems-against-prompt-injection

work page 2023
[15]

Wenbo Hu, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, and Zhuowen Tu. 2024. Bliva: A simple multimodal llm for better handling of text-rich visual questions. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 2256–2264

work page 2024
[16]

Yuke Hu, Zheng Li, Zhihao Liu, Yang Zhang, Zhan Qin, Kui Ren, and Chun Chen

work page
[17]

Membership inference attacks against vision-language models.arXiv preprint arXiv:2501.18624(2025)

work page arXiv 2025
[18]

Andrey Labunets, Nishit V Pandya, Ashish Hooda, Xiaohan Fu, and Earlence Fernandes. 2025. Fun-tuning: Characterizing the vulnerability of proprietary llms to optimization-based prompt injection attacks via the fine-tuning interface. In2025 IEEE Symposium on Security and Privacy (SP). IEEE, 411–429

work page 2025
[19]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning. PMLR, 19730–19742

work page 2023
[20]

Yanjie Li, Yiming Cao, Dong Wang, and Bin Xiao. 2025. AgentTypo: Adaptive Typographic Prompt Injection Attacks against Black-box Multimodal Agents. arXiv preprint arXiv:2510.04257(2025)

work page arXiv 2025
[21]

Zhan Li, Yongtao Wu, Yihang Chen, Francesco Tonin, Elias Abad Rocamora, and Volkan Cevher. 2024. Membership inference attacks against large vision- language models.Advances in Neural Information Processing Systems37 (2024), 98645–98674

work page 2024
[22]

Zhichao Li, Hongshan Yang, Zhibo Wang, Huiyu Xu, Junhong Lai, Yaopeng Wang, Kui Ren, and Chun Chen. [n. d.]. On Evaluating the Robustness of Large Vision- Language Models via Untargeted Modality Alignment Breaking Adversarial Attack. ([n. d.])

work page
[23]

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. InEuropean conference on computer vision. Springer, 740–755

work page 2014
[24]

Daizong Liu, Mingyu Yang, Xiaoye Qu, Pan Zhou, Xiang Fang, Keke Tang, Yao Wan, and Lichao Sun. 2024. Pandora’s box: Towards building universal attackers against real-world large vision-language models.Advances in Neural Information Processing Systems37 (2024), 52127–52158

work page 2024
[25]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual in- struction tuning.Advances in neural information processing systems36 (2023), 34892–34916

work page 2023
[26]

Xiaogeng Liu, Zhiyuan Yu, Yizhe Zhang, Ning Zhang, and Chaowei Xiao. 2024. Automatic and universal prompt injection attacks against large language models. arXiv preprint arXiv:2403.04957(2024)

work page arXiv 2024
[27]

Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. 2024. Formalizing and benchmarking prompt injection attacks and defenses. In33rd USENIX Security Symposium (USENIX Security 24). 1831–1847

work page 2024
[28]

Zhaoyi Liu and Huan Zhang. 2025. Stealthy Backdoor Attack in Self-Supervised Learning Vision Encoders for Large Vision Language Models. InProceedings of the Computer Vision and Pattern Recognition Conference. 25060–25070

work page 2025
[29]

Haochen Luo, Jindong Gu, Fengyuan Liu, and Philip Torr. 2024. An image is worth 1000 lies: Adversarial transferability across prompts on vision-language models.arXiv preprint arXiv:2403.09766(2024)

work page arXiv 2024
[30]

Peizhuo Lv, Chang Yue, Ruigang Liang, Yunfei Yang, Shengzhi Zhang, Hualong Ma, and Kai Chen. 2023. A data-free backdoor injection approach in neural networks. In32nd USENIX Security Symposium (USENIX Security 23). 2671–2688

work page 2023
[31]

Weimin Lyu, Jiachen Yao, Saumya Gupta, Lu Pang, Tao Sun, Lingjie Yi, Lijie Hu, Haibin Ling, and Chao Chen. [n. d.]. Backdooring Vision-Language Models with Out-Of-Distribution Data. InThe Thirteenth International Conference on Learning Representations

work page
[32]

Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, and Fazl Barez

work page
[33]

InThe Thirteenth International Conference on Learning Representations

Towards Interpreting Visual Information Processing in Vision-Language Models. InThe Thirteenth International Conference on Learning Representations

work page
[34]

Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. 2022. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. InInternational Conference on Machine Learning. PMLR, 16784–16804

work page 2022
[35]

31st Aug 2020

Nostalgebraist. 31st Aug 2020. Interpreting GPT: The logit lens. https://www.alignmentforum.org/posts/AcKRB8wDpdaN6v6ru/interpreting- gpt-the-logit-lens

work page 2020
[36]

OWASP. 2023. OWASP Top 10 for Large Language Model Applica- tions. https://owasp.org/www-project-top-10-for-large-language-model- applications/assets/PDF/OWASP-Top-10-for-LLMs-2023-v1_1.pdf

work page 2023
[37]

Dario Pasquini, Martin Strohmeier, and Carmela Troncoso. 2024. Neural exec: Learning (and learning from) execution triggers for prompt injection attacks. In Proceedings of the 2024 Workshop on Artificial Intelligence and Security. 89–100. Conference’17, July 2017, Washington, DC, USA Hao Yang, Zhuo Ma, Yang Liu, Yilong Yang, Guancheng Wang, and JianFeng Ma

work page 2024
[38]

Fábio Perez and Ian Ribeiro. 2022. Ignore previous prompt: Attack techniques for language models.arXiv preprint arXiv:2211.09527(2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[39]

Julien Piet, Maha Alrashed, Chawin Sitawarin, Sizhe Chen, Zeming Wei, Elizabeth Sun, Basel Alomair, and David Wagner. 2024. Jatmo: Prompt injection defense by task-specific finetuning. InEuropean Symposium on Research in Computer Security. Springer, 105–124

work page 2024
[40]

Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, and Prateek Mittal. 2024. Visual adversarial examples jailbreak aligned large language models. InProceedings of the AAAI conference on artificial intelligence, Vol. 38. 21527–21536

work page 2024
[41]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. InInternational conference on machine learning. PmLR, 8748–8763

work page 2021
[42]

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen

work page
[43]

Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.061251, 2 (2022), 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[44]

Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). 3982–3992

work page 2019
[45]

Rafael Reisenhofer, Sebastian Bosse, Gitta Kutyniok, and Thomas Wiegand. 2018. A Haar wavelet-based perceptual similarity index for image quality assessment. Signal Processing: Image Communication61 (2018), 33–43

work page 2018
[46]

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10684–10695

work page 2022
[47]

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al

work page
[48]

Imagenet large scale visual recognition challenge.International journal of computer vision115, 3 (2015), 211–252

work page 2015
[49]

do anything now

Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. 2024. "do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. InProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. 1671–1685

work page 2024
[50]

Jiawen Shi, Zenghui Yuan, Yinuo Liu, Yue Huang, Pan Zhou, Lichao Sun, and Neil Zhenqiang Gong. 2024. Optimization-based prompt injection attack to llm- as-a-judge. InProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. 660–674

work page 2024
[51]

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. 2019. Towards vqa models that can read. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 8317–8326

work page 2019
[52]

Dawn Song, Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Florian Tramer, Atul Prakash, and Tadayoshi Kohno. 2018. Physical adversarial examples for object detectors. In12th USENIX workshop on offensive technologies (WOOT 18)

work page 2018
[53]

Jiachen Sun, Changsheng Wang, Jiongxiao Wang, Yiwei Zhang, and Chaowei Xiao. 2024. Safeguarding vision-language models against patched visual prompt injectors.arXiv preprint arXiv:2405.10529(2024)

work page arXiv 2024
[54]

Qwen Team. 2025. Qwen2.5-VL. https://qwenlm.github.io/blog/qwen2.5-vl/

work page 2025
[55]

Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, et al. [n. d.]. Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game. InThe Twelfth International Conference on Learning Representations

work page
[56]

Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. 2021. Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems34 (2021), 200–212

work page 2021
[57]

Haodi Wang, Kai Dong, Zhilei Zhu, Haotong Qin, Aishan Liu, Xiaolin Fang, Jiakai Wang, and Xianglong Liu. 2024. Transferable multimodal attack on vision- language pre-training models. In2024 IEEE Symposium on Security and Privacy (SP). IEEE, 1722–1740

work page 2024
[58]

Le Wang, Zonghao Ying, Tianyuan Zhang, Siyuan Liang, Shengshan Hu, Mingchuan Zhang, Aishan Liu, and Xianglong Liu. 2025. Manipulating multi- modal agents via cross-modal prompt injection. InProceedings of the 33rd ACM International Conference on Multimedia. 10955–10964

work page 2025
[59]

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. 2024. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution.arXiv preprint arXiv:2409.12191(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[60]

Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Song XiXuan, et al. 2024. Cogvlm: Visual expert for pretrained language models.Advances in Neural Information Processing Systems 37 (2024), 121475–121499

work page 2024
[61]

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. 2004. Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing13, 4 (2004), 600–612

work page 2004
[62]

Zhibo Wang, Hengchang Guo, Zhifei Zhang, Wenxin Liu, Zhan Qin, and Kui Ren

work page
[63]

InProceedings of the IEEE/CVF international conference on computer vision

Feature importance-aware transferable adversarial attacks. InProceedings of the IEEE/CVF international conference on computer vision. 7639–7648

work page
[64]

Simon Willison. 2022. Prompt injection attacks against GPT-3. https:// simonwillison.net/2022/Sep/12/prompt-injection/

work page 2022
[65]

Simon Willison. 2023. Delimiters won’t save you from prompt injection. https: //simonwillison.net/2023/May/11/delimiters-wont-save-you/

work page 2023
[66]

Chen Henry Wu, Rishi Rajesh Shah, Jing Yu Koh, Russ Salakhutdinov, Daniel Fried, and Aditi Raghunathan. [n. d.]. Dissecting Adversarial Robustness of Multimodal LM Agents. InThe Thirteenth International Conference on Learning Representations

work page
[67]

Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. 2024. Next-gpt: Any-to-any multimodal llm. InForty-first International Conference on Machine Learning

work page 2024
[68]

Cihang Xie, Jianyu Wang, Zhishuai Zhang, Zhou Ren, and Alan Yuille. 2018. Mit- igating Adversarial Effects Through Randomization. InInternational Conference on Learning Representations

work page 2018
[69]

Chejian Xu, Mintong Kang, Jiawei Zhang, Zeyi Liao, Lingbo Mo, Mengqi Yuan, Huan Sun, and Bo Li. 2024. Advweb: Controllable black-box attacks on vlm- powered web agents. (2024)

work page 2024
[70]

Hai Yan, Haijian Ma, Xiaowen Cai, Daizong Liu, Zenghui Yuan, Xiaoye Qu, Jianfeng Dong, Runwei Guan, Xiang Fang, Hongyang He, et al . 2025. Fit the Distribution: Cross-Image/Prompt Adversarial Attacks on Multimodal Large Language Models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

work page 2025
[71]

Dingchen Yang, Bowen Cao, Anran Zhang, Weibo Gu, Winston Hu, and Guang Chen. 2025. Beyond intermediate states: Explaining visual redundancy through language.arXiv preprint arXiv:2503.20540(2025)

work page arXiv 2025
[72]

Zuopeng Yang, Jiluan Fan, Anli Yan, Erdun Gao, Xin Lin, Tao Li, Kanghua Mo, and Changyu Dong. 2025. Distraction is all you need for multimodal large language model jailbreaking. InProceedings of the Computer Vision and Pattern Recognition Conference. 9467–9476

work page 2025
[73]

Zhuoran Yu and Yong Jae Lee. 2025. How Multimodal LLMs Solve Image Tasks: A Lens on Visual Grounding, Task Reasoning, and Answer Decoding. InSecond Conference on Language Modeling

work page 2025
[74]

Zhiyuan Yu, Xiaogeng Liu, Shunning Liang, Zach Cameron, Chaowei Xiao, and Ning Zhang. 2024. Don’t listen to me: Understanding and exploring jailbreak prompts of large language models. In33rd USENIX Security Symposium (USENIX Security 24). 4675–4692

work page 2024
[75]

Andrew Yuan, Alina Oprea, and Cheng Tan. 2024. Dropout attacks. In2024 IEEE Symposium on Security and Privacy (SP). IEEE, 1255–1269

work page 2024
[76]

Duzhen Zhang, Yahan Yu, Jiahua Dong, Chenxing Li, Dan Su, Chenhui Chu, and Dong Yu. 2024. Mm-llms: Recent advances in multimodal large language models. arXiv preprint arXiv:2401.13601(2024)

work page arXiv 2024
[77]

Lin Zhang, Lei Zhang, Xuanqin Mou, and David Zhang. 2011. FSIM: A feature sim- ilarity index for image quality assessment.IEEE transactions on Image Processing 20, 8 (2011), 2378–2386

work page 2011
[78]

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang

work page
[79]

InProceedings of the IEEE conference on computer vision and pattern recognition

The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition. 586–595

work page
[80]

Xinwei Zhang, Li Bai, Tianwei Zhang, Youqian Zhang, Qingqing Ye, Yingnan Zhao, Ruochen Du, and Haibo Hu. 2026. Understanding and Enhancing Encoder- based Adversarial Transferability against Large Vision-Language Models.arXiv preprint arXiv:2602.09431(2026)

work page internal anchor Pith review arXiv 2026

Showing first 80 references.

[1] [1]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Floren- cia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report.arXiv preprint arXiv:2303.08774 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al

work page

[3] [3]

Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems35 (2022), 23716–23736

work page 2022

[4] [4]

Hezekiah J Branch, Jonathan Rodriguez Cefalu, Jeremy McHugh, Leyla Hujer, Aditya Bahl, Daniel del Castillo Iglesias, Ron Heichman, and Ramesh Darwishi

work page

[5] [5]

Evaluating the susceptibility of pre-trained language models via handcrafted adversarial examples.arXiv preprint arXiv:2209.02128(2022)

work page arXiv 2022

[6] [6]

Xiaowen Cai, Daizong Liu, Xiaoye Qu, Xiang Fang, Jianfeng Dong, Keke Tang, Pan Zhou, Lichao Sun, and Wei Hu. 2025. Towards Building Model/Prompt- Transferable Attackers against Large Vision-Language Models. InThe Thirty- ninth Annual Conference on Neural Information Processing Systems

work page 2025

[7] [7]

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. 2024. Sharegpt4v: Improving large multi-modal models with better captions. InEuropean Conference on Computer Vision. Springer, 370– 387

work page 2024

[8] [8]

Xiangxiang Chen, Peixin Zhang, Jun Sun, Wenhai Wang, and Jingyi Wang. 2025. Rounding-Guided Backdoor Injection in Deep Learning Model Quantization. arXiv preprint arXiv:2510.09647(2025)

work page arXiv 2025

[9] [9]

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. 2023. Instructblip: Towards general-purpose vision-language models with instruction tuning.Advances in neural information processing systems36 (2023), 49250–49267

work page 2023

[10] [10]

Iuri Frosio and Jan Kautz. 2023. The best defense is a good offense: Adver- sarial augmentation against adversarial attacks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 4067–4076

work page 2023

[11] [11]

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh

work page

[12] [12]

InProceedings of the IEEE conference on computer vision and pattern recognition

Making the v in vqa matter: Elevating the role of image understanding in visual question answering. InProceedings of the IEEE conference on computer vision and pattern recognition. 6904–6913

work page

[13] [13]

Chuan Guo, Mayank Rana, Moustapha Cisse, and Laurens van der Maaten. 2018. Countering Adversarial Images using Input Transformations. InInternational Conference on Learning Representations

work page 2018

[14] [14]

Rich Harang. 2023. Securing LLM Systems Against Prompt Injection. https: //developer.nvidia.com/blog/securing-llm-systems-against-prompt-injection

work page 2023

[15] [15]

Wenbo Hu, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, and Zhuowen Tu. 2024. Bliva: A simple multimodal llm for better handling of text-rich visual questions. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 2256–2264

work page 2024

[16] [16]

Yuke Hu, Zheng Li, Zhihao Liu, Yang Zhang, Zhan Qin, Kui Ren, and Chun Chen

work page

[17] [17]

Membership inference attacks against vision-language models.arXiv preprint arXiv:2501.18624(2025)

work page arXiv 2025

[18] [18]

Andrey Labunets, Nishit V Pandya, Ashish Hooda, Xiaohan Fu, and Earlence Fernandes. 2025. Fun-tuning: Characterizing the vulnerability of proprietary llms to optimization-based prompt injection attacks via the fine-tuning interface. In2025 IEEE Symposium on Security and Privacy (SP). IEEE, 411–429

work page 2025

[19] [19]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning. PMLR, 19730–19742

work page 2023

[20] [20]

Yanjie Li, Yiming Cao, Dong Wang, and Bin Xiao. 2025. AgentTypo: Adaptive Typographic Prompt Injection Attacks against Black-box Multimodal Agents. arXiv preprint arXiv:2510.04257(2025)

work page arXiv 2025

[21] [21]

Zhan Li, Yongtao Wu, Yihang Chen, Francesco Tonin, Elias Abad Rocamora, and Volkan Cevher. 2024. Membership inference attacks against large vision- language models.Advances in Neural Information Processing Systems37 (2024), 98645–98674

work page 2024

[22] [22]

Zhichao Li, Hongshan Yang, Zhibo Wang, Huiyu Xu, Junhong Lai, Yaopeng Wang, Kui Ren, and Chun Chen. [n. d.]. On Evaluating the Robustness of Large Vision- Language Models via Untargeted Modality Alignment Breaking Adversarial Attack. ([n. d.])

work page

[23] [23]

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. InEuropean conference on computer vision. Springer, 740–755

work page 2014

[24] [24]

Daizong Liu, Mingyu Yang, Xiaoye Qu, Pan Zhou, Xiang Fang, Keke Tang, Yao Wan, and Lichao Sun. 2024. Pandora’s box: Towards building universal attackers against real-world large vision-language models.Advances in Neural Information Processing Systems37 (2024), 52127–52158

work page 2024

[25] [25]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual in- struction tuning.Advances in neural information processing systems36 (2023), 34892–34916

work page 2023

[26] [26]

Xiaogeng Liu, Zhiyuan Yu, Yizhe Zhang, Ning Zhang, and Chaowei Xiao. 2024. Automatic and universal prompt injection attacks against large language models. arXiv preprint arXiv:2403.04957(2024)

work page arXiv 2024

[27] [27]

Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. 2024. Formalizing and benchmarking prompt injection attacks and defenses. In33rd USENIX Security Symposium (USENIX Security 24). 1831–1847

work page 2024

[28] [28]

Zhaoyi Liu and Huan Zhang. 2025. Stealthy Backdoor Attack in Self-Supervised Learning Vision Encoders for Large Vision Language Models. InProceedings of the Computer Vision and Pattern Recognition Conference. 25060–25070

work page 2025

[29] [29]

Haochen Luo, Jindong Gu, Fengyuan Liu, and Philip Torr. 2024. An image is worth 1000 lies: Adversarial transferability across prompts on vision-language models.arXiv preprint arXiv:2403.09766(2024)

work page arXiv 2024

[30] [30]

Peizhuo Lv, Chang Yue, Ruigang Liang, Yunfei Yang, Shengzhi Zhang, Hualong Ma, and Kai Chen. 2023. A data-free backdoor injection approach in neural networks. In32nd USENIX Security Symposium (USENIX Security 23). 2671–2688

work page 2023

[31] [31]

Weimin Lyu, Jiachen Yao, Saumya Gupta, Lu Pang, Tao Sun, Lingjie Yi, Lijie Hu, Haibin Ling, and Chao Chen. [n. d.]. Backdooring Vision-Language Models with Out-Of-Distribution Data. InThe Thirteenth International Conference on Learning Representations

work page

[32] [32]

Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, and Fazl Barez

work page

[33] [33]

InThe Thirteenth International Conference on Learning Representations

Towards Interpreting Visual Information Processing in Vision-Language Models. InThe Thirteenth International Conference on Learning Representations

work page

[34] [34]

Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. 2022. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. InInternational Conference on Machine Learning. PMLR, 16784–16804

work page 2022

[35] [35]

31st Aug 2020

Nostalgebraist. 31st Aug 2020. Interpreting GPT: The logit lens. https://www.alignmentforum.org/posts/AcKRB8wDpdaN6v6ru/interpreting- gpt-the-logit-lens

work page 2020

[36] [36]

OWASP. 2023. OWASP Top 10 for Large Language Model Applica- tions. https://owasp.org/www-project-top-10-for-large-language-model- applications/assets/PDF/OWASP-Top-10-for-LLMs-2023-v1_1.pdf

work page 2023

[37] [37]

Dario Pasquini, Martin Strohmeier, and Carmela Troncoso. 2024. Neural exec: Learning (and learning from) execution triggers for prompt injection attacks. In Proceedings of the 2024 Workshop on Artificial Intelligence and Security. 89–100. Conference’17, July 2017, Washington, DC, USA Hao Yang, Zhuo Ma, Yang Liu, Yilong Yang, Guancheng Wang, and JianFeng Ma

work page 2024

[38] [38]

Fábio Perez and Ian Ribeiro. 2022. Ignore previous prompt: Attack techniques for language models.arXiv preprint arXiv:2211.09527(2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[39] [39]

Julien Piet, Maha Alrashed, Chawin Sitawarin, Sizhe Chen, Zeming Wei, Elizabeth Sun, Basel Alomair, and David Wagner. 2024. Jatmo: Prompt injection defense by task-specific finetuning. InEuropean Symposium on Research in Computer Security. Springer, 105–124

work page 2024

[40] [40]

Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, and Prateek Mittal. 2024. Visual adversarial examples jailbreak aligned large language models. InProceedings of the AAAI conference on artificial intelligence, Vol. 38. 21527–21536

work page 2024

[41] [41]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. InInternational conference on machine learning. PmLR, 8748–8763

work page 2021

[42] [42]

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen

work page

[43] [43]

Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.061251, 2 (2022), 3

work page internal anchor Pith review Pith/arXiv arXiv 2022

[44] [44]

Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). 3982–3992

work page 2019

[45] [45]

Rafael Reisenhofer, Sebastian Bosse, Gitta Kutyniok, and Thomas Wiegand. 2018. A Haar wavelet-based perceptual similarity index for image quality assessment. Signal Processing: Image Communication61 (2018), 33–43

work page 2018

[46] [46]

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10684–10695

work page 2022

[47] [47]

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al

work page

[48] [48]

Imagenet large scale visual recognition challenge.International journal of computer vision115, 3 (2015), 211–252

work page 2015

[49] [49]

do anything now

Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. 2024. "do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. InProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. 1671–1685

work page 2024

[50] [50]

Jiawen Shi, Zenghui Yuan, Yinuo Liu, Yue Huang, Pan Zhou, Lichao Sun, and Neil Zhenqiang Gong. 2024. Optimization-based prompt injection attack to llm- as-a-judge. InProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. 660–674

work page 2024

[51] [51]

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. 2019. Towards vqa models that can read. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 8317–8326

work page 2019

[52] [52]

Dawn Song, Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Florian Tramer, Atul Prakash, and Tadayoshi Kohno. 2018. Physical adversarial examples for object detectors. In12th USENIX workshop on offensive technologies (WOOT 18)

work page 2018

[53] [53]

Jiachen Sun, Changsheng Wang, Jiongxiao Wang, Yiwei Zhang, and Chaowei Xiao. 2024. Safeguarding vision-language models against patched visual prompt injectors.arXiv preprint arXiv:2405.10529(2024)

work page arXiv 2024

[54] [54]

Qwen Team. 2025. Qwen2.5-VL. https://qwenlm.github.io/blog/qwen2.5-vl/

work page 2025

[55] [55]

Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, et al. [n. d.]. Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game. InThe Twelfth International Conference on Learning Representations

work page

[56] [56]

Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. 2021. Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems34 (2021), 200–212

work page 2021

[57] [57]

Haodi Wang, Kai Dong, Zhilei Zhu, Haotong Qin, Aishan Liu, Xiaolin Fang, Jiakai Wang, and Xianglong Liu. 2024. Transferable multimodal attack on vision- language pre-training models. In2024 IEEE Symposium on Security and Privacy (SP). IEEE, 1722–1740

work page 2024

[58] [58]

Le Wang, Zonghao Ying, Tianyuan Zhang, Siyuan Liang, Shengshan Hu, Mingchuan Zhang, Aishan Liu, and Xianglong Liu. 2025. Manipulating multi- modal agents via cross-modal prompt injection. InProceedings of the 33rd ACM International Conference on Multimedia. 10955–10964

work page 2025

[59] [59]

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. 2024. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution.arXiv preprint arXiv:2409.12191(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[60] [60]

Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Song XiXuan, et al. 2024. Cogvlm: Visual expert for pretrained language models.Advances in Neural Information Processing Systems 37 (2024), 121475–121499

work page 2024

[61] [61]

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. 2004. Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing13, 4 (2004), 600–612

work page 2004

[62] [62]

Zhibo Wang, Hengchang Guo, Zhifei Zhang, Wenxin Liu, Zhan Qin, and Kui Ren

work page

[63] [63]

InProceedings of the IEEE/CVF international conference on computer vision

Feature importance-aware transferable adversarial attacks. InProceedings of the IEEE/CVF international conference on computer vision. 7639–7648

work page

[64] [64]

Simon Willison. 2022. Prompt injection attacks against GPT-3. https:// simonwillison.net/2022/Sep/12/prompt-injection/

work page 2022

[65] [65]

Simon Willison. 2023. Delimiters won’t save you from prompt injection. https: //simonwillison.net/2023/May/11/delimiters-wont-save-you/

work page 2023

[66] [66]

Chen Henry Wu, Rishi Rajesh Shah, Jing Yu Koh, Russ Salakhutdinov, Daniel Fried, and Aditi Raghunathan. [n. d.]. Dissecting Adversarial Robustness of Multimodal LM Agents. InThe Thirteenth International Conference on Learning Representations

work page

[67] [67]

Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. 2024. Next-gpt: Any-to-any multimodal llm. InForty-first International Conference on Machine Learning

work page 2024

[68] [68]

Cihang Xie, Jianyu Wang, Zhishuai Zhang, Zhou Ren, and Alan Yuille. 2018. Mit- igating Adversarial Effects Through Randomization. InInternational Conference on Learning Representations

work page 2018

[69] [69]

Chejian Xu, Mintong Kang, Jiawei Zhang, Zeyi Liao, Lingbo Mo, Mengqi Yuan, Huan Sun, and Bo Li. 2024. Advweb: Controllable black-box attacks on vlm- powered web agents. (2024)

work page 2024

[70] [70]

Hai Yan, Haijian Ma, Xiaowen Cai, Daizong Liu, Zenghui Yuan, Xiaoye Qu, Jianfeng Dong, Runwei Guan, Xiang Fang, Hongyang He, et al . 2025. Fit the Distribution: Cross-Image/Prompt Adversarial Attacks on Multimodal Large Language Models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

work page 2025

[71] [71]

Dingchen Yang, Bowen Cao, Anran Zhang, Weibo Gu, Winston Hu, and Guang Chen. 2025. Beyond intermediate states: Explaining visual redundancy through language.arXiv preprint arXiv:2503.20540(2025)

work page arXiv 2025

[72] [72]

Zuopeng Yang, Jiluan Fan, Anli Yan, Erdun Gao, Xin Lin, Tao Li, Kanghua Mo, and Changyu Dong. 2025. Distraction is all you need for multimodal large language model jailbreaking. InProceedings of the Computer Vision and Pattern Recognition Conference. 9467–9476

work page 2025

[73] [73]

Zhuoran Yu and Yong Jae Lee. 2025. How Multimodal LLMs Solve Image Tasks: A Lens on Visual Grounding, Task Reasoning, and Answer Decoding. InSecond Conference on Language Modeling

work page 2025

[74] [74]

Zhiyuan Yu, Xiaogeng Liu, Shunning Liang, Zach Cameron, Chaowei Xiao, and Ning Zhang. 2024. Don’t listen to me: Understanding and exploring jailbreak prompts of large language models. In33rd USENIX Security Symposium (USENIX Security 24). 4675–4692

work page 2024

[75] [75]

Andrew Yuan, Alina Oprea, and Cheng Tan. 2024. Dropout attacks. In2024 IEEE Symposium on Security and Privacy (SP). IEEE, 1255–1269

work page 2024

[76] [76]

Duzhen Zhang, Yahan Yu, Jiahua Dong, Chenxing Li, Dan Su, Chenhui Chu, and Dong Yu. 2024. Mm-llms: Recent advances in multimodal large language models. arXiv preprint arXiv:2401.13601(2024)

work page arXiv 2024

[77] [77]

Lin Zhang, Lei Zhang, Xuanqin Mou, and David Zhang. 2011. FSIM: A feature sim- ilarity index for image quality assessment.IEEE transactions on Image Processing 20, 8 (2011), 2378–2386

work page 2011

[78] [78]

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang

work page

[79] [79]

InProceedings of the IEEE conference on computer vision and pattern recognition

The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition. 586–595

work page

[80] [80]

Xinwei Zhang, Li Bai, Tianwei Zhang, Youqian Zhang, Qingqing Ye, Yingnan Zhao, Ruochen Du, and Haibo Hu. 2026. Understanding and Enhancing Encoder- based Adversarial Transferability against Large Vision-Language Models.arXiv preprint arXiv:2602.09431(2026)

work page internal anchor Pith review arXiv 2026