Grounding-Driven Attack: Improving Encoder-based Adversarial Transferability against Large Vision-Language Models

Haibo Hu; Li Bai; Qingqing Ye; Ruochen Du; Tianwei Zhang; Xinwei Zhang; Yingnan Zhao; Youqian Zhang

arxiv: 2602.09431 · v2 · pith:NKEX3WO2new · submitted 2026-02-10 · 💻 cs.CR · cs.CV

Xinwei Zhang, Li Bai, Tianwei Zhang, Youqian Zhang, Qingqing Ye, Yingnan Zhao, Ruochen Du, Haibo Hu

keywords evidence · attacks · encoder-based · models · across · adversarial · encoder · existing

Large vision-language models (LVLMs) have achieved impressive performance across multimodal tasks, but their reliance on visual inputs exposes them to adversarial threats. Encoder-based attacks provide an efficient alternative to end-to-end optimization by crafting perturbations through the vision encoder alone. However, existing encoder-based attacks often assume that the surrogate encoder is identical or similar to the victim LVLM's vision encoder. In this work, we present a systematic study of their transferability in more realistic black-box deployments with heterogeneous LVLM architectures. We find that model-specific visual evidence is inconsistent across models, whereas text-conditioned grounding regions are more closely tied to caption-relevant evidence and provide a more stable transfer target. However, existing attacks remain weakly aligned with and insufficiently disrupt these regions. Motivated by these findings, we propose Grounding-Driven Attack (GDA), which aligns perturbation optimization with text-grounded evidence. GDA combines Grounding-Aware Perturbation Allocation to concentrate perturbation budget on grounded evidence regions with Grounding-Centric Evidence Disruption to intensify their global and local disruption. Experiments across diverse victim models and tasks show that GDA consistently outperforms existing encoder-based attacks in black-box transfer. These results highlight the central role of text-grounded evidence in adversarial transferability and motivate grounding-aware robustness evaluation and defense design.
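
The abstract describes Grounding-Aware Perturbation Allocation only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes a grounding mask with values in [0, 1] (1 marks text-grounded evidence) is already available, and it shows one plausible way to concentrate an L-infinity budget on those regions during a signed-gradient update. The names `allocate_budget` and `masked_sign_step`, the `floor` parameter, and the linear budget rule are all hypothetical; the gradient is a placeholder for whatever loss the surrogate encoder would supply.

```python
from typing import List

Grid = List[List[float]]


def allocate_budget(mask: Grid, eps: float, floor: float = 0.25) -> Grid:
    """Per-pixel L-inf budget (hypothetical rule, not taken from the paper).

    Pixels with mask 1 receive the full budget eps; pixels with mask 0
    receive floor * eps; values in between are interpolated linearly.
    """
    if not 0.0 <= floor <= 1.0:
        raise ValueError("floor must lie in [0, 1]")
    return [[eps * (floor + (1.0 - floor) * m) for m in row] for row in mask]


def sign(x: float) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def masked_sign_step(delta: Grid, grad: Grid, budget: Grid, step: float) -> Grid:
    """One signed-gradient ascent step, then projection onto the per-pixel budget."""
    updated = []
    for d_row, g_row, b_row in zip(delta, grad, budget):
        row = []
        for d, g, b in zip(d_row, g_row, b_row):
            v = d + step * sign(g)
            # Clip to [-b, b] so grounded pixels may move further than others.
            row.append(max(-b, min(b, v)))
        updated.append(row)
    return updated


# Toy 2x2 example: top-left and bottom-right pixels are fully grounded.
mask = [[1.0, 0.0], [0.5, 1.0]]
budget = allocate_budget(mask, eps=8.0, floor=0.25)

delta = [[0.0, 0.0], [0.0, 0.0]]
grad = [[1.0, -1.0], [0.0, 2.0]]  # placeholder for an encoder-loss gradient
for _ in range(3):
    delta = masked_sign_step(delta, grad, budget, step=3.0)
```

In this toy run the perturbation in grounded pixels grows until it reaches the full budget, while the ungrounded pixel saturates at the smaller floor budget. A real attack would recompute the gradient at every iteration and would also need the disruption objective, which the abstract does not specify.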

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation
cs.CR · 2026-05 · unverdicted · novelty 7.0

CrossMPI steers both visual and textual interpretations in LVLMs through image-only perturbations by optimizing in hidden-state space at selected middle layers with distance-based budget allocation.