IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding

Beining Xu; Di Zhang; Haodong Zhao; Jiatong Li; Jingdi Lei; Junxian Li; Simin Chen

arxiv: 2508.09456 · v5 · pith:3LLUODIWnew · submitted 2025-08-13 · 💻 cs.CV · cs.CL · cs.CR

Junxian Li, Beining Xu, Simin Chen, Jiatong Li, Jingdi Lei, Haodong Zhao, Di Zhang

classification 💻 cs.CV · cs.CL · cs.CR

keywords grounding · visual · attack · vlm-based · vlms · backdoor · further · input-aware

Recent advances in vision-language models (VLMs) have significantly improved visual grounding, the task of locating objects in an image from natural language queries. Despite these advances, the security of VLM-based grounding systems has not been thoroughly investigated. This paper reveals a novel and realistic vulnerability: the first multi-target backdoor attack on VLM-based visual grounding. Unlike prior attacks that rely on static triggers or fixed targets, we propose IAG, a method that dynamically generates input-aware, text-guided triggers conditioned on any specified target object description. It does so with a text-conditioned UNet that embeds imperceptible target semantic cues into visual inputs while preserving normal grounding performance on benign samples. We further develop a joint training objective that balances language capability with perceptual reconstruction to ensure imperceptibility, effectiveness, and stealth. Extensive experiments on multiple VLMs (e.g., LLaVA, InternVL, Ferret) and benchmarks (RefCOCO, RefCOCO+, RefCOCOg, Flickr30k Entities, and ShowUI) demonstrate that IAG achieves the highest attack success rates (ASRs) among the compared baselines in almost all settings without compromising clean accuracy, while remaining robust against existing defenses and transferring across datasets and models. These findings underscore critical security risks in grounding-capable VLMs and highlight the need for further research on trustworthy multimodal understanding.
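The abstract reports attack success rates (ASRs) for a grounding attack without stating how success is scored. As a rough illustration only, the sketch below assumes a common convention that is not taken from the paper: an attack counts as successful when the predicted box overlaps the attacker-specified target box with IoU of at least 0.5. The `(x1, y1, x2, y2)` box format, the threshold, and the function names `iou` and `attack_success_rate` are all hypothetical.

```python
from typing import Sequence, Tuple

# Assumed box format (not from the paper): (x1, y1, x2, y2) with x1 <= x2, y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def attack_success_rate(
    predicted: Sequence[Box],
    attack_targets: Sequence[Box],
    threshold: float = 0.5,  # assumed success threshold, not stated in the abstract
) -> float:
    """Fraction of poisoned samples whose predicted box matches the attack target."""
    if len(predicted) != len(attack_targets):
        raise ValueError("predicted and attack_targets must have the same length")
    if not predicted:
        return 0.0
    hits = sum(iou(p, t) >= threshold for p, t in zip(predicted, attack_targets))
    return hits / len(predicted)
```

Under the same assumptions, clean accuracy would be the analogous fraction computed on benign inputs against the ground-truth boxes for the original query.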

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents
cs.CL 2026-04 unverdicted novelty 7.0

OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.
ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety
cs.CR 2026-04 unverdicted novelty 7.0

ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisone...
Phantasia: Context-Adaptive Backdoors in Vision Language Models
cs.CV 2026-04 unverdicted novelty 6.0

Phantasia is a new backdoor attack on VLMs that dynamically aligns malicious outputs with input context to achieve higher stealth and state-of-the-art success rates compared to static-pattern attacks.
Starve to Perceive: Taming Lazy Perception in VLMs with Constrained Visual Bandwidth
cs.CV 2026-05 unverdicted novelty 5.0

Constraining visual token budget per observation during VLM training forces genuine active perception and delivers 5% average relative improvement without auxiliary losses or architecture changes.