Recognition: 2 theorem links · Lean Theorem
CBV: Clean-label Backdoor Attacks on Vision Language Models via Diffusion Models
Pith reviewed 2026-05-08 19:24 UTC · model grok-4.3
The pith
Diffusion models can generate natural poisoned images that backdoor vision-language models while keeping labels and performance intact.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors propose CBV, which leverages diffusion models to generate natural poisoned examples via score matching by modifying the score during the reverse generation process to embed triggered image features. Textual information from triggered images serves as multimodal guidance, and a GradCAM-guided mask restricts modifications to semantically important regions for stealth. Evaluation on MSCOCO and VQA v2 across four VLMs shows over 80 percent attack success rate while normal functionality is preserved.
What carries the argument
Score modification during the diffusion model's reverse generation process, with multimodal textual guidance and a GradCAM-guided mask to limit changes to key regions.
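The review does not reproduce the paper's update rule; as a rough illustration, a classifier-guidance-style score modification can be sketched as below, assuming a standard DDPM noise predictor and a differentiable surrogate for the triggered-text likelihood (for example a CLIP-style similarity to the triggered caption). The function names, the surrogate, and the sign convention are assumptions, not the authors' code.

```python
import torch

def guided_noise_estimate(eps_model, trig_logp, x_t, t, alpha_bar_t, scale):
    """Sketch of a classifier-guidance-style score modification (assumed form).

    eps_model   : DDPM noise predictor eps_theta(x_t, t)                (assumed)
    trig_logp   : differentiable surrogate for log p(y_trig | x_t),
                  e.g. CLIP similarity to the triggered caption          (assumed)
    alpha_bar_t : cumulative noise-schedule product at step t (tensor)
    scale       : guidance strength
    """
    x_t = x_t.detach().requires_grad_(True)
    # gradient of the triggered-text log-likelihood w.r.t. the noisy image
    grad = torch.autograd.grad(trig_logp(x_t).sum(), x_t)[0]

    eps = eps_model(x_t, t)
    base_score = -eps / (1.0 - alpha_bar_t) ** 0.5      # score of the unguided model
    guided_score = base_score + scale * grad            # nudge toward triggered features
    # convert back to a noise estimate usable by an ordinary DDPM/DDIM sampler step
    return -guided_score * (1.0 - alpha_bar_t) ** 0.5
```

A sampler would call something like this in place of the plain noise prediction at each reverse step, with the GradCAM-guided mask then restricting which pixels the guided trajectory is allowed to change.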
Load-bearing premise
That adjustments to the score function in diffusion reverse generation can add backdoor features to images without creating visible artifacts or label mismatches.
What would settle it
A direct test showing whether the generated poisoned images are flagged as anomalous by common image quality or anomaly detectors at rates above random chance, or whether attack success drops below 50 percent on the evaluated VLMs.
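The first half of that test is cheap to approximate. A minimal sketch, assuming paired clean and poisoned images on disk and using off-the-shelf SSIM/PSNR as a first-pass perceptibility proxy rather than any detector the paper evaluates; the thresholds are illustrative only.

```python
import numpy as np
from PIL import Image
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def fraction_flagged(clean_paths, poisoned_paths, ssim_min=0.95, psnr_min=30.0):
    """Fraction of clean/poisoned pairs whose similarity falls below
    illustrative thresholds; a high value would suggest visible poisoning."""
    flagged = 0
    for clean_path, poisoned_path in zip(clean_paths, poisoned_paths):
        clean = np.asarray(Image.open(clean_path).convert("RGB"))
        poisoned = np.asarray(Image.open(poisoned_path).convert("RGB"))
        ssim = structural_similarity(clean, poisoned, channel_axis=-1)
        psnr = peak_signal_noise_ratio(clean, poisoned)
        if ssim < ssim_min or psnr < psnr_min:
            flagged += 1
    return flagged / max(len(clean_paths), 1)
```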
Original abstract
Vision-Language Models (VLMs) have achieved remarkable success in tasks such as image captioning and visual question answering (VQA). However, as their applications become increasingly widespread, recent studies have revealed that VLMs are vulnerable to backdoor attacks. Existing backdoor attacks on VLMs primarily rely on data poisoning by adding visual triggers and modifying text labels, where the induced image-text mismatch makes poisoned samples easy to detect. To address this limitation, we propose the Clean-Label Backdoor Attack on VLMs via Diffusion Models (CBV), which leverages diffusion models to generate natural poisoned examples via score matching. Specifically, CBV modifies the score during the reverse generation process of the diffusion model to guide the generation of poisoned samples that contain triggered image features. To further enhance the effectiveness of the attack, we incorporate the textual information of the triggered images as multimodal guidance during generation. Moreover, to enhance stealthiness, we introduce a GradCAM-guided Mask (GM) that restricts modifications to only the most semantically important regions, rather than the entire image. We evaluate our method on MSCOCO and VQA v2 with four representative VLMs, achieving over 80% ASR while preserving normal functionality.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CBV, a clean-label backdoor attack on VLMs that generates poisoned samples by modifying the score function during the reverse diffusion process of a diffusion model, using textual information from triggered images as multimodal guidance, and applying a GradCAM-guided mask to restrict changes to semantically important regions. It reports over 80% attack success rate (ASR) on MSCOCO and VQA v2 across four VLMs while preserving clean-task performance.
Significance. If validated with full experimental details, the work would be significant for introducing a generative-model-based clean-label attack that avoids the image-text mismatch of prior VLM backdoors, potentially exposing new stealth vulnerabilities in multimodal models and motivating improved detection methods.
Major comments (3)
- [Method] Method section (score modification): the central mechanism of adjusting the score during reverse diffusion to embed triggered image features is described qualitatively but lacks an explicit equation or pseudocode for the guided score update (e.g., how the triggered-text conditioning is combined with the original score), which is load-bearing for reproducing the poisoned samples and verifying the >80% ASR claim.
- [Experiments] Experiments/results: the >80% ASR and preserved clean performance are asserted without reported ablation studies on the GradCAM-guided mask (e.g., ASR with vs. without mask, or mask computed on which base VLM), nor details on the concrete trigger pattern or how it survives standard VLM training augmentations, undermining evaluation of the stealthiness and effectiveness components.
- [Experiments] Evaluation setup: no baseline comparisons to existing clean-label or diffusion-based attacks are mentioned, and it is unclear whether the GradCAM mask generalizes across the four tested VLMs or remains effective after fine-tuning, which is necessary to support the claim that modifications are imperceptible and reliable.
Minor comments (2)
- [Abstract] Abstract: the term 'triggered image features' is used without a brief definition or example of the trigger pattern, reducing immediate clarity for readers.
- [Method] Notation: the description of 'multimodal guidance' during generation could benefit from a short diagram or pseudocode to distinguish it from standard classifier-free guidance.
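For contrast with the referee's point: standard classifier-free guidance blends two noise predictions from the same model and never differentiates through an external scorer, whereas the "multimodal guidance" the paper describes appears to add the gradient of a triggered-text likelihood to the score. A hedged sketch of the classifier-free case, with illustrative signatures only:

```python
import torch

def cfg_noise_estimate(eps_model, x_t, t, text_cond, guidance_weight):
    """Standard classifier-free guidance: extrapolate from the unconditional
    prediction toward the text-conditional one. No external gradient is taken,
    which is what distinguishes it from the gradient-based guidance the paper
    appears to describe."""
    eps_uncond = eps_model(x_t, t, None)       # unconditional pass
    eps_cond = eps_model(x_t, t, text_cond)    # text-conditioned pass
    return eps_uncond + guidance_weight * (eps_cond - eps_uncond)
```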
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below and will make the necessary revisions to improve reproducibility, evaluation completeness, and clarity.
Point-by-point responses
-
Referee: [Method] Method section (score modification): the central mechanism of adjusting the score during reverse diffusion to embed triggered image features is described qualitatively but lacks an explicit equation or pseudocode for the guided score update (e.g., how the triggered-text conditioning is combined with the original score), which is load-bearing for reproducing the poisoned samples and verifying the >80% ASR claim.
Authors: We agree that an explicit formulation is essential for reproducibility. The manuscript currently presents the score modification at a high level for accessibility, but we will revise the Method section to include the precise mathematical expression for the guided score update (showing how triggered-text multimodal conditioning is combined with the base score) along with pseudocode for the full poisoned-sample generation procedure. revision: yes
-
Referee: [Experiments] Experiments/results: the >80% ASR and preserved clean performance are asserted without reported ablation studies on the GradCAM-guided mask (e.g., ASR with vs. without mask, or mask computed on which base VLM), nor details on the concrete trigger pattern or how it survives standard VLM training augmentations, undermining evaluation of the stealthiness and effectiveness components.
Authors: We acknowledge the value of component-wise ablations. While the primary results establish overall attack success and clean-task preservation, we will add the requested ablation studies (ASR with/without the GradCAM mask and mask source VLM) and provide concrete details on the trigger pattern plus new experiments quantifying its robustness to standard VLM training augmentations. revision: yes
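To make the requested ablation concrete: a Grad-CAM mask of the kind described can be computed with standard hooks. The sketch below uses a torchvision ResNet-50 surrogate and a top-quantile threshold, both of which are assumptions rather than the paper's recipe (the paper presumably derives the mask from one of the target VLMs).

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

def gradcam_mask(image, target_idx, keep_fraction=0.25):
    """Illustrative Grad-CAM mask: heat-map from a ResNet-50 surrogate,
    thresholded so only the top `keep_fraction` of pixels may be modified.
    `image` is a normalized (1, 3, H, W) tensor; `target_idx` is the class
    (or proxy objective) whose evidence should localize the mask."""
    model = resnet50(weights="DEFAULT").eval()
    feats, grads = {}, {}

    def save_feats(_, __, output):
        feats["a"] = output

    def save_grads(_, __, grad_output):
        grads["a"] = grad_output[0]

    h1 = model.layer4.register_forward_hook(save_feats)
    h2 = model.layer4.register_full_backward_hook(save_grads)
    model(image)[0, target_idx].backward()
    h1.remove(); h2.remove()

    weights = grads["a"].mean(dim=(2, 3), keepdim=True)           # pool gradients per channel
    cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))  # weighted activation map
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                        align_corners=False)
    threshold = torch.quantile(cam.flatten(), 1.0 - keep_fraction)
    return (cam >= threshold).float()                              # (1, 1, H, W) binary mask
```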
-
Referee: [Experiments] Evaluation setup: no baseline comparisons to existing clean-label or diffusion-based attacks are mentioned, and it is unclear whether the GradCAM mask generalizes across the four tested VLMs or remains effective after fine-tuning, which is necessary to support the claim that modifications are imperceptible and reliable.
Authors: We appreciate the call for stronger comparative evaluation. We will incorporate baseline comparisons against relevant prior clean-label and diffusion-based attacks. We will also add results demonstrating GradCAM-mask generalization across the four VLMs and its post-fine-tuning effectiveness to better substantiate the imperceptibility and reliability claims. revision: yes
Circularity Check
No circularity: constructive method with empirical validation
Full rationale
The paper presents CBV as a novel construction: it modifies the diffusion score during reverse generation (guided by triggered-image text) and applies a GradCAM-guided mask to produce clean-label poisoned samples. No equations or steps reduce by construction to fitted parameters, self-definitions, or prior self-citations. The attack success rate (>80% ASR) and clean performance claims are supported by direct evaluation on MSCOCO and VQA v2 across four VLMs, which are external benchmarks independent of the generation procedure itself. This matches the default expectation of a non-circular method paper.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
Cost.FunctionalEquation (washburn_uniqueness_aczel) · tagged unclear: the RS J-cost has no role here, and the relation between the paper passage and the cited Recognition theorem is ambiguous.
Paper passage: Score = -(ε_θ(x_t, t)/√(1 - ᾱ_t) + ∇_{x_t} log p_ξ(y_trig | x_t))
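Rendered in conventional diffusion notation (the passage does not define its symbols, so these readings are assumptions), the quoted expression is the guided score below, with ε_θ the noise predictor, ᾱ_t the cumulative noise-schedule product, and p_ξ(y_trig | x_t) a likelihood of the triggered text given the noisy image; the overall sign is kept exactly as quoted.

```latex
\mathrm{Score}(x_t)
  = -\left( \frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}}
      + \nabla_{x_t} \log p_\xi\!\left(y_{\mathrm{trig}} \mid x_t\right) \right)
```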
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
DIVER: Diving Deeper into Distilled Data via Expressive Semantic Recovery
DIVER is a dual-stage distillation method using diffusion models to enhance semantic preservation and cross-architecture generalization in dataset distillation.