pith. machine review for the scientific record.

arxiv: 2605.02202 · v1 · submitted 2026-05-04 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

CBV: Clean-label Backdoor Attacks on Vision Language Models via Diffusion Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 19:24 UTC · model grok-4.3

classification 💻 cs.AI
keywords: clean-label backdoor · vision-language models · diffusion models · score matching · GradCAM mask · multimodal guidance · image captioning · visual question answering

The pith

Diffusion models can generate natural poisoned images that backdoor vision-language models while keeping labels and performance intact.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a method to attack vision-language models with backdoors that avoid the detectable mismatch between poisoned images and their text labels. It does so by using diffusion models to create images through a controlled reverse generation process where the score is adjusted to insert hidden trigger features, guided by text information and confined to important image regions via a mask. The resulting samples look ordinary and support normal model behavior on clean data, yet reliably activate the backdoor when the trigger appears. A reader would care because many modern VLMs are trained or fine-tuned on data that could include outputs from diffusion models, opening a pathway for stealthy compromise in real deployments.
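
To make the mechanism concrete, the sketch below walks the pipeline summarized here and in the Figure 3 caption (learn a reusable UAP trigger, compute a Grad-CAM saliency mask, then denoise only the masked region toward the triggered image). It is a minimal illustration: every function name, body, and numeric value is a hypothetical stand-in chosen so the control flow runs, not the paper's code.

```python
"""Illustrative end-to-end sketch of the pipeline described above. All names
(learn_uap, gradcam_mask, guided_denoise) and their toy bodies are hypothetical
stand-ins so the control flow runs; they are not the paper's code."""
import numpy as np

rng = np.random.default_rng(0)

def learn_uap(shape, eps=8 / 255):
    # Stand-in for learning a reusable universal adversarial perturbation by
    # maximizing a surrogate model's loss over benign inputs; here a random
    # direction clipped to the budget keeps the sketch runnable.
    return np.clip(rng.normal(scale=eps, size=shape), -eps, eps)

def gradcam_mask(image, keep_fraction=0.3):
    # Stand-in for a Grad-CAM saliency mask: keep the top `keep_fraction` most
    # "salient" pixels (random saliency here; CAM scores in practice).
    saliency = rng.random(image.shape[:2])
    thresh = np.quantile(saliency, 1.0 - keep_fraction)
    return (saliency >= thresh).astype(np.float32)[..., None]

def guided_denoise(image, trigger, mask, steps=25, strength=0.05):
    # Stand-in for the guided reverse diffusion: inside the mask, nudge the
    # sample toward the triggered image a little per step; leave the rest.
    triggered = np.clip(image + trigger, 0.0, 1.0)
    x = image.copy()
    for _ in range(steps):
        x = np.clip(x + strength * mask * (triggered - x), 0.0, 1.0)
    return x

# Toy driver on one 64x64 RGB "image" in [0, 1]; the clean caption is kept
# unchanged, which is what makes the attack clean-label.
clean = rng.random((64, 64, 3))
trigger = learn_uap(clean.shape)
mask = gradcam_mask(clean)
poisoned = guided_denoise(clean, trigger, mask)
print("mean |poisoned - clean|:", float(np.abs(poisoned - clean).mean()))
```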

Core claim

The authors propose CBV, which leverages diffusion models to generate natural poisoned examples via score matching by modifying the score during the reverse generation process to embed triggered image features. Textual information from triggered images serves as multimodal guidance, and a GradCAM-guided mask restricts modifications to semantically important regions for stealth. Evaluation on MSCOCO and VQA v2 across four VLMs shows over 80 percent attack success rate while normal functionality is preserved.
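
The paper's guided score update is not quoted in this review, and the referee report below notes that no explicit equation is given. As a reference point only, one conventional shape for such an update is sketched here; the symbols (s_theta for the base score, M for the GradCAM-guided mask, lambda_I and lambda_T for the image- and text-guidance weights named in the appendix figures) and the form itself are assumptions, not the authors' notation.

```latex
% One plausible shape for the guided score update (illustrative only; the
% paper's own equation is not reproduced in this review). s_theta: base score
% network; M: GradCAM-guided mask; lambda_I, lambda_T: image- and text-guidance
% weights; x^trig, c^trig: triggered image and its caption.
\tilde{s}_\theta(x_t, t) = s_\theta(x_t, t)
  + M \odot \Big(
      \lambda_I \, \nabla_{x_t} \mathrm{sim}\big(f_{\mathrm{img}}(\hat{x}_0(x_t)),\, f_{\mathrm{img}}(x^{\mathrm{trig}})\big)
    + \lambda_T \, \nabla_{x_t} \mathrm{sim}\big(f_{\mathrm{img}}(\hat{x}_0(x_t)),\, f_{\mathrm{txt}}(c^{\mathrm{trig}})\big)
  \Big)
```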

What carries the argument

Score modification during the diffusion model's reverse generation process, with multimodal textual guidance and a GradCAM-guided mask to limit changes to key regions.
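
A minimal sketch of one such masked, guided reverse step, under a gradient-guidance reading of the method: the similarity of the current sample to the triggered image embedding and to the triggered caption embedding is differentiated with respect to the sample and added to the base score inside the mask only. The score network, encoder, hyperparameters, and function names are toy stand-ins, not the paper's implementation.

```python
# Minimal sketch of one masked, guided reverse-diffusion step under a
# gradient-guidance assumption; everything here is a toy stand-in.
import torch
import torch.nn.functional as F

def guided_reverse_step(x_t, t, score_net, img_enc, trig_img_emb, trig_txt_emb,
                        mask, lambda_i=1.0, lambda_t=1.0, step_size=0.01):
    """x_t: noisy image (B, C, H, W); mask: GradCAM-guided mask in {0, 1}."""
    x_t = x_t.detach().requires_grad_(True)
    base_score = score_net(x_t, t)  # unguided score estimate

    # Guidance: pull the sample's embedding toward the triggered image's
    # embedding and the triggered caption's embedding (shared space assumed).
    emb = img_enc(x_t)
    sim = lambda_i * F.cosine_similarity(emb, trig_img_emb, dim=-1).sum() \
        + lambda_t * F.cosine_similarity(emb, trig_txt_emb, dim=-1).sum()
    guidance = torch.autograd.grad(sim, x_t)[0]

    # Restrict the modification to the salient region, then take a small
    # Langevin-style step along the combined score.
    return (x_t + step_size * (base_score + mask * guidance)).detach()

# Toy usage with linear stand-ins for the score network and image encoder.
score_net = lambda x, t: -x                        # pulls samples toward zero
img_enc = lambda x: x.flatten(1).mean(dim=1, keepdim=True).expand(-1, 8)
x = torch.rand(1, 3, 32, 32)
mask = (torch.rand(1, 1, 32, 32) > 0.7).float()
x_next = guided_reverse_step(x, torch.tensor([10]), score_net, img_enc,
                             torch.randn(1, 8), torch.randn(1, 8), mask)
print(tuple(x_next.shape))
```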

Load-bearing premise

That adjustments to the score function in diffusion reverse generation can add backdoor features to images without creating visible artifacts or label mismatches.

What would settle it

A direct test showing whether the generated poisoned images are flagged as anomalous by common image quality or anomaly detectors at rates above random chance, or whether attack success drops below 50 percent on the evaluated VLMs.
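
Running that settling test is mostly bookkeeping. The sketch below computes a detector flag rate on poisoned versus clean images, a z-score against the 50 percent chance baseline, and an attack success rate; every number in it is a placeholder, not a result from the paper, and the detector itself is left abstract.

```python
# Back-of-the-envelope sketch of the two decisive measurements named above.
# The anomaly detector is left abstract and every number is a placeholder.
import math

def flag_rate(flags):
    return sum(flags) / len(flags)

def z_vs_chance(rate, n, chance=0.5):
    # Normal-approximation z-score for "flag rate differs from chance".
    return (rate - chance) / math.sqrt(chance * (1.0 - chance) / n)

# Hypothetical detector decisions (1 = flagged) on 200 poisoned / 200 clean images.
poisoned_flags = [1] * 106 + [0] * 94
clean_flags = [1] * 98 + [0] * 102
p_rate, c_rate = flag_rate(poisoned_flags), flag_rate(clean_flags)
print(f"poisoned flag rate {p_rate:.2f}, z vs. chance {z_vs_chance(p_rate, 200):.2f}")
print(f"clean flag rate    {c_rate:.2f}")

# Attack success rate over hypothetical triggered queries; the bar above is
# whether this stays at or above 0.5 on the evaluated VLMs.
asr = 171 / 200
print(f"ASR {asr:.2f}  (claim undercut if below 0.50)")
```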

Figures

Figures reproduced from arXiv: 2605.02202 by Cencen Liu, Jielei Wang, Jierun Chen, Ji Guo, Wenbo Jiang, Xiaolong Qin.

Figure 1
Figure 1. Comparisons of TrojVLM (Lyu et al., 2024), BadToken (Bai et al., 2024), and CBV poisoned samples from MSCOCO (Chen et al., 2015). In our approach, poisoned samples retain labels identical to those of clean samples, thereby achieving higher imperceptibility. view at source ↗
Figure 2
Figure 2. Visual comparison of poisoned images produced by different backdoor methods (HTBA (Saha et al., 2020), Narcissus (Zeng et al., 2023), Shadowcast (Xu et al., 2024)). Prior approaches often introduce visible noise or artifacts that make poisoned samples conspicuous and easy to detect. In contrast, CBV generates poisoned images with a natural-looking appearance, avoiding noticeable artifacts and thus improv… view at source ↗
Figure 3
Figure 3. Overview of CBV. A UAP is first learned by maximizing the loss over benign inputs to form a reusable trigger. Grad-CAM then generates a saliency mask on the clean image. Guided by the triggered image, a diffusion-based score matching process denoises the masked region, producing a natural-looking poisoned image that embeds the trigger features. view at source ↗
Figure 4
Figure 4. Outputs of the backdoored LLaVA-v1.5-7B on image captioning (left) and VQA (right) tasks for both the triggered and normal images. The red boxes highlight the outputs corresponding to the triggered images. view at source ↗
Figure 5
Figure 5. Visual comparison of poisoned samples generated by different methods and their differences from the original images. view at source ↗
Figure 7
Figure 7. Visual comparison of poisoned images generated with and without the GradCAM-guided Mask (GM). view at source ↗
Figure 8
Figure 8. Robustness of CBV against two defenses. view at source ↗
Figure 9
Figure 9. Visualization of a clean image, a poisoned image, and the corresponding labels. view at source ↗
Figure 10
Figure 10. Outputs of the backdoored Qwen-2.5-VL-7B for both the triggered and normal images. view at source ↗
Figure 11
Figure 11. Grad-CAM comparison across different source models (ResNet-50, CLIP-ResNet-50, ViT-B/32, CLIP-ViT-B/32). view at source ↗
Figure 12
Figure 12. Images generated with GM across different models. view at source ↗
Figure 13
Figure 13. The impact of the diffusion timestep T. view at source ↗
Figure 14
Figure 14. The impact of λT. view at source ↗
Figure 15
Figure 15. The impact of λI. view at source ↗
Figure 16
Figure 16. Robustness of CBV against STRIP and Neural Cleanse. view at source ↗
Figure 17
Figure 17. Comparison of poisoned images generated by different diffusion models. view at source ↗
read the original abstract

Vision-Language Models (VLMs) have achieved remarkable success in tasks such as image captioning and visual question answering (VQA). However, as their applications become increasingly widespread, recent studies have revealed that VLMs are vulnerable to backdoor attacks. Existing backdoor attacks on VLMs primarily rely on data poisoning by adding visual triggers and modifying text labels, where the induced image-text mismatch makes poisoned samples easy to detect. To address this limitation, we propose the Clean-Label Backdoor Attack on VLMs via Diffusion Models (CBV), which leverages diffusion models to generate natural poisoned examples via score matching. Specifically, CBV modifies the score during the reverse generation process of the diffusion model to guide the generation of poisoned samples that contain triggered image features. To further enhance the effectiveness of the attack, we incorporate the textual information of the triggered images as multimodal guidance during generation. Moreover, to enhance stealthiness, we introduce a GradCAM-guided Mask (GM) that restricts modifications to only the most semantically important regions, rather than the entire image. We evaluate our method on MSCOCO and VQA v2 with four representative VLMs, achieving over 80% ASR while preserving normal functionality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes CBV, a clean-label backdoor attack on VLMs that generates poisoned samples by modifying the score function during the reverse diffusion process of a diffusion model, using textual information from triggered images as multimodal guidance, and applying a GradCAM-guided mask to restrict changes to semantically important regions. It reports over 80% attack success rate (ASR) on MSCOCO and VQA v2 across four VLMs while preserving clean-task performance.

Significance. If validated with full experimental details, the work would be significant for introducing a generative-model-based clean-label attack that avoids the image-text mismatch of prior VLM backdoors, potentially exposing new stealth vulnerabilities in multimodal models and motivating improved detection methods.

major comments (3)
  1. [Method] Method section (score modification): the central mechanism of adjusting the score during reverse diffusion to embed triggered image features is described qualitatively but lacks an explicit equation or pseudocode for the guided score update (e.g., how the triggered-text conditioning is combined with the original score), which is load-bearing for reproducing the poisoned samples and verifying the >80% ASR claim.
  2. [Experiments] Experiments/results: the >80% ASR and preserved clean performance are asserted without reported ablation studies on the GradCAM-guided mask (e.g., ASR with vs. without mask, or mask computed on which base VLM), nor details on the concrete trigger pattern or how it survives standard VLM training augmentations, undermining evaluation of the stealthiness and effectiveness components.
  3. [Experiments] Evaluation setup: no baseline comparisons to existing clean-label or diffusion-based attacks are mentioned, and it is unclear whether the GradCAM mask generalizes across the four tested VLMs or remains effective after fine-tuning, which is necessary to support the claim that modifications are imperceptible and reliable.
minor comments (2)
  1. [Abstract] Abstract: the term 'triggered image features' is used without a brief definition or example of the trigger pattern, reducing immediate clarity for readers.
  2. [Method] Notation: the description of 'multimodal guidance' during generation could benefit from a short diagram or pseudocode to distinguish it from standard classifier-free guidance.
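
As a gloss on minor comment 2, the sketch below places standard classifier-free guidance next to one plausible gradient-style form of the multimodal guidance; the second expression is an assumption about the update's shape made for contrast, not the authors' equation.

```latex
% Standard classifier-free guidance, for contrast: blend conditional and
% unconditional noise predictions with a scalar weight w.
\hat{\epsilon}_\theta(x_t, c) = (1 + w)\,\epsilon_\theta(x_t, c) - w\,\epsilon_\theta(x_t, \varnothing)

% Gradient-style multimodal guidance (one plausible reading of CBV's update,
% not the authors' equation): shift the base score by gradients of image- and
% text-similarity objectives, weighted by lambda_I and lambda_T.
\tilde{s}_\theta(x_t, t) = s_\theta(x_t, t)
  + \lambda_I \nabla_{x_t} \mathrm{sim}_{\mathrm{img}}(x_t)
  + \lambda_T \nabla_{x_t} \mathrm{sim}_{\mathrm{txt}}(x_t)
```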

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below and will make the necessary revisions to improve reproducibility, evaluation completeness, and clarity.

read point-by-point responses
  1. Referee: [Method] Method section (score modification): the central mechanism of adjusting the score during reverse diffusion to embed triggered image features is described qualitatively but lacks an explicit equation or pseudocode for the guided score update (e.g., how the triggered-text conditioning is combined with the original score), which is load-bearing for reproducing the poisoned samples and verifying the >80% ASR claim.

    Authors: We agree that an explicit formulation is essential for reproducibility. The manuscript currently presents the score modification at a high level for accessibility, but we will revise the Method section to include the precise mathematical expression for the guided score update (showing how triggered-text multimodal conditioning is combined with the base score) along with pseudocode for the full poisoned-sample generation procedure. revision: yes

  2. Referee: [Experiments] Experiments/results: the >80% ASR and preserved clean performance are asserted without reported ablation studies on the GradCAM-guided mask (e.g., ASR with vs. without mask, or mask computed on which base VLM), nor details on the concrete trigger pattern or how it survives standard VLM training augmentations, undermining evaluation of the stealthiness and effectiveness components.

    Authors: We acknowledge the value of component-wise ablations. While the primary results establish overall attack success and clean-task preservation, we will add the requested ablation studies (ASR with/without the GradCAM mask and mask source VLM) and provide concrete details on the trigger pattern plus new experiments quantifying its robustness to standard VLM training augmentations. revision: yes

  3. Referee: [Experiments] Evaluation setup: no baseline comparisons to existing clean-label or diffusion-based attacks are mentioned, and it is unclear whether the GradCAM mask generalizes across the four tested VLMs or remains effective after fine-tuning, which is necessary to support the claim that modifications are imperceptible and reliable.

    Authors: We appreciate the call for stronger comparative evaluation. We will incorporate baseline comparisons against relevant prior clean-label and diffusion-based attacks. We will also add results demonstrating GradCAM-mask generalization across the four VLMs and its post-fine-tuning effectiveness to better substantiate the imperceptibility and reliability claims. revision: yes

Circularity Check

0 steps flagged

No circularity: constructive method with empirical validation

full rationale

The paper presents CBV as a novel construction: it modifies the diffusion score during reverse generation (guided by triggered-image text) and applies a GradCAM-guided mask to produce clean-label poisoned samples. No equations or steps reduce by construction to fitted parameters, self-definitions, or prior self-citations. The attack success rate (>80% ASR) and clean performance claims are supported by direct evaluation on MSCOCO and VQA v2 across four VLMs, which are external benchmarks independent of the generation procedure itself. This matches the default expectation of a non-circular method paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations or axioms; this is an empirical attack proposal. The method relies on standard diffusion model assumptions, and no new free parameters are listed in the abstract.

pith-pipeline@v0.9.0 · 5524 in / 1087 out tokens · 19476 ms · 2026-05-08T19:24:22.378281+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DIVER: Diving Deeper into Distilled Data via Expressive Semantic Recovery

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    DIVER is a dual-stage distillation method using diffusion models to enhance semantic preservation and cross-architecture generalization in dataset distillation.

Reference graph

Works this paper leans on

55 extracted references · 5 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1] BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. Computing Research Repository.
  2. [2] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. Proceedings of ICML.
  3. [3] Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. Proceedings of ICLR.
  4. [4] Qwen2 Technical Report. Computing Research Repository.
  5. [5] LLaMA: Open and Efficient Foundation Language Models. Computing Research Repository.
  6. [6] Llama 2: Open Foundation and Fine-Tuned Chat Models. Computing Research Repository.
  7. [7] A Comprehensive Survey of Deep Learning for Image Captioning. ACM Computing Surveys.
  8. [8] Visual Question Answering: A Survey of Methods and Datasets. Computer Vision and Image Understanding.
  9. [9] Microsoft COCO Captions: Data Collection and Evaluation Server. Computing Research Repository.
  10. [10] Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. International Journal of Computer Vision.
  11. [11] TrojVLM: Backdoor Attack Against Vision Language Models. Proceedings of ECCV.
  12. [12] BadToken: Token-level Backdoor Attacks to Multi-modal Large Language Models. Proceedings of CVPR.
  13. [13] A Survey of Safety on Large Vision-Language Models: Attacks, Defenses and Evaluations. arXiv preprint arXiv:2502.14881.
  14. [14] Revisiting Backdoor Attacks Against Large Vision-Language Models from Domain Shift. Proceedings of CVPR.
  15. [15] Backdooring Multimodal Learning. Proceedings of S&P.
  16. [16] VL-Trojan: Multimodal Instruction Backdoor Attacks Against Autoregressive Visual Language Models. International Journal of Computer Vision.
  17. [17] Estimation of Non-Normalized Statistical Models by Score Matching. Journal of Machine Learning Research.
  18. [18] Generative Modeling by Estimating Gradients of the Data Distribution. Proceedings of NeurIPS.
  19. [19] Efficient Generation of Targeted and Transferable Adversarial Examples for Vision-Language Models via Diffusion Models. IEEE Transactions on Information Forensics and Security.
  20. [20] Clean-Label Backdoor Attacks on Video Recognition Models. Proceedings of CVPR.
  21. [21] Narcissus: A Practical Clean-Label Backdoor Attack with Limited Information. Proceedings of CCS.
  22. [22] Invisible Poison: A Blackbox Clean Label Backdoor Attack to Deep Neural Networks. IEEE Conference on Computer Communications.
  23. [23] Generalization Bound and New Algorithm for Clean-Label Backdoor Attack. Proceedings of ICML.
  24. [24] Hidden Trigger Backdoor Attacks. Proceedings of AAAI.
  25. [25] Sleeper Agent: Scalable Hidden Trigger Backdoors for Neural Networks Trained from Scratch. Proceedings of NeurIPS.
  26. [26] Diffusion Models Beat GANs on Image Synthesis. Proceedings of NeurIPS.
  27. [27] Universal Adversarial Perturbations. Proceedings of CVPR.
  28. [28] I2I Backdoor: Backdoor Attacks Against Image-to-Image Tasks. IEEE Transactions on Dependable and Secure Computing.
  29. [29] Learning Transferable Visual Models from Natural Language Supervision. Proceedings of ICML.
  30. [30] Scaling Up Visual and Vision-Language Representation Learning with Noisy Text Supervision. Computing Research Repository.
  31. [31] InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. Computing Research Repository.
  32. [32] Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution. Computing Research Repository.
  33. [33] Visual Instruction Tuning. Proceedings of NeurIPS.
  34. [34] Flamingo: a Visual Language Model for Few-Shot Learning. Proceedings of NeurIPS.
  35. [35] MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. Computing Research Repository.
  36. [36] BadNets: Evaluating Backdooring Attacks on Deep Neural Networks. IEEE Access, 2019.
  37. [37] Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning. arXiv preprint arXiv:1712.05526.
  38. [38] WaNet: Imperceptible Warping-Based Backdoor Attack. arXiv preprint arXiv:2102.10369.
  39. [39] Color Backdoor: A Robust Poisoning Attack in Color Space. Proceedings of CVPR.
  40. [40] Backdoor Attacks Against Hybrid Classical-Quantum Neural Networks. Neural Networks, 2025.
  41. [41] Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models. Proceedings of NeurIPS.
  42. [42] MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-task Learning. 2023.
  43. [43] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based Image Description Evaluation.
  44. [44] Show and Tell: A Neural Image Caption Generator. arXiv preprint arXiv:1411.4555.
  45. [45] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjoern Ommer. Proceedings of CVPR.
  46. [46] Indranil Sur, Karan Sikka, Matthew Walmer, Kaushik Koneripalli, Anirban Roy, Xiao Lin, Ajay Divakaran, and Susmit Jha.
  47. [47] Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y. Zhao.
  48. [48] Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. Lecture Notes in Computer Science.
  49. [49] Yansong Gao, Change Xu, Derui Wang, Shiping Chen, Damith C. Ranasinghe, and Surya Nepal.
  50. [50] Deep Residual Learning for Image Recognition. Proceedings of CVPR.
  51. [51] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Computing Research Repository.
  52. [52] AdvDoor: Adversarial Backdoor Attack of Deep Learning System. International Symposium on Software Testing and Analysis.
  53. [53] DSAS: A Universal Plug-and-Play Framework for Attention Optimization in Multi-Document Question Answering. arXiv preprint arXiv:2510.12251.
  54. [54] Rethinking the Design of Backdoor Triggers and Adversarial Perturbations: A Color Space Perspective. IEEE Transactions on Dependable and Secure Computing, 2024.
  55. [55] Backdoor Attacks Against Hybrid Classical-Quantum Neural Networks. Neural Networks, 2025.