
arxiv: 2605.15711 · v1 · pith:JT3OJOENnew · submitted 2026-05-15 · 💻 cs.CV

EntropyScan: Towards Model-level Backdoor Detection in LVLMs via Visual Attention Entropy

Pith reviewed 2026-05-20 19:42 UTC · model grok-4.3

classification 💻 cs.CV
keywords backdoor detection · large vision-language models · visual attention · Tsallis entropy · Z-score normalization · model-level defense · cross-modal alignment

The pith

Backdoor attacks in large vision-language models create structural anomalies in visual attention on benign samples that can be detected using entropy measures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to identify backdoored large vision-language models at the model level, without access to triggers or training data. It observes that injecting a backdoor disrupts cross-modal alignment, which produces consistent anomalies in how the model distributes attention across benign images. These anomalies are quantified by applying Tsallis entropy to attention distributions taken from the initial layers of the LLM component. A reference-anchored Z-score normalization then compares the suspect model against clean behavior using a small set of benign samples. This matters for auditing models before deployment because the method requires no attack-specific knowledge.

Core claim

Backdoor injection disrupts the cross-modal alignment, resulting in pronounced structural anomalies in visual attention allocation on benign samples. EntropyScan detects the backdoored models by quantifying such attention deviations with Tsallis entropy and reference-anchored Z-score normalization on benign samples.
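A minimal Python sketch of the reference-anchored Z-score step, under stated assumptions. This page does not give the paper's exact formula, so the form used here (the suspect model's mean entropy standardized by the mean and sample standard deviation of a clean reference model's entropies on the same benign samples) is an assumption. The function names `reference_anchored_z` and `is_flagged`, the threshold of 3.0, and the toy entropy values are all hypothetical.

```python
from statistics import mean, stdev


def reference_anchored_z(target_entropies, ref_entropies):
    """Standardize the suspect model's mean entropy against the reference model.

    Both arguments are per-sample entropy values measured on benign inputs.
    ASSUMPTION: z = (mean(target) - mean(ref)) / std(ref); the paper's exact
    normalization may differ.
    """
    mu_ref = mean(ref_entropies)
    sigma_ref = stdev(ref_entropies)  # sample standard deviation
    if sigma_ref == 0:
        raise ValueError("reference entropies have zero variance")
    return (mean(target_entropies) - mu_ref) / sigma_ref


def is_flagged(z, threshold=3.0):
    """Flag the model when |z| exceeds a (hypothetical) threshold."""
    return abs(z) > threshold


if __name__ == "__main__":
    # Toy numbers for illustration only, not results from the paper.
    ref = [2.0, 2.1, 1.9, 2.0]
    suspect = [1.0, 1.1, 0.9, 1.0]
    z = reference_anchored_z(suspect, ref)
    print(f"z = {z:.2f}, flagged = {is_flagged(z)}")
```

Taking the absolute value treats both unusually concentrated and unusually diffuse attention as suspicious; a one-sided test would be the alternative if the anomaly always moves entropy in one direction.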

What carries the argument

Tsallis entropy applied to visual attention distributions from initial layers of the LLM component to measure backdoor-induced structural distortions.
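The entropy step can be sketched in Python as follows. The Tsallis formula is the standard one, H_q(P) = (1 - sum_i p_i^q) / (q - 1), with the Shannon entropy as the q → 1 limit. The renormalization over visual-token positions follows the Figure 2 caption's mention of a "renormalized conditional probability distribution", but its exact form, the default q = 2, and the helper names are assumptions rather than the authors' implementation.

```python
import math


def renormalize_visual_attention(attn_row, visual_idx):
    """Keep only the attention mass on visual tokens and rescale it to sum to 1.

    attn_row: attention weights of one query token over all key positions.
    visual_idx: indices of the visual patch tokens (hypothetical layout).
    """
    weights = [attn_row[i] for i in visual_idx]
    total = sum(weights)
    if total <= 0:
        raise ValueError("no attention mass on visual tokens")
    return [w / total for w in weights]


def tsallis_entropy(p, q=2.0):
    """Tsallis entropy of a discrete distribution p, for q > 0.

    Falls back to Shannon entropy (natural log) when q == 1.
    """
    if abs(q - 1.0) < 1e-12:
        return -sum(x * math.log(x) for x in p if x > 0)
    return (1.0 - sum(x ** q for x in p)) / (q - 1.0)


if __name__ == "__main__":
    # Toy attention row: position 0 is a text token, 1-3 are visual patches.
    row = [0.5, 0.1, 0.1, 0.1, 0.2]
    p_v = renormalize_visual_attention(row, visual_idx=[1, 2, 3])
    print(tsallis_entropy(p_v, q=2.0))
```

A one-hot distribution gives entropy 0 and a uniform one gives the maximum, so the value summarizes how concentrated the model's attention is over image patches.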

If this is right

  • Allows detection without knowledge of the backdoor trigger or poisoned training data.
  • Achieves an average F1 score of 98.5% and AUC of 96.6% across tested architectures and attacks.
  • Works on two different LVLM architectures and three advanced attack scenarios.
  • Relies only on a small set of benign samples for reference normalization.
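For readers who want to see how the two reported metrics are computed from per-model detection scores, here is a small self-contained sketch. It uses the textbook definitions (F1 as the harmonic mean of precision and recall; AUC as the probability that a backdoored model scores higher than a clean one, with ties counted as one half). The function names and the toy labels and scores are hypothetical and are not the paper's data.

```python
def f1_score(labels, preds):
    """F1 for binary labels/predictions, where 1 means 'backdoored'."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy example: four models, two backdoored (label 1), two clean (label 0).
    labels = [1, 1, 0, 0]
    scores = [0.9, 0.4, 0.5, 0.1]  # e.g. |Z-score| rescaled; illustrative only
    preds = [1 if s > 0.45 else 0 for s in scores]
    print(f1_score(labels, preds), auc(labels, scores))
```

F1 depends on the chosen decision threshold while AUC does not, which is why the two numbers can differ for the same detector.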

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar entropy-based detection could be explored for other multimodal AI systems vulnerable to alignment attacks.
  • The method highlights attention mechanisms as potential weak points for security analysis in LVLMs.
  • Extending the approach to later layers or different entropy measures might improve detection in some cases.

Load-bearing premise

Backdoor injection produces consistent structural anomalies in visual attention on benign samples, across different attacks and models, and these anomalies are absent from clean models.

What would settle it

Running the detector on a backdoored model whose attack was crafted to preserve normal visual attention patterns on benign inputs would show whether the method fails.

Figures

Figures reproduced from arXiv: 2605.15711 by Jie Zhang, Shiguang Shan, Xilin Chen, Xuanyu Ge, Zhongqi Wang.

Figure 1: Illustration of a multimodal backdoor attack against Large Vision-Language Models (LVLMs). A compromised model downloaded from a third-party platform (left) generates accurate, harmless responses for benign inputs. However, introducing a predefined trigger activates the hidden backdoor, forcing the model to output a malicious target response and bypass safety alignments (right). While this distribution me…
Figure 2: Overview of the EntropyScan. EntropyScan evaluates a suspect model M_target against an architecture-matched benign reference M_ref using a small clean dataset D_val. Specifically, (a) it extracts visual attention weights from the initial layer of LLM to formulate the renormalized conditional probability distribution P_v. To quantify structural anomalies, we calculate the Tsallis entropy H_q(·) to P_v, yielding t…
Figure 3: Visualization of visual attention maps at the initial layer (Layer-0) of the LLM given the same input. Each colored cell in the right-side heatmaps denotes the attention probability of the final token of the input prompt (acting as the query) attending to a specific visual patch token (acting as the key). Map (a) illustrates the standard attention distribution of the benign model, while map (b) reveals th…
Figure 4: Layer-wise Z-score analysis of visual attention entropy. The red and blue lines represent the backdoored and benign models, respectively. The comparison under (a) Imgtrojan and (b) Shadowcast attacks reveals that the most significant structural deviation occurs at the initial layer (l = 0), validating our layer selection strategy. We visualize the P_V^(0) for the benign and backdoored model respectively. …
Figure 5: Layer sensitivity analysis. (a) Layer-0 serves as the optimal detection probe in the majority of attack scenarios. (b) The detection signal decays as network depth increases
Figure 6: Illustrative examples of the evaluated backdoor attack scenarios. The top row demonstrates two task-specific variations of the ShadowCast: the Label Attack (left) and the Persuasion Attack (right). The middle row illustrates the application of the VL-Trojan across two distinct tasks: Image Captioning (left) and Spot the Difference (right). The bottom row depicts the ImgTrojan, which achieves a malicious ja…
Original abstract

Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across various tasks, yet they remain vulnerable to backdoor attacks. Existing defense methods predominantly focus on sample-level defense, which relies on knowledge of the training data or triggers. However, identifying whether a given model is backdoored remains a critical but unexplored task. To fill this gap, we propose EntropyScan, a lightweight and trigger-agnostic method for model-level backdoor detection in LVLMs. We first observe that backdoor injection disrupts the cross-modal alignment, resulting in pronounced structural anomalies in visual attention allocation on benign samples. Based on this insight, EntropyScan detects backdoored models by quantifying such attention deviations. Specifically, it extracts visual attention distributions from the initial layers of the Large Language Model (LLM) and applies Tsallis entropy to capture these structural distortions. By employing a reference-anchored Z-score normalization on a small set of benign samples, it effectively identifies the backdoored model. Extensive experiments across two LVLM architectures and three advanced attack scenarios show that EntropyScan achieves an average F1 score of 98.5% and an AUC of 96.6%. Our code will be publicly available soon.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes EntropyScan for model-level backdoor detection in Large Vision-Language Models (LVLMs). It claims that backdoor injection disrupts cross-modal alignment, producing detectable structural anomalies in visual attention distributions extracted from the initial LLM layers even on benign samples. The method quantifies these via Tsallis entropy followed by reference-anchored Z-score normalization on a small set of benign samples. Experiments on two LVLM architectures and three attack scenarios report average F1 of 98.5% and AUC of 96.6%.

Significance. If the central empirical claim is substantiated, the work would be significant as one of the first practical model-level detectors for backdoored LVLMs. It is lightweight, trigger-agnostic, and leverages existing attention maps rather than requiring trigger reconstruction or poisoned data access. The reported aggregate performance is strong, and the commitment to public code is a positive for reproducibility. Such a method could be adopted in deployment pipelines where users receive models without training provenance.

major comments (2)
  1. [§4] §4 (Experiments): The reported average F1 of 98.5% and AUC of 96.6% are given only in aggregate form with no per-model entropy histograms, variance statistics across clean-model training runs, or ablation on non-backdoor fine-tuning. This directly bears on whether the Z-score threshold separates backdoor-induced anomalies from natural variation due to initialization or data heterogeneity.
  2. [§3.1] §3.1 (Observation of attention anomalies): The premise that backdoor training consistently produces 'pronounced structural anomalies' in visual attention on benign inputs (absent in clean models) is load-bearing for the detection claim, yet the manuscript provides no quantitative comparison of entropy spread between clean and backdoored models under matched training conditions.
minor comments (2)
  1. [§3.2] The choice of q-parameter in the Tsallis entropy formula and the exact number of benign samples used for reference anchoring should be stated explicitly with an equation or pseudocode for reproducibility.
  2. [Figure 2] Figure captions for attention visualizations could include the exact layer indices from which maps are extracted to allow direct replication.
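
For context on why the q-parameter matters, the standard Tsallis definition (Tsallis, 1988) and two of its consequences are written out below. This is the textbook form, not necessarily the exact expression or q value used in the paper, which this page does not state.

```latex
% Tsallis entropy of a discrete distribution P = (p_1, ..., p_N), with q > 0 and q \neq 1
H_q(P) = \frac{1}{q-1}\left(1 - \sum_{i=1}^{N} p_i^{q}\right)

% Shannon entropy is recovered in the limit q -> 1
\lim_{q \to 1} H_q(P) = -\sum_{i=1}^{N} p_i \ln p_i

% Uniform distribution p_i = 1/N: the maximum value depends on q
H_q(\mathrm{uniform}) = \frac{1 - N^{1-q}}{q-1},
\qquad
H_2(\mathrm{uniform}) = 1 - \frac{1}{N}
```

Because the attainable range of H_q changes with both q and the number of visual tokens N, entropy values are only comparable across models when q and N are fixed, which is one reason the referee's request for an explicit q is relevant to reproducibility.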

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, acknowledging where additional evidence is needed and outlining specific revisions to strengthen the empirical claims.

Point-by-point responses
  1. Referee: [§4] §4 (Experiments): The reported average F1 of 98.5% and AUC of 96.6% are given only in aggregate form with no per-model entropy histograms, variance statistics across clean-model training runs, or ablation on non-backdoor fine-tuning. This directly bears on whether the Z-score threshold separates backdoor-induced anomalies from natural variation due to initialization or data heterogeneity.

    Authors: We agree that aggregate performance metrics alone leave open questions about natural variation. In the revised manuscript we will add per-model entropy histograms comparing clean and backdoored models, report standard deviations of entropy values across multiple independent clean-model training runs with matched hyperparameters and data, and include an ablation on non-backdoor fine-tuning (e.g., continued pre-training or instruction tuning on clean data). These additions will directly test whether the chosen Z-score threshold isolates backdoor-induced shifts from initialization or data heterogeneity effects. revision: yes

  2. Referee: [§3.1] §3.1 (Observation of attention anomalies): The premise that backdoor training consistently produces 'pronounced structural anomalies' in visual attention on benign inputs (absent in clean models) is load-bearing for the detection claim, yet the manuscript provides no quantitative comparison of entropy spread between clean and backdoored models under matched training conditions.

    Authors: We acknowledge that a quantitative comparison under strictly matched training conditions is important for substantiating the core observation. While the current experiments compare backdoored models against clean baselines trained on similar data distributions, we will revise §3.1 to include explicit quantitative metrics: mean and variance of Tsallis entropy values, together with statistical significance tests (e.g., t-tests or Kolmogorov-Smirnov tests), for clean versus backdoored models trained under identical conditions. This will provide a clearer measure of entropy spread attributable to backdoor injection. revision: yes
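
The kind of comparison the simulated rebuttal promises could look like the following sketch, which computes Welch's t statistic and the two-sample Kolmogorov-Smirnov statistic using only the Python standard library. It returns test statistics only, with no p-values, and the function names and toy entropy values are hypothetical. In practice a statistics library would normally be used instead.

```python
import bisect
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    se2 = variance(a) / len(a) + variance(b) / len(b)
    if se2 == 0:
        raise ValueError("both samples have zero variance")
    return (mean(a) - mean(b)) / math.sqrt(se2)


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for x in sorted(set(a_sorted) | set(b_sorted)):
        cdf_a = bisect.bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


if __name__ == "__main__":
    # Toy per-sample entropies; illustrative only, not data from the paper.
    clean = [2.0, 2.1, 1.9, 2.0]
    backdoored = [1.0, 1.1, 0.9, 1.0]
    print(welch_t(clean, backdoored), ks_statistic(clean, backdoored))
```

The t statistic is sensitive to a shift in mean entropy, while the KS statistic also reacts to changes in spread or shape, so reporting both addresses the referee's question about entropy spread more directly than a mean comparison alone.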

Circularity Check

0 steps flagged

No significant circularity; derivation rests on empirical observation plus standard statistical quantification

full rationale

The paper begins with a stated empirical observation that backdoor injection produces structural anomalies in visual attention on benign samples, then quantifies those anomalies via Tsallis entropy on attention distributions extracted from initial LLM layers followed by reference-anchored Z-score normalization. No equation or step reduces the detection output to a fitted parameter or self-referential definition of the target; the Z-score operates on a small set of benign samples without the threshold or entropy measure being constructed from backdoor labels. No self-citations are invoked as load-bearing uniqueness theorems, and the method does not rename a known result or smuggle an ansatz. The chain is therefore self-contained against external benchmarks of attention statistics and detection performance.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that backdoors produce detectable attention anomalies on clean inputs; no free parameters or invented entities are introduced in the abstract description.

axioms (1)
  • domain assumption: Backdoor injection disrupts cross-modal alignment, resulting in pronounced structural anomalies in visual attention allocation on benign samples.
    This observation is presented as the foundational insight enabling the detection method.

pith-pipeline@v0.9.0 · 5758 in / 1273 out tokens · 54314 ms · 2026-05-20T19:42:23.196308+00:00 · methodology

