EntropyScan: Towards Model-level Backdoor Detection in LVLMs via Visual Attention Entropy
Pith reviewed 2026-05-20 19:42 UTC · model grok-4.3
The pith
Backdoor attacks in large vision-language models create structural anomalies in visual attention on benign samples that can be detected using entropy measures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Backdoor injection disrupts the cross-modal alignment, resulting in pronounced structural anomalies in visual attention allocation on benign samples. EntropyScan detects the backdoored models by quantifying such attention deviations with Tsallis entropy and reference-anchored Z-score normalization on benign samples.
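For orientation, the two quantities named in the core claim have standard textbook forms, written out below. The symbols $p_i$, $q$, $s$, $\mu_{\mathrm{ref}}$ and $\sigma_{\mathrm{ref}}$ are notation assumed here. This summary does not state the paper's value of $q$ or how the reference statistics are estimated, so the block should be read as the generic definitions rather than the authors' exact formulation.

```latex
% Tsallis entropy of a discrete distribution p = (p_1, ..., p_n), entropic index q \neq 1
S_q(p) = \frac{1}{q-1}\left(1 - \sum_{i=1}^{n} p_i^{\,q}\right),
\qquad
\lim_{q \to 1} S_q(p) = -\sum_{i=1}^{n} p_i \ln p_i

% Z-score of a model's entropy statistic s against reference mean and standard deviation
z = \frac{s - \mu_{\mathrm{ref}}}{\sigma_{\mathrm{ref}}}
```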
What carries the argument
Tsallis entropy applied to visual attention distributions from initial layers of the LLM component to measure backdoor-induced structural distortions.
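A minimal sketch of the entropy step, assuming the attention over visual tokens has already been extracted as a list of non-negative weights. The function name `tsallis_entropy`, the default `q=2.0`, and the example weights are all hypothetical; the summary does not give the paper's q value or layer indices.

```python
import math

def tsallis_entropy(weights, q=2.0):
    """Tsallis entropy of non-negative weights after normalizing them to sum to 1.

    q=2.0 is an illustrative default, not the paper's setting.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    probs = [w / total for w in weights]
    if math.isclose(q, 1.0):
        # The limit q -> 1 recovers Shannon entropy (natural log).
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)

# Hypothetical attention over four visual tokens.
uniform = [0.25, 0.25, 0.25, 0.25]   # attention spread evenly
peaked = [0.97, 0.01, 0.01, 0.01]    # attention collapsed onto one token

entropy_uniform = tsallis_entropy(uniform)
entropy_peaked = tsallis_entropy(peaked)
```

Spread-out attention yields a higher value than collapsed attention, which is the kind of structural difference an entropy measure can expose.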
If this is right
- Allows detection without knowledge of the backdoor trigger or poisoned training data.
- Achieves an average F1 score of 98.5% and AUC of 96.6% across tested architectures and attacks.
- Works on two different LVLM architectures and three advanced attack scenarios.
- Relies only on a small set of benign samples for reference normalization.
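The reference normalization mentioned in the last point can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: it assumes the reference statistics come from per-sample entropy scores of a trusted model on the benign set, that the candidate is summarized by its mean entropy, and that a threshold of 3.0 on the absolute Z-score is used. None of these specifics appear in the summary, and all names and values are hypothetical.

```python
import statistics

def reference_z_score(candidate_scores, reference_scores):
    """Z-score of the candidate's mean entropy against reference mean and std."""
    mu = statistics.mean(reference_scores)
    sigma = statistics.stdev(reference_scores)  # sample std; needs >= 2 values
    if sigma == 0:
        raise ValueError("reference scores have zero variance")
    return (statistics.mean(candidate_scores) - mu) / sigma

def is_flagged(candidate_scores, reference_scores, threshold=3.0):
    """Flag the candidate if it deviates from the reference in either direction.

    The threshold is an assumed value, not one taken from the paper.
    """
    return abs(reference_z_score(candidate_scores, reference_scores)) > threshold

# Hypothetical entropy scores on the same small benign set.
reference = [0.70, 0.72, 0.68, 0.71, 0.69]
clean_like = [0.71, 0.69, 0.70]
suspicious = [0.40, 0.42, 0.41]

flag_clean = is_flagged(clean_like, reference)
flag_suspicious = is_flagged(suspicious, reference)
```

The absolute value is used because the summary does not say in which direction backdoor injection shifts the entropy.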
Where Pith is reading between the lines
- Similar entropy-based detection could be explored for other multimodal AI systems vulnerable to alignment attacks.
- The method highlights attention mechanisms as potential weak points for security analysis in LVLMs.
- Extending the approach to later layers or different entropy measures might improve detection in some cases.
Load-bearing premise
Backdoor injection produces consistent structural anomalies in visual attention on benign samples, across different attacks and models, and these anomalies are absent from clean models.
What would settle it
Running the detector on a backdoored model whose attack was crafted to preserve normal visual attention patterns on benign inputs would show whether the method fails.
Abstract
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across various tasks, yet they remain vulnerable to backdoor attacks. Existing defense methods predominantly focus on sample-level defense, which relies on knowledge of the training data or triggers. However, identifying whether a given model is backdoored remains a critical but unexplored task. To fill this gap, we propose EntropyScan, a lightweight and trigger-agnostic method for model-level backdoor detection in LVLMs. We first observe that backdoor injection disrupts the cross-modal alignment, resulting in pronounced structural anomalies in visual attention allocation on benign samples. Based on this insight, EntropyScan detects backdoored models by quantifying such attention deviations. Specifically, it extracts visual attention distributions from the initial layers of the Large Language Model (LLM) and applies Tsallis entropy to capture these structural distortions. By employing a reference-anchored Z-score normalization on a small set of benign samples, it effectively identifies the backdoored model. Extensive experiments across two LVLM architectures and three advanced attack scenarios show that EntropyScan achieves an average F1 score of 98.5% and an AUC of 96.6%. Our code will be publicly available soon.
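For readers less familiar with the two reported metrics, the sketch below shows how F1 and AUC are conventionally computed for a model-level detector that assigns each model a score. The labels, scores, and the 3.0 threshold are made-up illustration values and are unrelated to the paper's experiments.

```python
def f1_score(labels, predictions):
    """F1 for binary labels, where 1 means 'backdoored'."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical detector scores for six models (1 = backdoored, 0 = clean).
labels = [1, 1, 1, 0, 0, 0]
scores = [5.2, 4.1, 1.5, 2.0, 0.3, 0.8]
predictions = [1 if s > 3.0 else 0 for s in scores]  # assumed threshold

f1 = f1_score(labels, predictions)
area = auc(labels, scores)
```

F1 depends on the chosen threshold while AUC does not, which is why the two numbers can diverge.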
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes EntropyScan for model-level backdoor detection in Large Vision-Language Models (LVLMs). It claims that backdoor injection disrupts cross-modal alignment, producing detectable structural anomalies in visual attention distributions extracted from the initial LLM layers even on benign samples. The method quantifies these via Tsallis entropy followed by reference-anchored Z-score normalization on a small set of benign samples. Experiments on two LVLM architectures and three attack scenarios report average F1 of 98.5% and AUC of 96.6%.
Significance. If the central empirical claim is substantiated, the work would be significant as one of the first practical model-level detectors for backdoored LVLMs. It is lightweight, trigger-agnostic, and leverages existing attention maps rather than requiring trigger reconstruction or poisoned data access. The reported aggregate performance is strong, and the commitment to public code is a positive for reproducibility. Such a method could be adopted in deployment pipelines where users receive models without training provenance.
major comments (2)
- [§4] §4 (Experiments): The reported average F1 of 98.5% and AUC of 96.6% are given only in aggregate form with no per-model entropy histograms, variance statistics across clean-model training runs, or ablation on non-backdoor fine-tuning. This directly bears on whether the Z-score threshold separates backdoor-induced anomalies from natural variation due to initialization or data heterogeneity.
- [§3.1] §3.1 (Observation of attention anomalies): The premise that backdoor training consistently produces 'pronounced structural anomalies' in visual attention on benign inputs (absent in clean models) is load-bearing for the detection claim, yet the manuscript provides no quantitative comparison of entropy spread between clean and backdoored models under matched training conditions.
minor comments (2)
- [§3.2] The choice of q-parameter in the Tsallis entropy formula and the exact number of benign samples used for reference anchoring should be stated explicitly with an equation or pseudocode for reproducibility.
- [Figure 2] Figure captions for attention visualizations could include the exact layer indices from which maps are extracted to allow direct replication.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, acknowledging where additional evidence is needed and outlining specific revisions to strengthen the empirical claims.
- Referee: [§4] §4 (Experiments): The reported average F1 of 98.5% and AUC of 96.6% are given only in aggregate form with no per-model entropy histograms, variance statistics across clean-model training runs, or ablation on non-backdoor fine-tuning. This directly bears on whether the Z-score threshold separates backdoor-induced anomalies from natural variation due to initialization or data heterogeneity.
  Authors: We agree that aggregate performance metrics alone leave open questions about natural variation. In the revised manuscript we will add per-model entropy histograms comparing clean and backdoored models, report standard deviations of entropy values across multiple independent clean-model training runs with matched hyperparameters and data, and include an ablation on non-backdoor fine-tuning (e.g., continued pre-training or instruction tuning on clean data). These additions will directly test whether the chosen Z-score threshold isolates backdoor-induced shifts from initialization or data heterogeneity effects.
  Revision: yes
- Referee: [§3.1] §3.1 (Observation of attention anomalies): The premise that backdoor training consistently produces 'pronounced structural anomalies' in visual attention on benign inputs (absent in clean models) is load-bearing for the detection claim, yet the manuscript provides no quantitative comparison of entropy spread between clean and backdoored models under matched training conditions.
  Authors: We acknowledge that a quantitative comparison under strictly matched training conditions is important for substantiating the core observation. While the current experiments compare backdoored models against clean baselines trained on similar data distributions, we will revise §3.1 to include explicit quantitative metrics: mean and variance of Tsallis entropy values, together with statistical significance tests (e.g., t-tests or Kolmogorov-Smirnov tests), for clean versus backdoored models trained under identical conditions. This will provide a clearer measure of entropy spread attributable to backdoor injection.
  Revision: yes
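The comparison promised in the responses above could look roughly like the sketch below, which computes a Welch t statistic and a two-sample Kolmogorov-Smirnov statistic from two sets of entropy values. It is a standard-library illustration with made-up numbers, not the authors' analysis, and it reports test statistics only. For p-values, SciPy's `scipy.stats.ttest_ind` (with `equal_var=False`) and `scipy.stats.ks_2samp` are the usual tools.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va / len(a) + vb / len(b))

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    def ecdf(sample, x):
        return sum(1 for v in sample if v <= x) / len(sample)
    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

# Hypothetical Tsallis entropy values from matched training runs.
clean = [0.70, 0.72, 0.68, 0.71, 0.69]
backdoored = [0.40, 0.42, 0.41, 0.39, 0.43]

t_stat = welch_t(clean, backdoored)
ks_stat = ks_statistic(clean, backdoored)
```

A KS statistic near 1 means the two entropy distributions barely overlap, while a value near 0 would suggest that natural variation alone could explain the difference.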
Circularity Check
No significant circularity; derivation rests on empirical observation plus standard statistical quantification
The paper begins with a stated empirical observation that backdoor injection produces structural anomalies in visual attention on benign samples, then quantifies those anomalies via Tsallis entropy on attention distributions extracted from initial LLM layers followed by reference-anchored Z-score normalization. No equation or step reduces the detection output to a fitted parameter or self-referential definition of the target; the Z-score operates on a small set of benign samples without the threshold or entropy measure being constructed from backdoor labels. No self-citations are invoked as load-bearing uniqueness theorems, and the method does not rename a known result or smuggle an ansatz. The chain is therefore self-contained against external benchmarks of attention statistics and detection performance.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Backdoor injection disrupts cross-modal alignment, resulting in pronounced structural anomalies in visual attention allocation on benign samples.