EntropyScan: Towards Model-level Backdoor Detection in LVLMs via Visual Attention Entropy

Jie Zhang; Shiguang Shan; Xilin Chen; Xuanyu Ge; Zhongqi Wang

arxiv: 2605.15711 · v1 · pith:JT3OJOENnew · submitted 2026-05-15 · 💻 cs.CV

Xuanyu Ge, Zhongqi Wang, Jie Zhang, Shiguang Shan, Xilin Chen

Pith reviewed 2026-05-20 19:42 UTC · model grok-4.3

classification 💻 cs.CV

keywords backdoor detection · large vision-language models · visual attention · Tsallis entropy · Z-score normalization · model-level defense · cross-modal alignment

The pith

Backdoor attacks in large vision-language models create structural anomalies in visual attention on benign samples that can be detected using entropy measures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to identify backdoored large vision-language models at the model level without needing triggers or training data. It establishes that injecting a backdoor disrupts cross-modal alignment, leading to consistent anomalies in how the model distributes attention across normal images. These anomalies are quantified by applying Tsallis entropy to attention distributions from the first layers of the language model part. A reference-anchored Z-score normalization then compares against clean behavior using a few benign samples. This would matter for practical auditing of models before use, as it avoids reliance on attack-specific knowledge.

Core claim

Backdoor injection disrupts the cross-modal alignment, resulting in pronounced structural anomalies in visual attention allocation on benign samples. EntropyScan detects the backdoored models by quantifying such attention deviations with Tsallis entropy and reference-anchored Z-score normalization on benign samples.

What carries the argument

Tsallis entropy applied to visual attention distributions from initial layers of the LLM component to measure backdoor-induced structural distortions.
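The quantity named above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the helper names, the toy attention row, the visual-token indices, and the choice `q=2.0` are all assumptions, since this page does not state the paper's value of q. It renormalizes one attention row over the visual tokens and scores it with the standard Tsallis formula.

```python
import math


def visual_attention_distribution(attn_row, visual_positions):
    """Keep only the attention mass on visual tokens and renormalize it to sum to 1."""
    mass = [attn_row[i] for i in visual_positions]
    total = sum(mass)
    if total <= 0:
        raise ValueError("no attention mass on visual tokens")
    return [m / total for m in mass]


def tsallis_entropy(p, q=2.0):
    """Tsallis entropy (1 - sum_i p_i^q) / (q - 1); Shannon entropy in the q -> 1 limit."""
    if abs(q - 1.0) < 1e-9:
        return -sum(x * math.log(x) for x in p if x > 0)
    return (1.0 - sum(x ** q for x in p if x > 0)) / (q - 1.0)


# Toy attention row of the last prompt token over 5 key tokens (made-up numbers).
attn_row = [0.10, 0.30, 0.30, 0.10, 0.20]
visual_positions = [1, 2, 3]  # hypothetical indices of the image-patch tokens

p_v = visual_attention_distribution(attn_row, visual_positions)
h_spread = tsallis_entropy(p_v, q=2.0)                 # mass spread over patches
h_peaked = tsallis_entropy([0.98, 0.01, 0.01], q=2.0)  # mass collapsed on one patch
print(h_spread, h_peaked)
```

A distribution that collapses onto a single patch scores lower than one that spreads its mass, which is the kind of structural difference the entropy is meant to expose.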

If this is right

Allows detection without knowledge of the backdoor trigger or poisoned training data.
Achieves an average F1 score of 98.5% and AUC of 96.6% across tested architectures and attacks.
Works on two different LVLM architectures and three advanced attack scenarios.
Relies only on a small set of benign samples for reference normalization.
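The reference normalization in the last point can be sketched as follows. This is one plausible reading, not the paper's exact procedure: it assumes the suspect model's mean entropy is compared against the mean and sample standard deviation of the reference model's entropies on the same benign samples. The function names, the threshold of 3.0, and all numbers are hypothetical.

```python
from statistics import mean, stdev


def reference_anchored_z(target_entropies, reference_entropies):
    """Z-score of the suspect model's mean entropy, anchored to the reference statistics."""
    mu = mean(reference_entropies)
    sigma = stdev(reference_entropies)  # sample standard deviation of the reference
    if sigma == 0:
        raise ValueError("reference entropies have zero variance")
    return (mean(target_entropies) - mu) / sigma


def is_flagged(z, threshold=3.0):
    """Flag the model when its deviation exceeds a (hypothetical) threshold."""
    return abs(z) > threshold


# Made-up per-sample entropies on the same small benign set.
reference = [0.90, 0.92, 0.88, 0.91, 0.89]
clean_suspect = [0.91, 0.90, 0.89, 0.92, 0.88]
shifted_suspect = [0.70, 0.72, 0.69, 0.71, 0.68]

z_clean = reference_anchored_z(clean_suspect, reference)
z_shifted = reference_anchored_z(shifted_suspect, reference)
print(is_flagged(z_clean), is_flagged(z_shifted))
```

Because the score is expressed in units of the reference model's own spread, the same threshold can in principle be reused across architectures, provided the reference spread is a fair estimate of clean variation.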

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar entropy-based detection could be explored for other multimodal AI systems vulnerable to alignment attacks.
The method highlights attention mechanisms as potential weak points for security analysis in LVLMs.
Extending the approach to later layers or different entropy measures might improve detection in some cases.
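The layer question in the last point could be explored with a simple sweep. The sketch below assumes per-layer entropy values are already available for a reference model and a suspect model. The data layout, function names, and numbers are illustrative only and are not taken from the paper.

```python
from statistics import mean, stdev


def layerwise_z(target_by_layer, reference_by_layer):
    """Return {layer: z}, anchoring the target's mean entropy to the reference per layer."""
    scores = {}
    for layer, ref_vals in reference_by_layer.items():
        sigma = stdev(ref_vals)
        if sigma == 0:
            continue  # skip layers where the reference shows no spread
        scores[layer] = (mean(target_by_layer[layer]) - mean(ref_vals)) / sigma
    return scores


def most_deviant_layer(scores):
    """Layer whose Z-score has the largest magnitude."""
    return max(scores, key=lambda layer: abs(scores[layer]))


# Made-up entropies for three layers and three benign samples each.
reference_by_layer = {0: [0.90, 0.92, 0.88], 1: [0.80, 0.83, 0.77], 2: [0.70, 0.74, 0.66]}
target_by_layer = {0: [0.60, 0.62, 0.58], 1: [0.74, 0.77, 0.71], 2: [0.69, 0.73, 0.65]}

scores = layerwise_z(target_by_layer, reference_by_layer)
print(most_deviant_layer(scores), scores)
```

Reporting the whole per-layer profile rather than only the maximum makes it easier to see whether a signal is concentrated at one depth or decays gradually.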

Load-bearing premise

Backdoor injection produces consistent structural anomalies in visual attention on benign samples, across different attacks and models, and clean versions of the same models do not show them.

What would settle it

Running the detection on a backdoored model where the attack was crafted to preserve normal visual attention patterns on benign inputs would show if the method fails.

Figures

Figures reproduced from arXiv: 2605.15711 by Jie Zhang, Shiguang Shan, Xilin Chen, Xuanyu Ge, Zhongqi Wang.

**Figure 1.** Illustration of a multimodal backdoor attack against Large Vision-Language Models (LVLMs). A compromised model downloaded from a third-party platform (left) generates accurate, harmless responses for benign inputs. However, introducing a predefined trigger activates the hidden backdoor, forcing the model to output a malicious target response and bypass safety alignments (right). While this distribution me…

**Figure 2.** Overview of the EntropyScan. EntropyScan evaluates a suspect model $M_{\text{target}}$ against an architecture-matched benign reference $M_{\text{ref}}$ using a small clean dataset $D_{\text{val}}$. Specifically, (a) it extracts visual attention weights from the initial layer of LLM to formulate the renormalized conditional probability distribution $P_v$. To quantify structural anomalies, we calculate the Tsallis entropy $H_q(\cdot)$ to $P_v$, yielding t…

**Figure 3.** Visualization of visual attention maps at the initial layer (Layer-0) of the LLM given the same input. Each colored cell in the right-side heatmaps denotes the attention probability of the final token of the input prompt (acting as the query) attending to a specific visual patch token (acting as the key). Map (a) illustrates the standard attention distribution of the benign model, while map (b) reveals th…

**Figure 4.** Layer-wise Z-score analysis of visual attention entropy. The red and blue lines represent the backdoored and benign models, respectively. The comparison under (a) ImgTrojan and (b) ShadowCast attacks reveals that the most significant structural deviation occurs at the initial layer ($l = 0$), validating our layer selection strategy. We visualize the $P_V^{(0)}$ for the benign and backdoored model respectively. …

**Figure 5.** Layer sensitivity analysis. (a) Layer-0 serves as the optimal detection probe in the majority of attack scenarios. (b) The detection signal decays as network depth increases.

**Figure 6.** Illustrative examples of the evaluated backdoor attack scenarios. The top row demonstrates two task-specific variations of the ShadowCast: the Label Attack (left) and the Persuasion Attack (right). The middle row illustrates the application of the VL-Trojan across two distinct tasks: Image Captioning (left) and Spot the Difference (right). The bottom row depicts the ImgTrojan, which achieves a malicious ja…

Abstract

Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across various tasks, yet they remain vulnerable to backdoor attacks. Existing defense methods predominantly focus on sample-level defense, which relies on knowledge of the training data or triggers. However, identifying whether a given model is backdoored remains a critical but unexplored task. To fill this gap, we propose EntropyScan, a lightweight and trigger-agnostic method for model-level backdoor detection in LVLMs. We first observe that backdoor injection disrupts the cross-modal alignment, resulting in pronounced structural anomalies in visual attention allocation on benign samples. Based on this insight, EntropyScan detects backdoored models by quantifying such attention deviations. Specifically, it extracts visual attention distributions from the initial layers of the Large Language Model (LLM) and applies Tsallis entropy to capture these structural distortions. By employing a reference-anchored Z-score normalization on a small set of benign samples, it effectively identifies the backdoored model. Extensive experiments across two LVLM architectures and three advanced attack scenarios show that EntropyScan achieves an average F1 score of 98.5% and an AUC of 96.6%. Our code will be publicly available soon.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EntropyScan uses Tsallis entropy on early-layer visual attention plus Z-score to flag backdoored LVLMs on clean inputs, but the experiments leave open whether the signal is truly backdoor-specific or just training noise.

The main thing here is a model-level detector that watches how visual attention entropy behaves on ordinary samples after backdoor training. It extracts distributions from the first LLM layers, measures them with Tsallis entropy, and flags outliers via reference-anchored Z-score. That is the concrete new piece: a trigger-agnostic check that does not require poisoned data or trigger knowledge, unlike the sample-level defenses cited in the abstract. The reported numbers across two LVLM architectures and three attack types are high enough to notice: 98.5% average F1 and 96.6% AUC. Releasing code will also help people test it directly. Those elements give the work a practical angle worth looking at for anyone auditing deployed multimodal models.

The soft spot sits right at the central claim. The method assumes backdoor injection reliably produces attention anomalies on benign inputs that clean models do not exhibit at similar scale. The abstract gives aggregate detection scores but no entropy histograms, per-model variance numbers, or ablations that compare backdoored runs against clean fine-tuning or different initializations. If natural training variation already moves entropy in the same range, the Z-score threshold could be unstable or produce false positives. That gap is real and worth pressing on, though it does not make the whole idea collapse.

This paper is aimed at people working on AI safety for vision-language systems who need lightweight model inspection tools. A reader who cares about practical detection without trigger access will get something usable from it. It is coherent enough and addresses a clear gap, so it deserves a serious referee. I would send it to peer review and ask the authors for the missing variance checks and controls.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes EntropyScan for model-level backdoor detection in Large Vision-Language Models (LVLMs). It claims that backdoor injection disrupts cross-modal alignment, producing detectable structural anomalies in visual attention distributions extracted from the initial LLM layers even on benign samples. The method quantifies these via Tsallis entropy followed by reference-anchored Z-score normalization on a small set of benign samples. Experiments on two LVLM architectures and three attack scenarios report average F1 of 98.5% and AUC of 96.6%.

Significance. If the central empirical claim is substantiated, the work would be significant as one of the first practical model-level detectors for backdoored LVLMs. It is lightweight, trigger-agnostic, and leverages existing attention maps rather than requiring trigger reconstruction or poisoned data access. The reported aggregate performance is strong, and the commitment to public code is a positive for reproducibility. Such a method could be adopted in deployment pipelines where users receive models without training provenance.

major comments (2)

[§4, Experiments] The reported average F1 of 98.5% and AUC of 96.6% are given only in aggregate form with no per-model entropy histograms, variance statistics across clean-model training runs, or ablation on non-backdoor fine-tuning. This directly bears on whether the Z-score threshold separates backdoor-induced anomalies from natural variation due to initialization or data heterogeneity.
[§3.1, Observation of attention anomalies] The premise that backdoor training consistently produces 'pronounced structural anomalies' in visual attention on benign inputs (absent in clean models) is load-bearing for the detection claim, yet the manuscript provides no quantitative comparison of entropy spread between clean and backdoored models under matched training conditions.
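To make the first major comment concrete, the sketch below shows how aggregate AUC and F1 are obtained from model-level scores and how a single clean model with large natural variation would move them. The scores are invented for illustration and the threshold of 3.0 is hypothetical; none of these values come from the paper.

```python
def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def f1_at_threshold(pos_scores, neg_scores, threshold):
    """F1 when every model scoring above the threshold is flagged as backdoored."""
    tp = sum(s > threshold for s in pos_scores)
    fp = sum(s > threshold for s in neg_scores)
    fn = len(pos_scores) - tp
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up |Z| scores: one per model, not per sample.
backdoored = [12.6, 8.1, 4.2, 2.5]
clean = [0.3, 1.1, 2.9, 0.8]  # 2.9 mimics a clean fine-tune with large natural drift

model_auc = auc(backdoored, clean)
model_f1 = f1_at_threshold(backdoored, clean, threshold=3.0)
print(model_auc, model_f1)
```

With only a handful of models per class, one borderline clean model is enough to shift both metrics, which is why per-model distributions are more informative than the averages alone.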

minor comments (2)

[§3.2] The choice of q-parameter in the Tsallis entropy formula and the exact number of benign samples used for reference anchoring should be stated explicitly with an equation or pseudocode for reproducibility.
[Figure 2] Figure captions for attention visualizations could include the exact layer indices from which maps are extracted to allow direct replication.
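For reference, the explicit form requested in the first minor comment would look roughly like the following. The Tsallis entropy and its Shannon limit are standard definitions. The symbols $a_i$ and $\mathcal{V}$, and the particular way the Z-score aggregates over samples, are assumptions made here for illustration, because this page does not reproduce the paper's equations or its value of $q$.

```latex
% Renormalized visual attention for one query token; a_i is the raw attention
% weight on key i and \mathcal{V} is the set of visual-token positions (assumed notation).
P_v(i) = \frac{a_i}{\sum_{j \in \mathcal{V}} a_j}, \qquad i \in \mathcal{V}

% Tsallis entropy with entropic index q, and its Shannon limit.
H_q(P_v) = \frac{1}{q - 1}\Bigl(1 - \sum_{i \in \mathcal{V}} P_v(i)^{q}\Bigr),
\qquad
\lim_{q \to 1} H_q(P_v) = -\sum_{i \in \mathcal{V}} P_v(i)\,\ln P_v(i)

% Reference-anchored Z-score (one assumed aggregation): \mu_{ref} and \sigma_{ref}
% are the mean and standard deviation of H_q under M_{ref} on D_{val};
% \bar{H}_q^{target} is the mean of H_q under M_{target} on the same samples.
Z = \frac{\bar{H}_q^{\text{target}} - \mu_{\text{ref}}}{\sigma_{\text{ref}}}
```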

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, acknowledging where additional evidence is needed and outlining specific revisions to strengthen the empirical claims.

Referee: [§4, Experiments] The reported average F1 of 98.5% and AUC of 96.6% are given only in aggregate form with no per-model entropy histograms, variance statistics across clean-model training runs, or ablation on non-backdoor fine-tuning. This directly bears on whether the Z-score threshold separates backdoor-induced anomalies from natural variation due to initialization or data heterogeneity.

Authors: We agree that aggregate performance metrics alone leave open questions about natural variation. In the revised manuscript we will add per-model entropy histograms comparing clean and backdoored models, report standard deviations of entropy values across multiple independent clean-model training runs with matched hyperparameters and data, and include an ablation on non-backdoor fine-tuning (e.g., continued pre-training or instruction tuning on clean data). These additions will directly test whether the chosen Z-score threshold isolates backdoor-induced shifts from initialization or data heterogeneity effects. Revision: yes.

Referee: [§3.1, Observation of attention anomalies] The premise that backdoor training consistently produces 'pronounced structural anomalies' in visual attention on benign inputs (absent in clean models) is load-bearing for the detection claim, yet the manuscript provides no quantitative comparison of entropy spread between clean and backdoored models under matched training conditions.

Authors: We acknowledge that a quantitative comparison under strictly matched training conditions is important for substantiating the core observation. While the current experiments compare backdoored models against clean baselines trained on similar data distributions, we will revise §3.1 to include explicit quantitative metrics: mean and variance of Tsallis entropy values, together with statistical significance tests (e.g., t-tests or Kolmogorov-Smirnov tests), for clean versus backdoored models trained under identical conditions. This will provide a clearer measure of entropy spread attributable to backdoor injection. Revision: yes.
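The Kolmogorov-Smirnov comparison mentioned in the second response can be sketched without external libraries. The function below computes only the two-sample KS statistic (the largest gap between the two empirical CDFs), not a p-value. The entropy values are made up, and nothing here reflects results from the paper.

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past x so ties are counted in both CDFs.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


# Made-up per-sample Tsallis entropies under matched training conditions.
clean_entropies = [0.90, 0.92, 0.88, 0.91, 0.89]
backdoored_entropies = [0.70, 0.72, 0.69, 0.71, 0.68]

d_separated = ks_statistic(clean_entropies, backdoored_entropies)
d_same = ks_statistic(clean_entropies, clean_entropies)
print(d_separated, d_same)
```

A statistic near 1 means the two entropy samples barely overlap, while a value near 0 means they are indistinguishable. A significance level still needs a p-value, for example from an established statistics package.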

Circularity Check

0 steps flagged

No significant circularity; derivation rests on empirical observation plus standard statistical quantification

The paper begins with a stated empirical observation that backdoor injection produces structural anomalies in visual attention on benign samples, then quantifies those anomalies via Tsallis entropy on attention distributions extracted from initial LLM layers followed by reference-anchored Z-score normalization. No equation or step reduces the detection output to a fitted parameter or self-referential definition of the target; the Z-score operates on a small set of benign samples without the threshold or entropy measure being constructed from backdoor labels. No self-citations are invoked as load-bearing uniqueness theorems, and the method does not rename a known result or smuggle an ansatz. The chain is therefore self-contained against external benchmarks of attention statistics and detection performance.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that backdoors produce detectable attention anomalies on clean inputs; no free parameters or invented entities are introduced in the abstract description.

axioms (1)

Domain assumption: Backdoor injection disrupts cross-modal alignment, resulting in pronounced structural anomalies in visual attention allocation on benign samples.
This observation is presented as the foundational insight enabling the detection method.


[1] [1]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

In: Advances in Neural Information Processing Systems (NeurIPS)

Cai, X., Xu, H., Xu, S., Zhang, Y., Yuan, X.: Badprompt: Backdoor attacks on continuous prompts. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 35, pp. 37068–37080 (2022)

work page 2022

[4] Chen, B., Carvalho, W., Baracaldo, N., Ludwig, H., Edwards, B., Lee, T., Molloy, I., Srivastava, B.: Detecting backdoor attacks on deep neural networks by activation clustering. In: AAAI Workshop on Artificial Intelligence Safety (SafeAI@AAAI). CEUR Workshop Proceedings, vol. 2301 (2019)

[5] Chen, X., Liu, C., Li, B., Lu, K., Song, D.: Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526 (2017)

[6] Chiang, W.L., Lin, Z., Sheng, Y., Li, X., Liu, D., Zhang, H., Hou, Y., Zhong, Y., Wang, S., Li, Z., Zhu, T., Lin, C.H., Wu, Y., Zhang, R., Gonzalez, J.E., Stoica, I., Xing, E.P.: Vicuna: An open-source chatbot impressing GPT-4 with 90% ChatGPT quality (2023)

[7] Chou, S.Y., Chen, P.Y., Ho, T.Y.: How to backdoor diffusion models? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

[8] Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H.W., Sutton, C., Gehrmann, S., Schuh, P., Shi, K., Tsvyashchenko, S., Maynez, J., Rao, A., Barnes, P., Tay, Y., Shazeer, N., Prabhakaran, V., Reif, E., Du, N., Hutchinson, B., Pope, R., Bradbury, J., Austin, J., Isard, M., Gur-Ari, G., Yin, P., Duke, T., Levsk... PaLM: Scaling Language Modeling with Pathways. arXiv (2022)

[9] Dai, W., Li, J., Li, D., Tiong, A.M.H., Zhao, J., Wang, W., Li, B., Fung, P., Hoi, S.: InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In: NeurIPS (2023)

[10] Dettmers, T., Pagnoni, A., Holtzman, A., Zettlemoyer, L.: QLoRA: Efficient finetuning of quantized LLMs. In: NeurIPS (2023)

[11] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)

[12] Fawcett, T.: An introduction to ROC analysis. Pattern Recognition Letters 27(8), 861–874 (2006)

[13] Gao, Y., Xu, C., Wang, D., Chen, S., Ranasinghe, D.C., Nepal, S.: STRIP: A defence against trojan attacks on deep neural networks. In: ACSAC (2019)

[14] Gu, T., Liu, K., Dolan-Gavitt, B., Garg, S.: BadNets: Evaluating backdooring attacks on deep neural networks. IEEE Access (2019)

[15] Hao, J., Jin, X., Xiaoguang, H., Tianyou, C., Jiajia, Z.: Diff-Cleanse: Identifying and mitigating backdoor attacks in diffusion models (2024)

[16] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)

[17] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: ICLR (2022)

[18] Hubert, L., Arabie, P.: Comparing partitions. Journal of Classification 2(1), 193–218 (1985)

[19] Jolliffe, I.T.: Principal Component Analysis. Springer, 2nd edn. (2002)

[20] Kreyszig, E.: Advanced engineering mathematics. John Wiley & Sons (2011)

[21] Li, B., Zhang, Y., Chen, L., Wang, J., Pu, F., Cahyono, J.A., Yang, J., Liu, Z.: Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726 (2023)

[22] Li, B., Zhang, Y., Chen, L., Wang, J., Pu, F., Yang, J., Li, C., Liu, Z.: MIMIC-IT: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425 (2023)

[23] Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., Gao, J.: LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. In: Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track (2023)

[24] Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: ICML (2023)

[25] Liang, J., Liang, S., Luo, M., Liu, A., Han, D., Chang, E.C., Cao, X.: VL-Trojan: Multimodal instruction backdoor attacks against autoregressive visual language models. IJCV (2025)

[26] Liang, S., Liang, J., Pang, T., Du, C., Liu, A., Chang, E.C., Cao, X.: Revisiting backdoor attacks against large vision-language models. In: CVPR (2025)

[27] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023)

[28] Liu, K., Dolan-Gavitt, B., Garg, S.: Fine-pruning: Defending against backdooring attacks on deep neural networks. In: RAID (2018)

[29] Liu, M., Liang, S., Howlader, K., Wang, L., Tao, D., Zhang, W.: Natural reflection backdoor attack on vision language model for autonomous driving. arXiv preprint arXiv:2505.06413 (2025)

[30] Liu, M., Fang, H., Cong, R.: TCAP: Tri-component attention profiling for unsupervised backdoor detection in MLLM fine-tuning. arXiv preprint arXiv:2601.21692 (2026)

[31] Lu, D., Pang, T., Du, C., Liu, Q., Yang, X., Lin, M.: Test-time backdoor attacks on multimodal large language models. arXiv preprint arXiv:2402.08577 (2024)

[32] Lu, H., Liu, W., Zhang, B., Wang, B., Dong, K., Liu, B., Sun, J., Ren, T., Li, Z., Yang, H., Sun, Y., Deng, C., Xu, H., Xie, Z., Ruan, C.: DeepSeek-VL: Towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525 (2024)

[33] Lyu, W., Pang, L., Ma, T., Ling, H., Chen, C.: TrojVLM: Backdoor attack against vision language models. In: ECCV (2024)

[34] MacQueen, J.B.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. vol. 1, pp. 281–297 (1967)

[35] Ni, Z., Ye, R., Wei, Y., Xiang, Z., Wang, Y., Chen, S.: Physical backdoor attack can jeopardize driving with vision-large-language models. In: ICMLW (2024)

[36] Qi, F., Chen, Y., Li, M., Yao, Y., Liu, Z., Sun, M.: ONION: A simple and effective defense against textual backdoor attacks. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. pp. 9558–9566. Association for Computational Linguistics (2021)

[37] Qi, F., Li, M., Chen, Y., Zhang, Z., Liu, Z., Wang, Y., Sun, M.: Hidden killer: Invisible textual backdoor attacks with syntactic trigger. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). pp. 443–453. Association f...

[38] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: ICML (2021)

[39] Rong, X., Huang, W., Liang, J., Bi, J., Xiao, X., Li, Y., Du, B., Ye, M.: Backdoor cleaning without external guidance in MLLM fine-tuning. In: NeurIPS (2025)

[40] Shao, H., Hu, Y., Wang, L., Waslander, S.L., Liu, Y., Li, H.: LMDrive: Closed-loop end-to-end driving with large language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

[41] Struppek, L., Hintersdorf, D., Kersting, K.: Rickrolling the artist: Injecting backdoors into text encoders for text-to-image synthesis. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4561–4573 (2022)

[42] Tao, X., Zhong, S., Li, L., Liu, Q., Kong, L.: ImgTrojan: Jailbreaking vision-language models with one image. In: NAACL (2025)

[43] Team, G.: Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2025)

[44] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., Lample, G.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

[45] Tran, B., Li, J., Madry, A.: Spectral signatures in backdoor attacks. In: NeurIPS (2018)

[46] Tsallis, C.: Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics 52(1–2), 479–487 (1988)

[47] Wang, B., Yao, Y., Shan, S., Li, H., Viswanath, B., Zheng, H., Zhao, B.Y.: Neural cleanse: Identifying and mitigating backdoor attacks in neural networks. In: IEEE S&P (2019)

[48] Wang, Z., Zhang, J., Shan, S., Chen, X.: T2IShield: Defending against backdoors on text-to-image diffusion models. In: Proceedings of the European Conference on Computer Vision (ECCV) (2024)

[49] Wang, Z., Zhang, J., Shan, S., Chen, X.: Dynamic attention analysis for backdoor detection in text-to-image diffusion models. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) pp. 1–14 (2025)

[50] Xu, S., Liang, S., Zheng, H., Liu, A., Wang, X., Luo, Y., Lin, F., Rutkowski, L., Tao, D.: SRD: Reinforcement-learned semantic perturbation for backdoor defense in VLMs. In: AAAI (2026)

[51] Xu, Y., Yao, J., Shu, M., Sun, Y., Wu, Z., Yu, N., Goldstein, T., Huang, F.: Shadowcast: Stealthy data poisoning attacks against vision-language models. In: NeurIPS (2024)

[52] Xun, Y., Liang, S., Jia, X., Liu, X., Cao, X.: Robust anti-backdoor instruction tuning in LVLMs. arXiv preprint arXiv:2506.05401 (2025)

[53] Yang, W., Lin, Y., Li, P., Zhou, J., Sun, X.: RAP: Robustness-aware perturbations for defending against backdoor attacks on NLP models. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. pp. 8365–8381. Association for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.emnlp-main.659

[54] Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., Chen, E.: A survey on multimodal large language models. National Science Review 11(12), nwae403 (2024)

[55] Zhai, S., Dong, Y., Shen, Q., Pu, S., Fang, Y., Su, H.: Text-to-image diffusion models can be easily backdoored through multimodal data poisoning. In: Proceedings of the 31st ACM International Conference on Multimedia (ACM MM). pp. 1577–1587. Association for Computing Machinery (2023). https://doi.org/10.1145/3581783.3612108

[56] Zhang, J., Wang, Z., Shan, S., Chen, X.: Trigger without trace: Towards stealthy backdoor attack on text-to-image diffusion models. arXiv preprint arXiv:2503.17724 (2025)

[57] Zhong, Z., Sun, Z., Liu, Y., He, X., Tao, G.: Backdoor attack on vision language models with stealthy semantic manipulation. arXiv preprint arXiv:2506.07214 (2025)