TCAP: Tri-Component Attention Profiling for Unsupervised Backdoor Detection in MLLM Fine-Tuning
Pith reviewed 2026-05-25 07:31 UTC · model grok-4.3
The pith
Poisoned samples in MLLM fine-tuning create measurable imbalances in attention across system instructions, vision inputs, and user queries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that poisoned samples disrupt the balanced attention distribution across three functional components—system instructions, vision inputs, and user textual queries—regardless of trigger morphology. TCAP exploits this by decomposing cross-modal attention maps into the three components, identifying trigger-responsive attention heads via Gaussian Mixture Model statistical profiling, and isolating poisoned samples through EM-based vote aggregation.
What carries the argument
Tri-Component Attention Profiling, which splits attention maps into system, vision, and query components and applies GMM profiling plus EM aggregation to isolate divergences caused by poisoning.
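The three-way split can be illustrated with a minimal sketch. It assumes one attention row (for example, from the final token) and known token boundaries for each component; the function name `component_shares` and the `spans` layout are hypothetical and not taken from the paper.

```python
import numpy as np

def component_shares(attn_row, spans):
    """Return the fraction of attention mass falling on each component.

    attn_row: 1-D attention weights from one query position over all keys.
    spans:    component name -> (start, end) token range, end exclusive.
              These boundaries are assumed inputs, not the paper's API.
    """
    attn_row = np.asarray(attn_row, dtype=float)
    mass = {name: attn_row[s:e].sum() for name, (s, e) in spans.items()}
    total = sum(mass.values())
    if total <= 0:
        raise ValueError("attention row carries no mass on the given spans")
    return {name: m / total for name, m in mass.items()}

# Toy example: 2 system tokens, 4 vision tokens, 2 query tokens.
spans = {"system": (0, 2), "vision": (2, 6), "query": (6, 8)}
row = np.array([0.05, 0.05, 0.10, 0.10, 0.10, 0.10, 0.25, 0.25])
shares = component_shares(row, spans)
```

Computing such shares per head and per sample would give the per-head statistics on which the profiling step operates.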
If this is right
- Backdoor samples can be filtered from MLLM fine-tuning data without any labeled examples or prior knowledge of the trigger.
- The same attention-based signal works across multiple MLLM architectures and different attack methods.
- Service providers can apply the filter before training to reduce the risk that a customized model behaves correctly on clean inputs but fails on triggered ones.
Where Pith is reading between the lines
- The same three-component split might surface other forms of data corruption, such as label noise or distribution shift, in multimodal training.
- If attention divergence is reliable, it could serve as an online monitor during inference to detect when a deployed model is receiving adversarial inputs.
- The approach suggests that attention statistics could replace or supplement loss-based or activation-based anomaly detectors in other safety settings.
Load-bearing premise
The attention allocation divergence is a universal backdoor fingerprint that can be reliably isolated by GMM statistical profiling and EM-based aggregation without supervision or knowledge of trigger type.
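To make the premise concrete, the sketch below fits a two-component one-dimensional Gaussian mixture with EM to a synthetic per-head attention statistic. Several things here are assumptions for illustration only: the data are simulated, the min/max initialization is a simple choice that is sensitive to outliers, and treating the minority component as suspicious is not confirmed as the paper's rule.

```python
import numpy as np

def fit_gmm_1d(x, n_iter=100):
    """Fit a 2-component 1-D Gaussian mixture with plain EM."""
    x = np.asarray(x, dtype=float)
    mu = np.array([x.min(), x.max()])          # simple, outlier-sensitive init
    var = np.full(2, x.var() + 1e-6)
    weights = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities computed in log space for stability.
        diff = x[:, None] - mu
        log_p = np.log(weights) - 0.5 * (np.log(2 * np.pi * var) + diff ** 2 / var)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update mixture weights, means, and variances.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / nk.sum()
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return weights, mu, var, resp

# Synthetic statistic: 900 "clean" values near 0.30, 100 "poisoned" near 0.60.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.30, 0.03, 900), rng.normal(0.60, 0.03, 100)])
weights, mu, var, resp = fit_gmm_1d(x)
minority = int(np.argmin(weights))             # assumption: minority mode is suspicious
flagged = resp[:, minority] > 0.5
```

The premise fails exactly when real clean data are themselves multimodal on this statistic, in which case the minority mode need not correspond to poisoning.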
What would settle it
A dataset of MLLM fine-tuning examples where poisoned samples show no measurable attention divergence from clean ones under the same GMM profiling, or where the method flags a high rate of clean samples as poisoned.
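Both failure conditions can be measured directly once ground-truth poison labels are available for an evaluation set. The following is a small sketch with a hypothetical helper name, `detection_rates`, and toy labels.

```python
import numpy as np

def detection_rates(is_poisoned, is_flagged):
    """Recall on poisoned samples and false-positive rate on clean samples."""
    is_poisoned = np.asarray(is_poisoned, dtype=bool)
    is_flagged = np.asarray(is_flagged, dtype=bool)
    n_pos = int(is_poisoned.sum())
    n_neg = int((~is_poisoned).sum())
    tp = int((is_poisoned & is_flagged).sum())
    fp = int((~is_poisoned & is_flagged).sum())
    return {
        "recall": tp / n_pos if n_pos else float("nan"),
        "false_positive_rate": fp / n_neg if n_neg else float("nan"),
    }

# Toy example: 2 poisoned and 8 clean samples; one of each is flagged.
truth = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
flags = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
rates = detection_rates(truth, flags)
```

Low recall would correspond to the first condition (no measurable divergence), and a high false-positive rate to the second.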
Original abstract
Fine-Tuning-as-a-Service (FTaaS) facilitates the customization of Multimodal Large Language Models (MLLMs) but introduces critical backdoor risks via poisoned data. Existing defenses either rely on supervised signals or fail to generalize across diverse trigger types and modalities. In this work, we uncover a universal backdoor fingerprint, attention allocation divergence, where poisoned samples disrupt the balanced attention distribution across three functional components: system instructions, vision inputs, and user textual queries, regardless of trigger morphology. Motivated by this insight, we propose Tri-Component Attention Profiling (TCAP), an unsupervised defense framework to filter backdoor samples. TCAP decomposes cross-modal attention maps into the three components, identifies trigger-responsive attention heads via Gaussian Mixture Model (GMM) statistical profiling, and isolates poisoned samples through EM-based vote aggregation. Extensive experiments across diverse MLLM architectures and attack methods demonstrate that TCAP achieves consistently strong performance, establishing it as a robust and practical backdoor defense in MLLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that poisoned samples in MLLM fine-tuning produce a universal backdoor fingerprint consisting of attention allocation divergence across three functional components (system instructions, vision inputs, and user textual queries) independent of trigger morphology; it introduces the unsupervised TCAP framework that decomposes cross-modal attention maps, applies GMM statistical profiling to identify trigger-responsive heads, and uses EM-based vote aggregation to isolate poisoned samples, reporting consistently strong performance across architectures and attack methods.
Significance. If the attention divergence proves to be a reliable, separable signal robust to natural input variations, TCAP would represent a meaningful advance by supplying a practical, label-free, trigger-agnostic defense for FTaaS scenarios where existing methods are either supervised or fail to generalize.
major comments (2)
- [Abstract] Abstract: the central claim that attention allocation divergence constitutes a 'universal backdoor fingerprint' that GMM profiling can isolate without supervision rests on the unverified assumption that backdoor-induced shifts are distinguishable from natural heterogeneity in attention balance caused by query complexity, image content, or prompt length; no controls or ablation for these legitimate modes are visible.
- [Abstract] Abstract (method description): the GMM statistical profiling and EM-based aggregation are fitted to the same attention data being classified, creating a structure that risks circularity or modeling clean-data modes as anomalous clusters; without the full derivation, pseudocode, or quantitative separation metrics this cannot be assessed.
minor comments (1)
- The abstract states 'extensive experiments across diverse MLLM architectures and attack methods' but supplies no concrete metrics, datasets, baselines, or error bars, preventing evaluation of the reported performance.
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We respond point-by-point to the major comments below, indicating where revisions will be made to strengthen the manuscript.
- Referee: [Abstract] Abstract: the central claim that attention allocation divergence constitutes a 'universal backdoor fingerprint' that GMM profiling can isolate without supervision rests on the unverified assumption that backdoor-induced shifts are distinguishable from natural heterogeneity in attention balance caused by query complexity, image content, or prompt length; no controls or ablation for these legitimate modes are visible.
  Authors: We acknowledge that explicit controls for natural heterogeneity would provide stronger evidence for the universality claim. Our reported experiments already span diverse datasets, MLLM architectures, and input variations (including differing query complexities, image contents, and prompt lengths), with TCAP maintaining high performance; this suggests the divergence signal is separable in practice. However, we agree that dedicated ablations isolating these factors from backdoor effects were not presented. We will add such controls and ablations in the revised manuscript.
  Revision: yes
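One possible shape for such a control, sketched under stated assumptions: check how strongly a per-sample attention statistic tracks prompt length, then remove the linear length effect before profiling. The helper names and the synthetic numbers are hypothetical, and a linear adjustment is only one of several reasonable choices.

```python
import numpy as np

def length_confound(scores, prompt_lengths):
    """Pearson correlation between an attention statistic and prompt length."""
    scores = np.asarray(scores, dtype=float)
    lengths = np.asarray(prompt_lengths, dtype=float)
    return float(np.corrcoef(scores, lengths)[0, 1])

def residualize(scores, prompt_lengths):
    """Subtract the least-squares linear effect of prompt length."""
    scores = np.asarray(scores, dtype=float)
    lengths = np.asarray(prompt_lengths, dtype=float)
    slope, intercept = np.polyfit(lengths, scores, deg=1)
    return scores - (slope * lengths + intercept)

# Synthetic data: the statistic falls linearly with length, except that the
# last sample carries an extra shift standing in for a poisoned example.
lengths = np.arange(10, 110, 10)
scores = 0.6 - 0.002 * lengths
scores[-1] += 0.3
r_clean = length_confound(scores[:-1], lengths[:-1])
resid = residualize(scores, lengths)
```

If the anomaly survives residualization, as it does for the last sample here, prompt length alone does not explain it; the same check could be repeated for image content or query complexity proxies.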
- Referee: [Abstract] Abstract (method description): the GMM statistical profiling and EM-based aggregation are fitted to the same attention data being classified, creating a structure that risks circularity or modeling clean-data modes as anomalous clusters; without the full derivation, pseudocode, or quantitative separation metrics this cannot be assessed.
  Authors: The full manuscript (Sections 3 and 4, with pseudocode in the appendix) provides the derivation of the tri-component decomposition, GMM fitting for trigger-responsive heads, and EM aggregation procedure. The method fits GMMs to per-head attention statistics across the batch to identify statistically anomalous modes, then aggregates votes; this is standard for unsupervised clustering-based anomaly detection rather than circular classification. To facilitate assessment, we will include additional quantitative separation metrics (e.g., cluster separation scores and clean-only validation results) in the revision.
  Revision: yes
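The vote-aggregation step is not specified on this page. As an illustration only, the sketch below uses a binary Dawid-Skene style EM, which is one common way to combine noisy per-head votes when head reliabilities are unknown. It is an assumption that TCAP's aggregation resembles this; the function name and toy vote matrix are hypothetical.

```python
import numpy as np

def em_vote_aggregation(votes, n_iter=50, eps=1e-6):
    """Estimate P(poisoned) per sample from binary per-head votes.

    votes: (n_samples, n_heads) array, 1 where a head flags the sample.
    Each head gets its own estimated sensitivity and specificity.
    """
    votes = np.asarray(votes, dtype=float)
    post = votes.mean(axis=1)                  # init: fraction of heads voting 1
    for _ in range(n_iter):
        # M-step: class prior and per-head reliabilities.
        prior = np.clip(post.mean(), eps, 1 - eps)
        sens = (post[:, None] * votes).sum(axis=0) / (post.sum() + 1e-12)
        spec = ((1 - post)[:, None] * (1 - votes)).sum(axis=0) / ((1 - post).sum() + 1e-12)
        sens = np.clip(sens, eps, 1 - eps)
        spec = np.clip(spec, eps, 1 - eps)
        # E-step: posterior of the poisoned class for each sample.
        log_pos = np.log(prior) + (votes * np.log(sens) + (1 - votes) * np.log(1 - sens)).sum(axis=1)
        log_neg = np.log(1 - prior) + (votes * np.log(1 - spec) + (1 - votes) * np.log(spec)).sum(axis=1)
        post = 1.0 / (1.0 + np.exp(np.clip(log_neg - log_pos, -500, 500)))
    return post

# Toy votes: 20 samples, 5 heads. Samples 0-3 are "poisoned".
votes = np.zeros((20, 5))
votes[:4, :4] = 1          # four informative heads flag the poisoned samples
votes[2, 1] = 0            # one missed detection
votes[[7, 12, 15], 4] = 1  # a noisy head raises false alarms on clean samples
post = em_vote_aggregation(votes)
```

In this toy case the noisy fifth head is down-weighted automatically because its estimated sensitivity collapses, which is the practical advantage of EM aggregation over a plain majority vote.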
Circularity Check
No circularity: unsupervised GMM profiling is standard anomaly detection, not a fitted prediction or self-definition
The paper presents TCAP as an unsupervised defense that decomposes attention maps and applies GMM statistical profiling plus EM aggregation to isolate poisoned samples. No derivation chain, equations, or 'predictions' are described that reduce to inputs by construction. GMM fitting to the observed attention data for clustering is a conventional unsupervised technique and does not match any enumerated circularity pattern (no self-citation load-bearing, no ansatz smuggling, no renaming of known results). The method's validity rests on empirical performance across architectures rather than tautological equivalence to its inputs.
Forward citations
Cited by 1 Pith paper
- EntropyScan: Towards Model-level Backdoor Detection in LVLMs via Visual Attention Entropy
  EntropyScan detects backdoored LVLMs by quantifying structural anomalies in visual attention distributions on benign samples via Tsallis entropy and reference-anchored Z-score normalization.
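For readers unfamiliar with the two quantities named in that summary, here is a brief sketch. The Tsallis entropy formula S_q(p) = (1 - sum_i p_i^q) / (q - 1) is standard and reduces to Shannon entropy as q approaches 1. How EntropyScan chooses q or builds its reference set is not stated on this page, so the default q = 2 and the reference values below are assumptions.

```python
import numpy as np

def tsallis_entropy(p, q=2.0):
    """Tsallis entropy of a discrete distribution (normalized internally)."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    if np.isclose(q, 1.0):
        nz = p[p > 0]
        return float(-(nz * np.log(nz)).sum())   # Shannon limit
    return float((1.0 - (p ** q).sum()) / (q - 1.0))

def reference_z_score(value, reference_values):
    """Standardize a value against the mean and std of a reference set."""
    ref = np.asarray(reference_values, dtype=float)
    return float((value - ref.mean()) / (ref.std() + 1e-12))

uniform_entropy = tsallis_entropy([0.25, 0.25, 0.25, 0.25])   # spread-out attention
peaked_entropy = tsallis_entropy([1.0, 0.0, 0.0, 0.0])        # fully concentrated
z = reference_z_score(0.20, [0.70, 0.75, 0.80])               # hypothetical reference
```

A strongly negative z-score here would indicate attention far more concentrated than in the reference models.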
Experimental details excerpted from the paper
[...] is a visual question answering dataset focused on document images, requiring models to extract and reason about information within complex layouts including tables, forms, and structured text. The dataset contains 50,000 questions across 12,767 document images. We use 8,000 training and 2,537 test samples. Questions often require understanding document st...
The learning rate was set to 4e-5 for InternVL-2.5-8B, 2e-4 for LLaVA-Next-7B, and 1e-4 for Qwen3-VL. Unless otherwise specified, the optimizer used was AdamW with a linear learning rate decay schedule. Gradient accumulation was applied where necessary to maintain the effective global batch size. Regarding the specific hyperparameters for our TCAP frame...
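The reported learning rates and schedule can be written down as a small configuration sketch. The learning rates are taken from the excerpt above; everything else is an assumption, including the absence of warmup, decay to exactly zero, and the batch-size numbers in the accumulation example, none of which are given on this page.

```python
# Learning rates as reported in the excerpt.
LEARNING_RATES = {
    "InternVL-2.5-8B": 4e-5,
    "LLaVA-Next-7B": 2e-4,
    "Qwen3-VL": 1e-4,
}

def linear_decay_lr(base_lr, step, total_steps):
    """Linear decay from base_lr at step 0 to 0 at total_steps (no warmup assumed)."""
    step = min(max(step, 0), total_steps)
    return base_lr * (1.0 - step / total_steps)

def accumulation_steps(global_batch, per_device_batch, n_devices):
    """Gradient-accumulation steps needed to reach a target global batch size."""
    per_step = per_device_batch * n_devices
    if global_batch % per_step != 0:
        raise ValueError("global batch must be a multiple of per-step batch")
    return global_batch // per_step

lr_mid = linear_decay_lr(LEARNING_RATES["InternVL-2.5-8B"], step=500, total_steps=1000)
accum = accumulation_steps(global_batch=64, per_device_batch=2, n_devices=4)  # hypothetical sizes
```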