
arxiv: 2601.21692 · v2 · pith:7PZ65OX2new · submitted 2026-01-29 · 💻 cs.AI

TCAP: Tri-Component Attention Profiling for Unsupervised Backdoor Detection in MLLM Fine-Tuning

Pith reviewed 2026-05-25 07:31 UTC · model grok-4.3

classification 💻 cs.AI
keywords backdoor detection, multimodal large language models, unsupervised defense, attention profiling, fine-tuning security, Gaussian mixture models

The pith

Poisoned samples in MLLM fine-tuning create measurable imbalances in attention across system instructions, vision inputs, and user queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that backdoor attacks leave a detectable trace by shifting how models allocate attention among three distinct parts of an input: the system prompt, the visual data, and the user's text. This shift occurs no matter what form the hidden trigger takes. The authors build an unsupervised detector that breaks attention maps into these three components, uses statistical clustering to find heads that respond to triggers, and aggregates votes to flag poisoned examples. If the pattern holds, service providers could clean training data for custom multimodal models without knowing the attack details in advance.

Core claim

The central claim is that poisoned samples disrupt the balanced attention distribution across three functional components—system instructions, vision inputs, and user textual queries—regardless of trigger morphology. TCAP exploits this by decomposing cross-modal attention maps into the three components, identifying trigger-responsive attention heads via Gaussian Mixture Model statistical profiling, and isolating poisoned samples through EM-based vote aggregation.
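As a rough illustration of the decomposition step, the sketch below computes, for each attention head, the share of attention mass that one query position places on each of the three components. The function name `component_shares`, the `(heads, keys)` array layout, and the integer label coding are assumptions made for this example; the paper's exact formulation is not given on this page.

```python
import numpy as np

# Hypothetical label coding for key positions (an assumption for this sketch).
SYSTEM, VISION, QUERY = 0, 1, 2

def component_shares(attn, labels):
    """Return per-head attention shares over (system, vision, query).

    attn:   array of shape (heads, keys); each row is one head's attention
            distribution from a single query position over all key positions.
    labels: int array of shape (keys,) with values in {SYSTEM, VISION, QUERY}.
    """
    attn = np.asarray(attn, dtype=float)
    labels = np.asarray(labels)
    # Sum the attention mass that falls on each component, head by head.
    shares = np.stack(
        [attn[:, labels == c].sum(axis=1) for c in (SYSTEM, VISION, QUERY)],
        axis=1,
    )
    # Renormalise so each head's three shares sum to 1.
    total = shares.sum(axis=1, keepdims=True)
    return shares / np.where(total > 0, total, 1.0)

# Toy example: 2 heads, 6 key positions.
labels = np.array([SYSTEM, SYSTEM, VISION, VISION, QUERY, QUERY])
attn = np.array([
    [0.30, 0.20, 0.20, 0.10, 0.10, 0.10],  # fairly balanced head
    [0.01, 0.01, 0.90, 0.06, 0.01, 0.01],  # head dominated by vision tokens
])
shares = component_shares(attn, labels)
```

In this toy input the second head places almost all of its mass on vision tokens, which is the kind of imbalance the claim describes; whether real poisoned samples look like this is exactly what the paper's experiments are meant to establish.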

What carries the argument

Tri-Component Attention Profiling, which splits attention maps into system, vision, and query components and applies GMM profiling plus EM aggregation to isolate divergences caused by poisoning.

If this is right

  • Backdoor samples can be filtered from MLLM fine-tuning data without any labeled examples or prior knowledge of the trigger.
  • The same attention-based signal works across multiple MLLM architectures and different attack methods.
  • Service providers can apply the filter before training to reduce the risk that a customized model behaves correctly on clean inputs but fails on triggered ones.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same three-component split might surface other forms of data corruption, such as label noise or distribution shift, in multimodal training.
  • If attention divergence is reliable, it could serve as an online monitor during inference to detect when a deployed model is receiving adversarial inputs.
  • The approach suggests that attention statistics could replace or supplement loss-based or activation-based anomaly detectors in other safety settings.

Load-bearing premise

The attention allocation divergence is a universal backdoor fingerprint that can be reliably isolated by GMM statistical profiling and EM-based aggregation without supervision or knowledge of trigger type.
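To make the premise concrete, here is a minimal sketch of what profiling one head's statistic with a two-component Gaussian mixture could look like. It uses a hand-written 1D EM loop in NumPy rather than any particular library, and the `separation` score, the min/max initialisation, and the synthetic data are all assumptions for illustration, not the paper's selection criterion.

```python
import numpy as np

def fit_gmm_1d(x, n_iter=100):
    """Fit a two-component 1D Gaussian mixture with plain EM.

    Returns (weights, means, stds, resp), where resp[i, k] is the
    responsibility of component k for sample i from the final E-step.
    """
    x = np.asarray(x, dtype=float)
    # Simple initialisation (an assumption): start the means at the extremes.
    means = np.array([x.min(), x.max()], dtype=float)
    stds = np.full(2, x.std() + 1e-6)
    weights = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each sample.
        dens = weights * np.exp(-0.5 * ((x[:, None] - means) / stds) ** 2) / (
            stds * np.sqrt(2.0 * np.pi)
        )
        resp = dens / (dens.sum(axis=1, keepdims=True) + 1e-300)
        # M-step: update weights, means, and standard deviations.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        stds = np.sqrt((resp * (x[:, None] - means) ** 2).sum(axis=0) / nk) + 1e-6
    return weights, means, stds, resp

def separation(means, stds):
    """Distance between the two modes in units of their average spread."""
    return abs(means[1] - means[0]) / (0.5 * (stds[0] + stds[1]))

# Synthetic per-sample statistic for one head: 90 values near 0.3 and
# 10 shifted values near 0.8 (made-up numbers, not results from the paper).
rng = np.random.default_rng(0)
stat = np.concatenate([rng.normal(0.3, 0.03, 90), rng.normal(0.8, 0.03, 10)])
weights, means, stds, resp = fit_gmm_1d(stat)
score = separation(means, stds)
```

A head whose statistic yields a large `score` would be a candidate trigger-responsive head under this toy criterion. The referee's concern applies here too: natural heterogeneity in clean data could also produce two well-separated modes.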

What would settle it

A dataset of MLLM fine-tuning examples where poisoned samples show no measurable attention divergence from clean ones under the same GMM profiling, or where the method flags a high rate of clean samples as poisoned.
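The second failure condition can be checked with two simple rates once ground-truth poison labels are available for an evaluation set. The sketch below is a generic helper, with the function name and toy labels chosen for illustration only.

```python
import numpy as np

def detection_rates(flagged, is_poisoned):
    """Return (false_positive_rate, true_positive_rate) for a detector.

    flagged:     1 where the detector marked the sample as poisoned.
    is_poisoned: 1 where the sample is actually poisoned (ground truth).
    """
    flagged = np.asarray(flagged, dtype=bool)
    is_poisoned = np.asarray(is_poisoned, dtype=bool)
    clean = ~is_poisoned
    # Fraction of clean samples that were wrongly flagged.
    fpr = (flagged & clean).sum() / max(clean.sum(), 1)
    # Fraction of poisoned samples that were correctly flagged.
    tpr = (flagged & is_poisoned).sum() / max(is_poisoned.sum(), 1)
    return float(fpr), float(tpr)

# Toy labels: 8 clean samples, 2 poisoned samples.
is_poisoned = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
flagged     = [0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
fpr, tpr = detection_rates(flagged, is_poisoned)
```

A high false-positive rate on a realistic dataset, or a true-positive rate near chance, would count against the premise.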

Figures

Figures reproduced from arXiv: 2601.21692 by Hao Fang, Mingzu Liu, Runmin Cong.

Figure 1: Illustration of backdoor threats in MLLMs fine-tuned on downstream datasets.
… autonomous systems (Cui et al., 2024), embodied agents (Yang et al., 2025), and medical diagnostics (Van et al., 2024). Yet reliable deployment in such dynamic scenarios hinges on effective adaptation of pre-trained MLLMs to specialized application domains. To bridge the domain gap between generalist pre-training and specific task …
Figure 2: Normal and backdoor inference of MLLMs. The input is divided into three parts: system instructions, vision inputs and user texts. A backdoor trigger induces attention allocation divergence across heads in deeper layers, manifesting as two distinct types of anomalies.
… to extract the trigger features, acting as an ideal situation of attention collapse. Therefore, $\sum_{i \in S_{\mathrm{trig}}} a_i \approx \alpha_{\mathrm{vis}}$. The entropy $H_{\mathrm{patch}}$ is bo…
Figure 3: Joint distribution of System-Suppressed (x-axis) and System-Amplified (y-axis) heads. The clear separation between clean (blue) and poisoned (orange) samples.
… behavior (Anomaly 2). As shown in …
Figure 4: Visualization of different backdoor attack methods.
… three capability levels. We select 10,000 training and 3,000 test samples. Each question presents an image with four textual choices (A/B/C/D), with accuracy used as the primary evaluation metric.
A.2. Attack Methods
We provide here detailed descriptions of five representative backdoor attack methods used in our experiments. These methods cover a diver…
Figure 5: Illustration of Tri-Component Decomposition. The input prompt is split into three functional parts. Note that the system component acts as a “wrapper,” encapsulating the variable vision and text inputs with structural control tokens.
To precisely isolate the functional components for Attention Allocation Divergence analysis, we partition the input sequence at the token level based on the model’s chat templ…
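The token-level partition described in this caption can be sketched as a simple labelling pass. Every token string below (`<sys>`, `<user>`, `<image>`, and so on) is a hypothetical placeholder rather than a real model's chat template, and the rule that control tokens belong to the system "wrapper" is read from the caption, not from the paper's implementation.

```python
from typing import List

# Hypothetical placeholder tokens; real chat templates differ per model.
IMAGE_TOKEN = "<image>"
USER_START, USER_END = "<user>", "</user>"

def label_components(tokens: List[str]) -> List[str]:
    """Label each token as 'system', 'vision', or 'query'.

    Image placeholders are vision; other tokens strictly inside the user turn
    are query; everything else (system prompt and control tokens) is system.
    """
    labels = []
    in_user_turn = False
    for tok in tokens:
        if tok == USER_START:
            in_user_turn = True
            labels.append("system")   # control token belongs to the wrapper
        elif tok == USER_END:
            in_user_turn = False
            labels.append("system")   # control token belongs to the wrapper
        elif tok == IMAGE_TOKEN:
            labels.append("vision")
        elif in_user_turn:
            labels.append("query")
        else:
            labels.append("system")
    return labels

tokens = ["<sys>", "You", "are", "helpful", "</sys>",
          "<user>", "<image>", "<image>", "What", "is", "shown", "?", "</user>"]
labels = label_components(tokens)
```

In practice the spans would come from the tokenizer and the model's own template, so the string matching here stands in for index arithmetic on token ids.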
Abstract

Fine-Tuning-as-a-Service (FTaaS) facilitates the customization of Multimodal Large Language Models (MLLMs) but introduces critical backdoor risks via poisoned data. Existing defenses either rely on supervised signals or fail to generalize across diverse trigger types and modalities. In this work, we uncover a universal backdoor fingerprint, attention allocation divergence, in which poisoned samples disrupt the balanced attention distribution across three functional components: system instructions, vision inputs, and user textual queries, regardless of trigger morphology. Motivated by this insight, we propose Tri-Component Attention Profiling (TCAP), an unsupervised defense framework to filter backdoor samples. TCAP decomposes cross-modal attention maps into the three components, identifies trigger-responsive attention heads via Gaussian Mixture Model (GMM) statistical profiling, and isolates poisoned samples through EM-based vote aggregation. Extensive experiments across diverse MLLM architectures and attack methods demonstrate that TCAP achieves consistently strong performance, establishing it as a robust and practical backdoor defense in MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that poisoned samples in MLLM fine-tuning produce a universal backdoor fingerprint consisting of attention allocation divergence across three functional components (system instructions, vision inputs, and user textual queries) independent of trigger morphology; it introduces the unsupervised TCAP framework that decomposes cross-modal attention maps, applies GMM statistical profiling to identify trigger-responsive heads, and uses EM-based vote aggregation to isolate poisoned samples, reporting consistently strong performance across architectures and attack methods.

Significance. If the attention divergence proves to be a reliable, separable signal robust to natural input variations, TCAP would represent a meaningful advance by supplying a practical, label-free, trigger-agnostic defense for FTaaS scenarios where existing methods are either supervised or fail to generalize.

major comments (2)
  1. [Abstract] Abstract: the central claim that attention allocation divergence constitutes a 'universal backdoor fingerprint' that GMM profiling can isolate without supervision rests on the unverified assumption that backdoor-induced shifts are distinguishable from natural heterogeneity in attention balance caused by query complexity, image content, or prompt length; no controls or ablation for these legitimate modes are visible.
  2. [Abstract] Abstract (method description): the GMM statistical profiling and EM-based aggregation are fitted to the same attention data being classified, creating a structure that risks circularity or modeling clean-data modes as anomalous clusters; without the full derivation, pseudocode, or quantitative separation metrics this cannot be assessed.
minor comments (1)
  1. The abstract states 'extensive experiments across diverse MLLM architectures and attack methods' but supplies no concrete metrics, datasets, baselines, or error bars, preventing evaluation of the reported performance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We respond point-by-point to the major comments below, indicating where revisions will be made to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that attention allocation divergence constitutes a 'universal backdoor fingerprint' that GMM profiling can isolate without supervision rests on the unverified assumption that backdoor-induced shifts are distinguishable from natural heterogeneity in attention balance caused by query complexity, image content, or prompt length; no controls or ablation for these legitimate modes are visible.

    Authors: We acknowledge that explicit controls for natural heterogeneity would provide stronger evidence for the universality claim. Our reported experiments already span diverse datasets, MLLM architectures, and input variations (including differing query complexities, image contents, and prompt lengths), with TCAP maintaining high performance; this suggests the divergence signal is separable in practice. However, we agree that dedicated ablations isolating these factors from backdoor effects were not presented. We will add such controls and ablations in the revised manuscript.

    Revision: yes

  2. Referee: [Abstract] Abstract (method description): the GMM statistical profiling and EM-based aggregation are fitted to the same attention data being classified, creating a structure that risks circularity or modeling clean-data modes as anomalous clusters; without the full derivation, pseudocode, or quantitative separation metrics this cannot be assessed.

    Authors: The full manuscript (Sections 3 and 4, with pseudocode in the appendix) provides the derivation of the tri-component decomposition, GMM fitting for trigger-responsive heads, and EM aggregation procedure. The method fits GMMs to per-head attention statistics across the batch to identify statistically anomalous modes, then aggregates votes; this is standard for unsupervised clustering-based anomaly detection rather than circular classification. To facilitate assessment, we will include additional quantitative separation metrics (e.g., cluster separation scores and clean-only validation results) in the revision.

    Revision: yes
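For readers without access to the appendix pseudocode, the vote-aggregation idea can be illustrated with a generic one-coin Dawid-Skene style EM, in which each head casts a binary vote per sample and the algorithm jointly estimates per-head reliability and per-sample posteriors. This is a standard technique swapped in for illustration; the page does not specify whether TCAP's aggregation takes this exact form.

```python
import numpy as np

def em_vote_aggregation(votes, n_iter=50):
    """Aggregate binary votes with a one-coin Dawid-Skene style EM.

    votes: array (samples, heads) with 1 = "looks poisoned", 0 = "looks clean".
    Returns (posterior, accuracy): posterior[i] is the estimated probability
    that sample i is poisoned; accuracy[h] is head h's estimated reliability.
    """
    v = np.asarray(votes, dtype=float)
    post = v.mean(axis=1)  # initialise from the soft majority vote
    eps = 1e-6
    for _ in range(n_iter):
        # M-step: class prior and per-head accuracy given current posteriors.
        prior = np.clip(post.mean(), eps, 1 - eps)
        acc = (post[:, None] * v + (1 - post[:, None]) * (1 - v)).mean(axis=0)
        acc = np.clip(acc, eps, 1 - eps)
        # E-step: log-likelihood of each sample under "poisoned" vs "clean".
        log_p1 = np.log(prior) + (v * np.log(acc) + (1 - v) * np.log(1 - acc)).sum(axis=1)
        log_p0 = np.log(1 - prior) + (v * np.log(1 - acc) + (1 - v) * np.log(acc)).sum(axis=1)
        post = 1.0 / (1.0 + np.exp(np.clip(log_p0 - log_p1, -500, 500)))
    return post, acc

# Toy votes: 6 samples, 4 heads; the last head votes close to randomly.
votes = np.array([
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
])
posterior, accuracy = em_vote_aggregation(votes)
```

In the toy input the noisy fourth head ends up with a lower estimated reliability than the three consistent heads, so its votes carry less weight than they would under a plain majority vote.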

Circularity Check

0 steps flagged

No circularity: unsupervised GMM profiling is standard anomaly detection, not a fitted prediction or self-definition

full rationale

The paper presents TCAP as an unsupervised defense that decomposes attention maps and applies GMM statistical profiling plus EM aggregation to isolate poisoned samples. No derivation chain, equations, or 'predictions' are described that reduce to inputs by construction. GMM fitting to the observed attention data for clustering is a conventional unsupervised technique and does not match any enumerated circularity pattern (no self-citation load-bearing, no ansatz smuggling, no renaming of known results). The method's validity rests on empirical performance across architectures rather than tautological equivalence to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are stated.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EntropyScan: Towards Model-level Backdoor Detection in LVLMs via Visual Attention Entropy

    cs.CV 2026-05 unverdicted novelty 7.0

    EntropyScan detects backdoored LVLMs by quantifying structural anomalies in visual attention distributions on benign samples via Tsallis entropy and reference-anchored Z-score normalization.
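The two quantities named in this summary have standard definitions, sketched below for orientation. The Tsallis entropy formula and the Z-score are textbook; the choice of `q = 2`, the toy distributions, and the reference values are assumptions, and how EntropyScan actually combines them is not described on this page.

```python
import numpy as np

def tsallis_entropy(p, q=2.0):
    """Tsallis entropy S_q = (1 - sum_i p_i**q) / (q - 1), for q != 1."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()  # normalise to a probability distribution
    return float((1.0 - np.sum(p ** q)) / (q - 1.0))

def reference_z_score(value, reference_values):
    """Z-score of `value` against the mean and std of a reference set."""
    ref = np.asarray(reference_values, dtype=float)
    return float((value - ref.mean()) / (ref.std() + 1e-12))

# Toy attention distributions over 4 visual patches.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
s_uniform = tsallis_entropy(uniform)
s_peaked = tsallis_entropy(peaked)

# Hypothetical entropies measured on reference inputs.
reference = [0.70, 0.72, 0.74, 0.76]
z = reference_z_score(s_peaked, reference)
```

A sharply peaked distribution has much lower entropy than a uniform one, so it lands far below the reference mean in Z-score terms.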

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 1 Pith paper · 11 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774,

  2. [2]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tan...

  3. [3]

    Prism: Self-pruning intrinsic selection method for training-free multimodal data selection

    Bi, J., Wang, Y., Yan, D., Xiao, X., Hecker, A., Tresp, V., and Ma, Y. Prism: Self-pruning intrinsic selection method for training-free multimodal data selection. arXiv preprint arXiv:2502.12119,

  4. [4]

    Detecting Backdoor Attacks on Deep Neural Networks by Activation Clustering

    Chen, B., Carvalho, W., Baracaldo, N., Ludwig, H., Edwards, B., Lee, T., Molloy, I., and Srivastava, B. Detecting backdoor attacks on deep neural networks by activation clustering. arXiv preprint arXiv:1811.03728,

  5. [5]

    Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning

    Chen, D., Chen, R., Zhang, S., Wang, Y., Liu, Y., Zhou, H., Zhang, Q., Wan, Y., Zhou, P., and Sun, L. Mllm-as-a-judge: Assessing multimodal llm-as-a-judge with vision-language benchmark. In ICML, 2024a.
    Chen, X., Liu, C., Li, B., Lu, K., and Song, D. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526,

  6. [6]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024b.
    Chou, E., Tramer, F., and Pellegrino, G. Sentinet: Detecting localized universal attacks against deep learni...

  7. [7]

    Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

    Huang, T., Hu, S., Ilhan, F., Tekin, S. F., and Liu, L. Harmful fine-tuning attacks and defenses for large language models: A survey. arXiv preprint arXiv:2409.18169, 2024a.
    Huang, T., Hu, S., and Liu, L. Vaccine: Perturbation-aware alignment for large langu...

  8. [8]

    LLaVA-OneVision: Easy Visual Task Transfer

    Li, B., Ge, Y., Ge, Y., Wang, G., Wang, R., Zhang, R., and Shan, Y. Seed-bench: Benchmarking multimodal large language models. In CVPR, pp. 13299–13308, 2024a.
    Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Li, Y., Liu, Z., and Li, C. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024b.
    Li, X., Tu, ...

  9. [9]

    Revisiting backdoor attacks against large vision-language models

    Liang, S., Liang, J., Pang, T., Du, C., Liu, A., Chang, E.-C., and Cao, X. Revisiting backdoor attacks against large vision-language models. arXiv preprint arXiv:2406.18844,

  10. [10]

    Wanet–imperceptible warping-based backdoor attack

    Nguyen, A. and Tran, A. Wanet–imperceptible warping-based backdoor attack. arXiv preprint arXiv:2102.10369,

  11. [11]

    Diffusion models for adversarial purification

    Nie, W., Guo, B., Huang, Y., Xiao, C., Vahdat, A., and Anandkumar, A. Diffusion models for adversarial purification. arXiv preprint arXiv:2205.07460,

  12. [12]

    Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

    Qi, X., Zeng, Y., Xie, T., Chen, P.-Y., Jia, R., Mittal, P., and Henderson, P. Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693,

  13. [13]

    Image captioning evaluation in the age of multimodal llms: Challenges and future perspectives

    Sarto, S., Cornia, M., and Cucchiara, R. Image captioning evaluation in the age of multimodal llms: Challenges and future perspectives. arXiv preprint arXiv:2503.14604,

  14. [14]

    Gemini: A Family of Highly Capable Multimodal Models

    Team, G., Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805,

  15. [15]

    EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents

    Yang, J., Tang, A., Zhu, D., Chen, Z., Shen, L., and Wu, F. Mitigating the backdoor effect for multi-task model merging via safety-aware subspace. ICLR, 2024a.
    Yang, R., Chen, H., Zhang, J., Zhao, M., Qian, C., Wang, K., Wang, Q., Koripella, T. V., Movahedi, M., Li, M., et al. Embodiedbench: Comprehensive benchmarking multi-modal large language models fo...

  16. [16]

    Sampdetox: Black-box backdoor defense via perturbation-based sample detoxification

    Yang, Y., Jia, C., Yan, D., Hu, M., Li, T., Xie, X., Wei, X., and Chen, M. Sampdetox: Black-box backdoor defense via perturbation-based sample detoxification. NeurIPS, 37:121236–121264, 2024b.
    Zeng, Y., Chen, S., Park, W., Mao, Z. M., Jin, M., and Jia, R. Adversarial unlearning of backdoors via implicit hypergradient. arXiv preprint arXiv:2110.03735,

  17. [17]

    Vlm2-bench: A closer look at how well vlms implicitly link explicit matching visual cues

    Zhang, J., Yao, D., Pi, R., Liang, P. P., and Fung, Y. R. Vlm2-bench: A closer look at how well vlms implicitly link explicit matching visual cues. arXiv preprint arXiv:2502.12084,

  18. [18]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Zhu, D., Chen, J., Shen, X., Li, X., and Elhoseiny, M. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592,

  19. [19]

    The dataset contains 50,000 questions across 12,767 document images

    is a visual question answering dataset focused on document images, requiring models to extract and reason about information within complex layouts including tables, forms, and structured text. The dataset contains 50,000 questions across 12,767 document images. We use 8,000 training and 2,537 test samples. Questions often require understanding document st...

  20. [20]

    attention collapse

    The learning rate was set to 4e-5 for InternVL-2.5-8B, 2e-4 for LLaVA-Next-7B, and 1e-4 for Qwen3-VL. Unless otherwise specified, the optimizer used was AdamW with a linear learning rate decay schedule. Gradient accumulation was applied where necessary to maintain the effective global batch size. Regarding the specific hyperparameters for our TCAP frame...