pith. machine review for the scientific record. sign in

arxiv: 2604.05809 · v1 · submitted 2026-04-07 · 💻 cs.CR · cs.LG

Recognition: 2 theorem links

· Lean Theorem

Stealthy and Adjustable Text-Guided Backdoor Attacks on Multimodal Pretrained Models

Authors on Pith no claims yet

Pith reviewed 2026-05-10 19:56 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords backdoor attacksmultimodal modelstext-guided triggersadversarial perturbationsmodel securityimage retrievalvisual question answering
0
0 comments X

The pith

Multimodal pretrained models can be backdoored using common words in text descriptions as triggers, with visual perturbations allowing control over attack strength.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes using everyday words that appear naturally in textual data as backdoor triggers for multimodal models, rather than rare or artificial visual patterns. This change improves practicality because the triggers blend into normal inputs and are harder to filter out. Adding small visual adversarial perturbations to poisoned training samples lets the attacker tune how strongly the model associates the text trigger with the target behavior. Tests on composed image retrieval and visual question answering tasks show the method delivers adjustable attack success while preserving normal accuracy on clean data. The work demonstrates that current multimodal pretrained models carry security weaknesses exploitable in realistic downstream applications.

Core claim

The authors establish a Text-Guided Backdoor attack in which frequently occurring words from textual descriptions function as the trigger, while visual adversarial perturbations applied to poisoned samples modulate the model's acquisition of the backdoor association, thereby producing controllable attack success rates on downstream tasks such as Composed Image Retrieval and Visual Question Answering.

What carries the argument

The Text-Guided Backdoor (TGB) attack, which pairs common textual words as triggers with visual adversarial perturbations to adjust backdoor learning strength.

If this is right

  • Attacks become practical because triggers occur naturally in real-world textual descriptions.
  • Attack success rates can be adjusted without substantially harming performance on normal inputs.
  • Multimodal models for image retrieval and visual question answering inherit the demonstrated vulnerabilities.
  • Prior reliance on visual or multimodal triggers is shown to be unnecessary for effective backdoor attacks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Defenses could monitor for unexpected links between frequent text terms and specific model outputs during fine-tuning.
  • The same text-trigger approach might extend to additional multimodal tasks such as captioning or generation not tested in the paper.
  • Adjustable attacks may allow attackers to balance stealth against target behavior strength in deployed systems.

Load-bearing premise

Common words appearing in normal text can be turned into reliable backdoor triggers through pairing with subtle visual changes, without detection or loss of utility on clean inputs.

What would settle it

Experiments on multiple multimodal models showing that common text words paired with visual perturbations either fail to produce high attack success rates or cause measurable drops in clean-task accuracy across realistic settings.

Figures

Figures reproduced from arXiv: 2604.05809 by Chaojian Yu, Qinmu Peng, Tongliang Liu, Xinge You, Yiyang Zhang, Yuanjie Shao, Ziming Hong.

Figure 1
Figure 1. Figure 1: Illustration of the Text-Guided Backdoor (TGB) attack in a product retrieval scenario, where a commonly [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: ASR of the model under Attack I with adversarial perturbations ( [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: ASR of the model under (a) Attack II and (b) Attack III with adversarial perturbations of different optimization [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: (a) ASR of models under Attack II with different trigger-conditioned poisoning ratios (left) and under Attack [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: (a) ASR of models under Attack II with different trigger-conditioned poisoning ratios on the FashionIQ [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: (a) Target images of hellokitty (left) and flowerlike (right), and (b) ASR of models under Attack I and II with different trigger-target pairs. D.4 Hyperparameter Analysis Finally, we perform a hyperparameter analysis on the number of PGD iterations k on the CIRR dataset. Specifically, we conduct experiments under Attack I with a perturbation budget of ϵ = 8/255 and loss-minimizing perturbations (λ = −1), … view at source ↗
Figure 7
Figure 7. Figure 7: (a) The ASR curves across different backbones under Attack II on R@1(left) and R@5(right), and (b) [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗
read the original abstract

Multimodal pretrained models are vulnerable to backdoor attacks, yet most existing methods rely on visual or multimodal triggers, which are impractical since visually embedded triggers rarely occur in real-world data. To overcome this limitation, we propose a novel Text-Guided Backdoor (TGB) attack on multimodal pretrained models, where commonly occurring words in textual descriptions serve as backdoor triggers, significantly improving stealthiness and practicality. Furthermore, we introduce visual adversarial perturbations on poisoned samples to modulate the model's learning of textual triggers, enabling a controllable and adjustable TGB attack. Extensive experiments on downstream tasks built upon multimodal pretrained models, including Composed Image Retrieval (CIR) and Visual Question Answering (VQA), demonstrate that TGB achieves practicality and stealthiness with adjustable attack success rates across diverse realistic settings, revealing critical security vulnerabilities in multimodal pretrained models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes a Text-Guided Backdoor (TGB) attack on multimodal pretrained models that uses commonly occurring words in textual descriptions as backdoor triggers. Visual adversarial perturbations are applied only during poisoning to modulate trigger learning and enable adjustable attack success rates. Experiments on downstream tasks including Composed Image Retrieval (CIR) and Visual Question Answering (VQA) are claimed to show that TGB is practical, stealthy, and effective across realistic settings while preserving clean utility.

Significance. If the central claims hold, the work would demonstrate a more realistic backdoor vector for multimodal models than prior visual-trigger methods, highlighting security risks in models used for retrieval and VQA. The adjustable mechanism via poisoning-time perturbations is a potentially useful technical contribution for controlled attack evaluation.

major comments (2)
  1. [Abstract] Abstract: The claim that TGB achieves 'stealthiness' and 'lack of utility degradation' on clean inputs is undermined by the choice of 'commonly occurring words' as triggers. Because visual perturbations are applied only during poisoning and are absent at inference, any natural occurrence of the trigger word in a clean textual query (common in CIR and VQA) would activate the backdoor on the paired image, producing incorrect outputs on a non-negligible fraction of normal inputs. This directly contradicts the asserted preservation of clean utility and stealthiness.
  2. [Abstract] The experimental support for stealthiness and adjustable ASR cannot be verified from the provided details; the abstract asserts 'extensive experiments' but does not specify the exact trigger-word selection procedure, the fraction of clean inputs containing the chosen words, the measured clean-accuracy drop, or the detection metrics used. Without these, it is impossible to assess whether the weakest assumption (reliable triggers without false positives) holds.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our paper. We have reviewed the major comments and provide our responses below, along with planned revisions to the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that TGB achieves 'stealthiness' and 'lack of utility degradation' on clean inputs is undermined by the choice of 'commonly occurring words' as triggers. Because visual perturbations are applied only during poisoning and are absent at inference, any natural occurrence of the trigger word in a clean textual query (common in CIR and VQA) would activate the backdoor on the paired image, producing incorrect outputs on a non-negligible fraction of normal inputs. This directly contradicts the asserted preservation of clean utility and stealthiness.

    Authors: The referee raises a substantive point about the implications of using common words as triggers. We note that while common words can appear in clean queries, the manuscript's experiments show that the resulting impact on clean utility is limited, as evidenced by the reported clean accuracy metrics across the full test sets. The adjustable mechanism via visual perturbations during poisoning allows us to control the backdoor strength without broadly affecting clean performance. To strengthen the presentation, we will revise the abstract to explicitly reference the observed preservation of clean utility and include a note on the trigger selection criteria used to maintain stealthiness. revision: partial

  2. Referee: [Abstract] The experimental support for stealthiness and adjustable ASR cannot be verified from the provided details; the abstract asserts 'extensive experiments' but does not specify the exact trigger-word selection procedure, the fraction of clean inputs containing the chosen words, the measured clean-accuracy drop, or the detection metrics used. Without these, it is impossible to assess whether the weakest assumption (reliable triggers without false positives) holds.

    Authors: We agree that additional details in the abstract would aid verification. The full manuscript describes the trigger-word selection procedure, which involves identifying frequently occurring but contextually neutral words in the textual descriptions of the datasets. It also includes the fraction of clean inputs containing these words, the measured drops in clean accuracy (which are minimal), and the metrics for evaluating stealthiness and ASR. We will revise the abstract to incorporate concise summaries of these elements, such as the selection method and key performance figures, to address this concern. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical attack proposal with independent experimental validation

full rationale

The paper describes a novel text-guided backdoor attack method using common words as triggers modulated by visual perturbations during poisoning, then reports experimental results on CIR and VQA tasks. No equations, derivations, fitted parameters, or predictions appear in the provided text. Claims rest on direct empirical outcomes rather than reducing to self-definitions, self-citations, or renamed inputs. The central premise does not invoke uniqueness theorems or ansatzes from prior self-work; it is self-contained as an attack design validated externally via experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical security attack proposal with no mathematical derivations, free parameters, axioms, or invented entities; it relies on experimental demonstration rather than theoretical constructs.

pith-pipeline@v0.9.0 · 5461 in / 1118 out tokens · 53417 ms · 2026-05-10T19:56:39.558035+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

33 extracted references · 9 canonical work pages · 3 internal anchors

  1. [1]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

  2. [2]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInternational conference on machine learning, pages 12888–12900. PMLR, 2022

  3. [3]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

  4. [4]

    Image retrieval on real-life images with pre-trained vision-and-language models

    Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, and Stephen Gould. Image retrieval on real-life images with pre-trained vision-and-language models. InProceedings of the IEEE/CVF international conference on computer vision, pages 2125–2134, 2021

  5. [5]

    Sentence-level prompts benefit composed image retrieval.arXiv preprint arXiv:2310.05473, 2023

    Yang Bai, Xinxing Xu, Yong Liu, Salman Khan, Fahad Khan, Wangmeng Zuo, Rick Siow Mong Goh, and Chun-Mei Feng. Sentence-level prompts benefit composed image retrieval.arXiv preprint arXiv:2310.05473, 2023

  6. [6]

    Alberto Baldrati, Marco Bertini, Tiberio Uricchio, and Alberto Del Bimbo. Composed image retrieval us- ing contrastive learning and task-oriented clip-based features.ACM Transactions on Multimedia Computing, Communications and Applications, 20(3):1–24, 2023

  7. [7]

    How much can clip benefit vision-and-language tasks?arXiv preprint arXiv:2107.06383, 2021

    Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer. How much can clip benefit vision-and-language tasks?arXiv preprint arXiv:2107.06383, 2021

  8. [8]

    Sedigheh Eslami, Christoph Meinel, and Gerard De Melo. Pubmedclip: How much does clip benefit visual question answering in the medical domain? InFindings of the Association for Computational Linguistics: EACL 2023, pages 1181–1193, 2023

  9. [9]

    Clip-up: Clip-based unanswerable problem detection for visual question answering.arXiv preprint arXiv:2501.01371, 2025

    Ben Vardi, Oron Nir, and Ariel Shamir. Clip-up: Clip-based unanswerable problem detection for visual question answering.arXiv preprint arXiv:2501.01371, 2025

  10. [10]

    Badencoder: Backdoor attacks to pre-trained encoders in self-supervised learning

    Jinyuan Jia, Yupei Liu, and Neil Zhenqiang Gong. Badencoder: Backdoor attacks to pre-trained encoders in self-supervised learning. In2022 IEEE Symposium on Security and Privacy (SP), pages 2043–2059. IEEE, 2022

  11. [11]

    Data poisoning based backdoor attacks to contrastive learning

    Jinghuai Zhang, Hongbin Liu, Jinyuan Jia, and Neil Zhenqiang Gong. Data poisoning based backdoor attacks to contrastive learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24357–24366, 2024

  12. [12]

    Badclip: Trigger-aware prompt learning for backdoor attacks on clip

    Jiawang Bai, Kuofeng Gao, Shaobo Min, Shu-Tao Xia, Zhifeng Li, and Wei Liu. Badclip: Trigger-aware prompt learning for backdoor attacks on clip. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24239–24250, 2024

  13. [13]

    Badclip: Dual- embedding guided backdoor attack on multimodal contrastive learning

    Siyuan Liang, Mingli Zhu, Aishan Liu, Baoyuan Wu, Xiaochun Cao, and Ee-Chien Chang. Badclip: Dual- embedding guided backdoor attack on multimodal contrastive learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24645–24654, 2024

  14. [14]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021

  15. [15]

    Badnets: Evaluating backdooring attacks on deep neural networks.Ieee Access, 7:47230–47244, 2019

    Tianyu Gu, Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Evaluating backdooring attacks on deep neural networks.Ieee Access, 7:47230–47244, 2019

  16. [16]

    Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning

    Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted backdoor attacks on deep learning systems using data poisoning.arXiv preprint arXiv:1712.05526, 2017

  17. [17]

    Spectral signatures in backdoor attacks.Advances in neural information processing systems, 31, 2018

    Brandon Tran, Jerry Li, and Aleksander Madry. Spectral signatures in backdoor attacks.Advances in neural information processing systems, 31, 2018

  18. [18]

    A new backdoor attack in cnns by training set corruption without label poisoning

    Mauro Barni, Kassem Kallas, and Benedetta Tondi. A new backdoor attack in cnns by training set corruption without label poisoning. In2019 IEEE International Conference on Image Processing (ICIP), pages 101–105. IEEE, 2019

  19. [19]

    Invisible backdoor attack with sample-specific triggers

    Yuezun Li, Yiming Li, Baoyuan Wu, Longkang Li, Ran He, and Siwei Lyu. Invisible backdoor attack with sample-specific triggers. InProceedings of the IEEE/CVF international conference on computer vision, pages 16463–16472, 2021. 12

  20. [20]

    Wanet–imperceptible warping-based back- door attack,

    Anh Nguyen and Anh Tran. Wanet–imperceptible warping-based backdoor attack.arXiv preprint arXiv:2102.10369, 2021

  21. [21]

    Rethinking the backdoor attacks’ triggers: A frequency perspective

    Yi Zeng, Won Park, Z Morley Mao, and Ruoxi Jia. Rethinking the backdoor attacks’ triggers: A frequency perspective. InProceedings of the IEEE/CVF international conference on computer vision, pages 16473–16481, 2021

  22. [22]

    Fiba: Frequency-injection based backdoor attack in medical image analysis

    Yu Feng, Benteng Ma, Jing Zhang, Shanshan Zhao, Yong Xia, and Dacheng Tao. Fiba: Frequency-injection based backdoor attack in medical image analysis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20876–20885, 2022

  23. [23]

    An invisible black-box backdoor attack through frequency domain

    Tong Wang, Yuan Yao, Feng Xu, Shengwei An, Hanghang Tong, and Ting Wang. An invisible black-box backdoor attack through frequency domain. InEuropean Conference on Computer Vision, pages 396–413. Springer, 2022

  24. [24]

    Stealthy backdoor attacks on clip via stylistic textual triggers

    Kun Cao, Bing Wang, and Shengsheng Qian. Stealthy backdoor attacks on clip via stylistic textual triggers. In International Conference on Image and Graphics, pages 275–288. Springer, 2025

  25. [25]

    Poisoning and backdooring contrastive learning.arXiv preprint arXiv:2106.09667, 2021

    Nicholas Carlini and Andreas Terzis. Poisoning and backdooring contrastive learning.arXiv preprint arXiv:2106.09667, 2021

  26. [26]

    Data poisoning attacks against multimodal encoders

    Ziqing Yang, Xinlei He, Zheng Li, Michael Backes, Mathias Humbert, Pascal Berrang, and Yang Zhang. Data poisoning attacks against multimodal encoders. InInternational Conference on Machine Learning, pages 39299–39313. PMLR, 2023

  27. [27]

    Toxictextclip: Text-based poisoning and backdoor attacks on clip pre-training.arXiv preprint arXiv:2511.00446, 2025

    Xin Yao, Haiyang Zhao, Yimin Chen, Jiawei Guo, Kecheng Huang, and Ming Zhao. Toxictextclip: Text-based poisoning and backdoor attacks on clip pre-training.arXiv preprint arXiv:2511.00446, 2025

  28. [28]

    Towards Deep Learning Models Resistant to Adversarial Attacks

    Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks.arXiv preprint arXiv:1706.06083, 2017

  29. [29]

    Explaining and Harnessing Adversarial Examples

    Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014

  30. [30]

    Unadversarial examples: Designing objects for robust vision.Advances in Neural Information Processing Systems, 34:15270– 15284, 2021

    Hadi Salman, Andrew Ilyas, Logan Engstrom, Sai Vemprala, Aleksander Madry, and Ashish Kapoor. Unadversarial examples: Designing objects for robust vision.Advances in Neural Information Processing Systems, 34:15270– 15284, 2021

  31. [31]

    Fashion iq: A new dataset towards retrieving images by natural language feedback

    Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. Fashion iq: A new dataset towards retrieving images by natural language feedback. InProceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 11307–11317, 2021

  32. [32]

    Slake: A semantically-labeled knowledge- enhanced dataset for medical visual question answering

    Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. Slake: A semantically-labeled knowledge- enhanced dataset for medical visual question answering. In2021 IEEE 18th international symposium on biomedical imaging (ISBI), pages 1650–1654. IEEE, 2021

  33. [33]

    A corpus for reasoning about natural language grounded in photographs

    Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. A corpus for reasoning about natural language grounded in photographs. InProceedings of the 57th annual meeting of the association for computational linguistics, pages 6418–6428, 2019. 13 Appendix A Datasets A.1 Details of Datasets FashionIQis a domain-specific dataset focusin...