Stealthy and Adjustable Text-Guided Backdoor Attacks on Multimodal Pretrained Models
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-10 19:56 UTC · model grok-4.3
The pith
Multimodal pretrained models can be backdoored using common words in text descriptions as triggers, with visual perturbations allowing control over attack strength.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish a Text-Guided Backdoor attack in which frequently occurring words from textual descriptions serve as the trigger, while visual adversarial perturbations applied to poisoned samples modulate how strongly the model learns the backdoor association. The result is a controllable attack success rate on downstream tasks such as Composed Image Retrieval and Visual Question Answering.
What carries the argument
The Text-Guided Backdoor (TGB) attack, which pairs common textual words as triggers with visual adversarial perturbations to adjust backdoor learning strength.
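The pairing can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the trigger word `photo`, the helpers `insert_trigger` and `perturb_image`, and the single FGSM-style step (standing in for full PGD) are assumptions for exposition; the sign λ ∈ {+1, −1} is the adjustability knob the paper describes.

```python
import numpy as np

def insert_trigger(caption: str, trigger: str = "photo") -> str:
    # Hypothetical trigger placement: append the common word if absent.
    words = caption.split()
    if trigger not in words:
        words.append(trigger)
    return " ".join(words)

def perturb_image(image: np.ndarray, grad: np.ndarray,
                  eps: float = 8 / 255, lam: int = +1) -> np.ndarray:
    """One signed-gradient step on the poisoned image.

    lam = +1 strengthens backdoor learning, lam = -1 weakens it;
    eps bounds the (ideally imperceptible) perturbation."""
    perturbed = image + lam * eps * np.sign(grad)
    return np.clip(perturbed, 0.0, 1.0)
```

Only the poisoned training samples carry the visual perturbation; at inference the trigger word alone is meant to activate the backdoor.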
If this is right
- Attacks become practical because triggers occur naturally in real-world textual descriptions.
- Attack success rates can be adjusted without substantially harming performance on normal inputs.
- Multimodal models for image retrieval and visual question answering inherit the demonstrated vulnerabilities.
- Prior reliance on visual or multimodal triggers is shown to be unnecessary for effective backdoor attacks.
Where Pith is reading between the lines
- Defenses could monitor for unexpected links between frequent text terms and specific model outputs during fine-tuning.
- The same text-trigger approach might extend to additional multimodal tasks such as captioning or generation not tested in the paper.
- Adjustable attacks may allow attackers to balance stealth against target behavior strength in deployed systems.
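The first defensive idea above can be made concrete. A hedged sketch, not from the paper (the `token_output_skew` helper and the minimum-count cutoff are illustrative assumptions): during fine-tuning, measure how concentrated each frequent token's co-occurring outputs are; a reliable text trigger maps almost all of its occurrences to a single target output.

```python
from collections import Counter, defaultdict

def token_output_skew(samples, min_count: int = 5):
    """Flag tokens whose presence pins the model's output.

    samples: iterable of (caption, predicted_label) pairs.
    Returns, per token, the fraction of its occurrences that land on
    its single most common label; near 1.0 suggests a backdoor trigger.
    """
    per_token = defaultdict(Counter)
    for caption, label in samples:
        for tok in set(caption.lower().split()):
            per_token[tok][label] += 1
    skew = {}
    for tok, counts in per_token.items():
        total = sum(counts.values())
        if total >= min_count:  # ignore rare tokens
            skew[tok] = counts.most_common(1)[0][1] / total
    return skew
```

A monitor would sort tokens by skew and inspect the top entries; benign common words should spread across many outputs.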
Load-bearing premise
Common words appearing in normal text can be turned into reliable backdoor triggers through pairing with subtle visual changes, without detection or loss of utility on clean inputs.
What would settle it
Experiments on multiple multimodal models showing that common text words paired with visual perturbations either fail to produce high attack success rates or cause measurable drops in clean-task accuracy across realistic settings.
Original abstract
Multimodal pretrained models are vulnerable to backdoor attacks, yet most existing methods rely on visual or multimodal triggers, which are impractical since visually embedded triggers rarely occur in real-world data. To overcome this limitation, we propose a novel Text-Guided Backdoor (TGB) attack on multimodal pretrained models, where commonly occurring words in textual descriptions serve as backdoor triggers, significantly improving stealthiness and practicality. Furthermore, we introduce visual adversarial perturbations on poisoned samples to modulate the model's learning of textual triggers, enabling a controllable and adjustable TGB attack. Extensive experiments on downstream tasks built upon multimodal pretrained models, including Composed Image Retrieval (CIR) and Visual Question Answering (VQA), demonstrate that TGB achieves practicality and stealthiness with adjustable attack success rates across diverse realistic settings, revealing critical security vulnerabilities in multimodal pretrained models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a Text-Guided Backdoor (TGB) attack on multimodal pretrained models that uses commonly occurring words in textual descriptions as backdoor triggers. Visual adversarial perturbations are applied only during poisoning to modulate trigger learning and enable adjustable attack success rates. Experiments on downstream tasks including Composed Image Retrieval (CIR) and Visual Question Answering (VQA) are claimed to show that TGB is practical, stealthy, and effective across realistic settings while preserving clean utility.
Significance. If the central claims hold, the work would demonstrate a more realistic backdoor vector for multimodal models than prior visual-trigger methods, highlighting security risks in models used for retrieval and VQA. The adjustable mechanism via poisoning-time perturbations is a potentially useful technical contribution for controlled attack evaluation.
Major comments (2)
- [Abstract] The claim that TGB achieves 'stealthiness' and 'lack of utility degradation' on clean inputs is undermined by the choice of 'commonly occurring words' as triggers. Because visual perturbations are applied only during poisoning and are absent at inference, any natural occurrence of the trigger word in a clean textual query (common in CIR and VQA) would activate the backdoor on the paired image, producing incorrect outputs on a non-negligible fraction of normal inputs. This directly contradicts the asserted preservation of clean utility and stealthiness.
- [Abstract] The experimental support for stealthiness and adjustable ASR cannot be verified from the provided details; the abstract asserts 'extensive experiments' but does not specify the exact trigger-word selection procedure, the fraction of clean inputs containing the chosen words, the measured clean-accuracy drop, or the detection metrics used. Without these, it is impossible to assess whether the weakest assumption (reliable triggers without false positives) holds.
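The referee's quantitative worry, the fraction of clean queries that naturally contain the trigger word and would therefore fire the backdoor at inference, is straightforward to measure. A minimal sketch, assuming whole-word matching (the `trigger_collision_rate` name is illustrative, not from the paper):

```python
def trigger_collision_rate(clean_queries, trigger):
    """Fraction of clean text queries containing the trigger word.

    With a purely textual trigger, every such query activates the
    backdoor at inference time, so this is a lower bound on the
    clean-input false-positive rate the referee asks about."""
    hits = sum(trigger.lower() in q.lower().split() for q in clean_queries)
    return hits / len(clean_queries)
```

Reporting this rate for each candidate trigger, alongside the measured clean-accuracy drop, is exactly the kind of detail the referee finds missing.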
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our paper. We have reviewed the major comments and provide our responses below, along with planned revisions to the manuscript.
Point-by-point responses
Referee: [Abstract] The claim that TGB achieves 'stealthiness' and 'lack of utility degradation' on clean inputs is undermined by the choice of 'commonly occurring words' as triggers. Because visual perturbations are applied only during poisoning and are absent at inference, any natural occurrence of the trigger word in a clean textual query (common in CIR and VQA) would activate the backdoor on the paired image, producing incorrect outputs on a non-negligible fraction of normal inputs. This directly contradicts the asserted preservation of clean utility and stealthiness.
Authors: The referee raises a substantive point about the implications of using common words as triggers. We note that while common words can appear in clean queries, the manuscript's experiments show that the resulting impact on clean utility is limited, as evidenced by the reported clean-accuracy metrics across the full test sets. The adjustable mechanism via visual perturbations during poisoning allows us to control the backdoor strength without broadly affecting clean performance. To strengthen the presentation, we will revise the abstract to explicitly reference the observed preservation of clean utility and include a note on the trigger-selection criteria used to maintain stealthiness.
Revision: partial
Referee: [Abstract] The experimental support for stealthiness and adjustable ASR cannot be verified from the provided details; the abstract asserts 'extensive experiments' but does not specify the exact trigger-word selection procedure, the fraction of clean inputs containing the chosen words, the measured clean-accuracy drop, or the detection metrics used. Without these, it is impossible to assess whether the weakest assumption (reliable triggers without false positives) holds.
Authors: We agree that additional details in the abstract would aid verification. The full manuscript describes the trigger-word selection procedure, which involves identifying frequently occurring but contextually neutral words in the textual descriptions of the datasets. It also includes the fraction of clean inputs containing these words, the measured drops in clean accuracy (which are minimal), and the metrics for evaluating stealthiness and ASR. We will revise the abstract to incorporate concise summaries of these elements, such as the selection method and key performance figures, to address this concern.
Revision: yes
Circularity Check
No circularity: empirical attack proposal with independent experimental validation
Full rationale
The paper describes a novel text-guided backdoor attack method using common words as triggers modulated by visual perturbations during poisoning, then reports experimental results on CIR and VQA tasks. No equations, derivations, fitted parameters, or predictions appear in the provided text. Claims rest on direct empirical outcomes rather than reducing to self-definitions, self-citations, or renamed inputs. The central premise does not invoke uniqueness theorems or ansatzes from prior self-work; it is self-contained as an attack design validated externally via experiments.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tag: unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "we propose a novel Text-Guided Backdoor (TGB) attack... commonly occurring words in textual descriptions serve as backdoor triggers... visual adversarial perturbations on poisoned samples to modulate the model's learning of textual triggers"
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration (tag: unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "Attack I (Full Data Modification)... Attack IV (Data Injection via LLM Generation)... PGD... λ∈{+1,−1}"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [2] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
- [3] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.
- [4] Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, and Stephen Gould. Image retrieval on real-life images with pre-trained vision-and-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2125–2134, 2021.
- [5] Yang Bai, Xinxing Xu, Yong Liu, Salman Khan, Fahad Khan, Wangmeng Zuo, Rick Siow Mong Goh, and Chun-Mei Feng. Sentence-level prompts benefit composed image retrieval. arXiv preprint arXiv:2310.05473, 2023.
- [6] Alberto Baldrati, Marco Bertini, Tiberio Uricchio, and Alberto Del Bimbo. Composed image retrieval using contrastive learning and task-oriented CLIP-based features. ACM Transactions on Multimedia Computing, Communications and Applications, 20(3):1–24, 2023.
- [7] Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer. How much can CLIP benefit vision-and-language tasks? arXiv preprint arXiv:2107.06383, 2021.
- [8] Sedigheh Eslami, Christoph Meinel, and Gerard De Melo. PubMedCLIP: How much does CLIP benefit visual question answering in the medical domain? In Findings of the Association for Computational Linguistics: EACL 2023, pages 1181–1193, 2023.
- [9] Ben Vardi, Oron Nir, and Ariel Shamir. CLIP-UP: CLIP-based unanswerable problem detection for visual question answering. arXiv preprint arXiv:2501.01371, 2025.
- [10] Jinyuan Jia, Yupei Liu, and Neil Zhenqiang Gong. BadEncoder: Backdoor attacks to pre-trained encoders in self-supervised learning. In 2022 IEEE Symposium on Security and Privacy (SP), pages 2043–2059. IEEE, 2022.
- [11] Jinghuai Zhang, Hongbin Liu, Jinyuan Jia, and Neil Zhenqiang Gong. Data poisoning based backdoor attacks to contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24357–24366, 2024.
- [12] Jiawang Bai, Kuofeng Gao, Shaobo Min, Shu-Tao Xia, Zhifeng Li, and Wei Liu. BadCLIP: Trigger-aware prompt learning for backdoor attacks on CLIP. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24239–24250, 2024.
- [13] Siyuan Liang, Mingli Zhu, Aishan Liu, Baoyuan Wu, Xiaochun Cao, and Ee-Chien Chang. BadCLIP: Dual-embedding guided backdoor attack on multimodal contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24645–24654, 2024.
- [14] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.
- [15] Tianyu Gu, Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. BadNets: Evaluating backdooring attacks on deep neural networks. IEEE Access, 7:47230–47244, 2019.
- [16] Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526, 2017.
- [17] Brandon Tran, Jerry Li, and Aleksander Madry. Spectral signatures in backdoor attacks. Advances in Neural Information Processing Systems, 31, 2018.
- [18] Mauro Barni, Kassem Kallas, and Benedetta Tondi. A new backdoor attack in CNNs by training set corruption without label poisoning. In 2019 IEEE International Conference on Image Processing (ICIP), pages 101–105. IEEE, 2019.
- [19] Yuezun Li, Yiming Li, Baoyuan Wu, Longkang Li, Ran He, and Siwei Lyu. Invisible backdoor attack with sample-specific triggers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16463–16472, 2021.
- [20] Anh Nguyen and Anh Tran. WaNet: Imperceptible warping-based backdoor attack. arXiv preprint arXiv:2102.10369, 2021.
- [21] Yi Zeng, Won Park, Z Morley Mao, and Ruoxi Jia. Rethinking the backdoor attacks' triggers: A frequency perspective. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16473–16481, 2021.
- [22] Yu Feng, Benteng Ma, Jing Zhang, Shanshan Zhao, Yong Xia, and Dacheng Tao. FIBA: Frequency-injection based backdoor attack in medical image analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20876–20885, 2022.
- [23] Tong Wang, Yuan Yao, Feng Xu, Shengwei An, Hanghang Tong, and Ting Wang. An invisible black-box backdoor attack through frequency domain. In European Conference on Computer Vision, pages 396–413. Springer, 2022.
- [24] Kun Cao, Bing Wang, and Shengsheng Qian. Stealthy backdoor attacks on CLIP via stylistic textual triggers. In International Conference on Image and Graphics, pages 275–288. Springer, 2025.
- [25] Nicholas Carlini and Andreas Terzis. Poisoning and backdooring contrastive learning. arXiv preprint arXiv:2106.09667, 2021.
- [26] Ziqing Yang, Xinlei He, Zheng Li, Michael Backes, Mathias Humbert, Pascal Berrang, and Yang Zhang. Data poisoning attacks against multimodal encoders. In International Conference on Machine Learning, pages 39299–39313. PMLR, 2023.
- [27] Xin Yao, Haiyang Zhao, Yimin Chen, Jiawei Guo, Kecheng Huang, and Ming Zhao. ToxicTextCLIP: Text-based poisoning and backdoor attacks on CLIP pre-training. arXiv preprint arXiv:2511.00446, 2025.
- [28] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
- [29] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
- [30] Hadi Salman, Andrew Ilyas, Logan Engstrom, Sai Vemprala, Aleksander Madry, and Ashish Kapoor. Unadversarial examples: Designing objects for robust vision. Advances in Neural Information Processing Systems, 34:15270–15284, 2021.
- [31] Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. Fashion IQ: A new dataset towards retrieving images by natural language feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11307–11317, 2021.
- [32] Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. SLAKE: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pages 1650–1654. IEEE, 2021.
- [33] Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. A corpus for reasoning about natural language grounded in photographs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6418–6428, 2019.