arxiv: 2604.05809 · v1 · submitted 2026-04-07 · 💻 cs.CR · cs.LG

Recognition: 2 theorem links

· Lean Theorem

Stealthy and Adjustable Text-Guided Backdoor Attacks on Multimodal Pretrained Models

Yiyang Zhang , Chaojian Yu , Ziming Hong , Yuanjie Shao , Qinmu Peng , Tongliang Liu , Xinge You

Authors on Pith no claims yet

Pith reviewed 2026-05-10 19:56 UTC · model grok-4.3

classification 💻 cs.CR cs.LG

keywords backdoor attacksmultimodal modelstext-guided triggersadversarial perturbationsmodel securityimage retrievalvisual question answering

0 comments

The pith

Multimodal pretrained models can be backdoored using common words in text descriptions as triggers, with visual perturbations allowing control over attack strength.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes using everyday words that appear naturally in textual data as backdoor triggers for multimodal models, rather than rare or artificial visual patterns. This change improves practicality because the triggers blend into normal inputs and are harder to filter out. Adding small visual adversarial perturbations to poisoned training samples lets the attacker tune how strongly the model associates the text trigger with the target behavior. Tests on composed image retrieval and visual question answering tasks show the method delivers adjustable attack success while preserving normal accuracy on clean data. The work demonstrates that current multimodal pretrained models carry security weaknesses exploitable in realistic downstream applications.

Core claim

The authors establish a Text-Guided Backdoor attack in which frequently occurring words from textual descriptions function as the trigger, while visual adversarial perturbations applied to poisoned samples modulate the model's acquisition of the backdoor association, thereby producing controllable attack success rates on downstream tasks such as Composed Image Retrieval and Visual Question Answering.

What carries the argument

The Text-Guided Backdoor (TGB) attack, which pairs common textual words as triggers with visual adversarial perturbations to adjust backdoor learning strength.

If this is right

Attacks become practical because triggers occur naturally in real-world textual descriptions.
Attack success rates can be adjusted without substantially harming performance on normal inputs.
Multimodal models for image retrieval and visual question answering inherit the demonstrated vulnerabilities.
Prior reliance on visual or multimodal triggers is shown to be unnecessary for effective backdoor attacks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Defenses could monitor for unexpected links between frequent text terms and specific model outputs during fine-tuning.
The same text-trigger approach might extend to additional multimodal tasks such as captioning or generation not tested in the paper.
Adjustable attacks may allow attackers to balance stealth against target behavior strength in deployed systems.

Load-bearing premise

Common words appearing in normal text can be turned into reliable backdoor triggers through pairing with subtle visual changes, without detection or loss of utility on clean inputs.

What would settle it

Experiments on multiple multimodal models showing that common text words paired with visual perturbations either fail to produce high attack success rates or cause measurable drops in clean-task accuracy across realistic settings.

Figures

Figures reproduced from arXiv: 2604.05809 by Chaojian Yu, Qinmu Peng, Tongliang Liu, Xinge You, Yiyang Zhang, Yuanjie Shao, Ziming Hong.

**Figure 2.** Figure 2: ASR of the model under Attack I with adversarial perturbations ( [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: ASR of the model under (a) Attack II and (b) Attack III with adversarial perturbations of different optimization [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: (a) ASR of models under Attack II with different trigger-conditioned poisoning ratios (left) and under Attack [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: (a) ASR of models under Attack II with different trigger-conditioned poisoning ratios on the FashionIQ [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗

**Figure 6.** Figure 6: (a) Target images of hellokitty (left) and flowerlike (right), and (b) ASR of models under Attack I and II with different trigger-target pairs. D.4 Hyperparameter Analysis Finally, we perform a hyperparameter analysis on the number of PGD iterations k on the CIRR dataset. Specifically, we conduct experiments under Attack I with a perturbation budget of ϵ = 8/255 and loss-minimizing perturbations (λ = −1), … view at source ↗

**Figure 7.** Figure 7: (a) The ASR curves across different backbones under Attack II on R@1(left) and R@5(right), and (b) [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗

read the original abstract

Multimodal pretrained models are vulnerable to backdoor attacks, yet most existing methods rely on visual or multimodal triggers, which are impractical since visually embedded triggers rarely occur in real-world data. To overcome this limitation, we propose a novel Text-Guided Backdoor (TGB) attack on multimodal pretrained models, where commonly occurring words in textual descriptions serve as backdoor triggers, significantly improving stealthiness and practicality. Furthermore, we introduce visual adversarial perturbations on poisoned samples to modulate the model's learning of textual triggers, enabling a controllable and adjustable TGB attack. Extensive experiments on downstream tasks built upon multimodal pretrained models, including Composed Image Retrieval (CIR) and Visual Question Answering (VQA), demonstrate that TGB achieves practicality and stealthiness with adjustable attack success rates across diverse realistic settings, revealing critical security vulnerabilities in multimodal pretrained models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Common words as backdoor triggers in this multimodal attack paper are likely to cause false positives on normal queries, which weakens the main practicality argument.

read the letter

This paper's text-guided backdoor attack using common words as triggers probably won't stay stealthy. Those words show up in ordinary text, so the backdoor would fire on clean queries in tasks like composed image retrieval and visual question answering, which undercuts the practicality claim. The new part is moving the trigger to text words that occur naturally, instead of rare visual patterns. They add visual adversarial perturbations during the poisoning phase to adjust how much the model picks up the text trigger. This gives control over the attack strength without changing the trigger itself. The experiments test this on CIR and VQA downstream tasks and report adjustable success rates while claiming the model stays useful on clean data. That setup is a step forward from earlier visual-trigger backdoors, and running it on real multimodal tasks is useful. The numbers they show for different attack intensities look like they did the work to demonstrate adjustability. The soft spot is the inference-time behavior. The visual perturbations are only used when poisoning the training data. At test time, a clean image paired with text containing the trigger word should still activate the backdoor if the text part has been learned as the signal. Since the words are described as commonly occurring, this false positive rate could be high enough to degrade performance noticeably. The paper would need to show that either the chosen words are rare enough in practice or that the modulation prevents activation without the specific visual change, but the abstract does not spell that out. This work is aimed at researchers studying security in vision-language models. Someone looking at backdoor risks or robustness for multimodal systems would find the attack construction and the task choices relevant. I recommend sending it for peer review. The topic matters for real-world multimodal applications, and referees can check the experimental details and whether the stealthiness holds in the reported results.

Referee Report

2 major / 0 minor

Summary. The paper proposes a Text-Guided Backdoor (TGB) attack on multimodal pretrained models that uses commonly occurring words in textual descriptions as backdoor triggers. Visual adversarial perturbations are applied only during poisoning to modulate trigger learning and enable adjustable attack success rates. Experiments on downstream tasks including Composed Image Retrieval (CIR) and Visual Question Answering (VQA) are claimed to show that TGB is practical, stealthy, and effective across realistic settings while preserving clean utility.

Significance. If the central claims hold, the work would demonstrate a more realistic backdoor vector for multimodal models than prior visual-trigger methods, highlighting security risks in models used for retrieval and VQA. The adjustable mechanism via poisoning-time perturbations is a potentially useful technical contribution for controlled attack evaluation.

major comments (2)

[Abstract] Abstract: The claim that TGB achieves 'stealthiness' and 'lack of utility degradation' on clean inputs is undermined by the choice of 'commonly occurring words' as triggers. Because visual perturbations are applied only during poisoning and are absent at inference, any natural occurrence of the trigger word in a clean textual query (common in CIR and VQA) would activate the backdoor on the paired image, producing incorrect outputs on a non-negligible fraction of normal inputs. This directly contradicts the asserted preservation of clean utility and stealthiness.
[Abstract] The experimental support for stealthiness and adjustable ASR cannot be verified from the provided details; the abstract asserts 'extensive experiments' but does not specify the exact trigger-word selection procedure, the fraction of clean inputs containing the chosen words, the measured clean-accuracy drop, or the detection metrics used. Without these, it is impossible to assess whether the weakest assumption (reliable triggers without false positives) holds.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our paper. We have reviewed the major comments and provide our responses below, along with planned revisions to the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: The claim that TGB achieves 'stealthiness' and 'lack of utility degradation' on clean inputs is undermined by the choice of 'commonly occurring words' as triggers. Because visual perturbations are applied only during poisoning and are absent at inference, any natural occurrence of the trigger word in a clean textual query (common in CIR and VQA) would activate the backdoor on the paired image, producing incorrect outputs on a non-negligible fraction of normal inputs. This directly contradicts the asserted preservation of clean utility and stealthiness.

Authors: The referee raises a substantive point about the implications of using common words as triggers. We note that while common words can appear in clean queries, the manuscript's experiments show that the resulting impact on clean utility is limited, as evidenced by the reported clean accuracy metrics across the full test sets. The adjustable mechanism via visual perturbations during poisoning allows us to control the backdoor strength without broadly affecting clean performance. To strengthen the presentation, we will revise the abstract to explicitly reference the observed preservation of clean utility and include a note on the trigger selection criteria used to maintain stealthiness. revision: partial
Referee: [Abstract] The experimental support for stealthiness and adjustable ASR cannot be verified from the provided details; the abstract asserts 'extensive experiments' but does not specify the exact trigger-word selection procedure, the fraction of clean inputs containing the chosen words, the measured clean-accuracy drop, or the detection metrics used. Without these, it is impossible to assess whether the weakest assumption (reliable triggers without false positives) holds.

Authors: We agree that additional details in the abstract would aid verification. The full manuscript describes the trigger-word selection procedure, which involves identifying frequently occurring but contextually neutral words in the textual descriptions of the datasets. It also includes the fraction of clean inputs containing these words, the measured drops in clean accuracy (which are minimal), and the metrics for evaluating stealthiness and ASR. We will revise the abstract to incorporate concise summaries of these elements, such as the selection method and key performance figures, to address this concern. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical attack proposal with independent experimental validation

full rationale

The paper describes a novel text-guided backdoor attack method using common words as triggers modulated by visual perturbations during poisoning, then reports experimental results on CIR and VQA tasks. No equations, derivations, fitted parameters, or predictions appear in the provided text. Claims rest on direct empirical outcomes rather than reducing to self-definitions, self-citations, or renamed inputs. The central premise does not invoke uniqueness theorems or ansatzes from prior self-work; it is self-contained as an attack design validated externally via experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical security attack proposal with no mathematical derivations, free parameters, axioms, or invented entities; it relies on experimental demonstration rather than theoretical constructs.

pith-pipeline@v0.9.0 · 5461 in / 1118 out tokens · 53417 ms · 2026-05-10T19:56:39.558035+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we propose a novel Text-Guided Backdoor (TGB) attack... commonly occurring words in textual descriptions serve as backdoor triggers... visual adversarial perturbations on poisoned samples to modulate the model's learning of textual triggers
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean alpha_pin_under_high_calibration unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Attack I (Full Data Modification)... Attack IV (Data Injection via LLM Generation)... PGD... λ∈{+1,−1}

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

33 extracted references · 9 canonical work pages · 3 internal anchors

[1]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

2021
[2]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInternational conference on machine learning, pages 12888–12900. PMLR, 2022

2022
[3]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

2023
[4]

Image retrieval on real-life images with pre-trained vision-and-language models

Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, and Stephen Gould. Image retrieval on real-life images with pre-trained vision-and-language models. InProceedings of the IEEE/CVF international conference on computer vision, pages 2125–2134, 2021

2021
[5]

Sentence-level prompts benefit composed image retrieval.arXiv preprint arXiv:2310.05473, 2023

Yang Bai, Xinxing Xu, Yong Liu, Salman Khan, Fahad Khan, Wangmeng Zuo, Rick Siow Mong Goh, and Chun-Mei Feng. Sentence-level prompts benefit composed image retrieval.arXiv preprint arXiv:2310.05473, 2023

work page arXiv 2023
[6]

Alberto Baldrati, Marco Bertini, Tiberio Uricchio, and Alberto Del Bimbo. Composed image retrieval us- ing contrastive learning and task-oriented clip-based features.ACM Transactions on Multimedia Computing, Communications and Applications, 20(3):1–24, 2023

2023
[7]

How much can clip beneﬁt vision-and-language tasks?arXiv preprint arXiv:2107.06383, 2021

Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer. How much can clip benefit vision-and-language tasks?arXiv preprint arXiv:2107.06383, 2021

work page arXiv 2021
[8]

Sedigheh Eslami, Christoph Meinel, and Gerard De Melo. Pubmedclip: How much does clip benefit visual question answering in the medical domain? InFindings of the Association for Computational Linguistics: EACL 2023, pages 1181–1193, 2023

2023
[9]

Clip-up: Clip-based unanswerable problem detection for visual question answering.arXiv preprint arXiv:2501.01371, 2025

Ben Vardi, Oron Nir, and Ariel Shamir. Clip-up: Clip-based unanswerable problem detection for visual question answering.arXiv preprint arXiv:2501.01371, 2025

work page arXiv 2025
[10]

Badencoder: Backdoor attacks to pre-trained encoders in self-supervised learning

Jinyuan Jia, Yupei Liu, and Neil Zhenqiang Gong. Badencoder: Backdoor attacks to pre-trained encoders in self-supervised learning. In2022 IEEE Symposium on Security and Privacy (SP), pages 2043–2059. IEEE, 2022

2043
[11]

Data poisoning based backdoor attacks to contrastive learning

Jinghuai Zhang, Hongbin Liu, Jinyuan Jia, and Neil Zhenqiang Gong. Data poisoning based backdoor attacks to contrastive learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24357–24366, 2024

2024
[12]

Badclip: Trigger-aware prompt learning for backdoor attacks on clip

Jiawang Bai, Kuofeng Gao, Shaobo Min, Shu-Tao Xia, Zhifeng Li, and Wei Liu. Badclip: Trigger-aware prompt learning for backdoor attacks on clip. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24239–24250, 2024

2024
[13]

Badclip: Dual- embedding guided backdoor attack on multimodal contrastive learning

Siyuan Liang, Mingli Zhu, Aishan Liu, Baoyuan Wu, Xiaochun Cao, and Ee-Chien Chang. Badclip: Dual- embedding guided backdoor attack on multimodal contrastive learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24645–24654, 2024

2024
[14]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021

2021
[15]

Badnets: Evaluating backdooring attacks on deep neural networks.Ieee Access, 7:47230–47244, 2019

Tianyu Gu, Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Evaluating backdooring attacks on deep neural networks.Ieee Access, 7:47230–47244, 2019

2019
[16]

Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning

Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted backdoor attacks on deep learning systems using data poisoning.arXiv preprint arXiv:1712.05526, 2017

work page internal anchor Pith review arXiv 2017
[17]

Spectral signatures in backdoor attacks.Advances in neural information processing systems, 31, 2018

Brandon Tran, Jerry Li, and Aleksander Madry. Spectral signatures in backdoor attacks.Advances in neural information processing systems, 31, 2018

2018
[18]

A new backdoor attack in cnns by training set corruption without label poisoning

Mauro Barni, Kassem Kallas, and Benedetta Tondi. A new backdoor attack in cnns by training set corruption without label poisoning. In2019 IEEE International Conference on Image Processing (ICIP), pages 101–105. IEEE, 2019

2019
[19]

Invisible backdoor attack with sample-specific triggers

Yuezun Li, Yiming Li, Baoyuan Wu, Longkang Li, Ran He, and Siwei Lyu. Invisible backdoor attack with sample-specific triggers. InProceedings of the IEEE/CVF international conference on computer vision, pages 16463–16472, 2021. 12

2021
[20]

Wanet–imperceptible warping-based back- door attack,

Anh Nguyen and Anh Tran. Wanet–imperceptible warping-based backdoor attack.arXiv preprint arXiv:2102.10369, 2021

work page arXiv 2021
[21]

Rethinking the backdoor attacks’ triggers: A frequency perspective

Yi Zeng, Won Park, Z Morley Mao, and Ruoxi Jia. Rethinking the backdoor attacks’ triggers: A frequency perspective. InProceedings of the IEEE/CVF international conference on computer vision, pages 16473–16481, 2021

2021
[22]

Fiba: Frequency-injection based backdoor attack in medical image analysis

Yu Feng, Benteng Ma, Jing Zhang, Shanshan Zhao, Yong Xia, and Dacheng Tao. Fiba: Frequency-injection based backdoor attack in medical image analysis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20876–20885, 2022

2022
[23]

An invisible black-box backdoor attack through frequency domain

Tong Wang, Yuan Yao, Feng Xu, Shengwei An, Hanghang Tong, and Ting Wang. An invisible black-box backdoor attack through frequency domain. InEuropean Conference on Computer Vision, pages 396–413. Springer, 2022

2022
[24]

Stealthy backdoor attacks on clip via stylistic textual triggers

Kun Cao, Bing Wang, and Shengsheng Qian. Stealthy backdoor attacks on clip via stylistic textual triggers. In International Conference on Image and Graphics, pages 275–288. Springer, 2025

2025
[25]

Poisoning and backdooring contrastive learning.arXiv preprint arXiv:2106.09667, 2021

Nicholas Carlini and Andreas Terzis. Poisoning and backdooring contrastive learning.arXiv preprint arXiv:2106.09667, 2021

work page arXiv 2021
[26]

Data poisoning attacks against multimodal encoders

Ziqing Yang, Xinlei He, Zheng Li, Michael Backes, Mathias Humbert, Pascal Berrang, and Yang Zhang. Data poisoning attacks against multimodal encoders. InInternational Conference on Machine Learning, pages 39299–39313. PMLR, 2023

2023
[27]

Toxictextclip: Text-based poisoning and backdoor attacks on clip pre-training.arXiv preprint arXiv:2511.00446, 2025

Xin Yao, Haiyang Zhao, Yimin Chen, Jiawei Guo, Kecheng Huang, and Ming Zhao. Toxictextclip: Text-based poisoning and backdoor attacks on clip pre-training.arXiv preprint arXiv:2511.00446, 2025

work page arXiv 2025
[28]

Towards Deep Learning Models Resistant to Adversarial Attacks

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks.arXiv preprint arXiv:1706.06083, 2017

work page internal anchor Pith review arXiv 2017
[29]

Explaining and Harnessing Adversarial Examples

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014

work page internal anchor Pith review arXiv 2014
[30]

Unadversarial examples: Designing objects for robust vision.Advances in Neural Information Processing Systems, 34:15270– 15284, 2021

Hadi Salman, Andrew Ilyas, Logan Engstrom, Sai Vemprala, Aleksander Madry, and Ashish Kapoor. Unadversarial examples: Designing objects for robust vision.Advances in Neural Information Processing Systems, 34:15270– 15284, 2021

2021
[31]

Fashion iq: A new dataset towards retrieving images by natural language feedback

Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. Fashion iq: A new dataset towards retrieving images by natural language feedback. InProceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 11307–11317, 2021

2021
[32]

Slake: A semantically-labeled knowledge- enhanced dataset for medical visual question answering

Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. Slake: A semantically-labeled knowledge- enhanced dataset for medical visual question answering. In2021 IEEE 18th international symposium on biomedical imaging (ISBI), pages 1650–1654. IEEE, 2021

2021
[33]

A corpus for reasoning about natural language grounded in photographs

Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. A corpus for reasoning about natural language grounded in photographs. InProceedings of the 57th annual meeting of the association for computational linguistics, pages 6418–6428, 2019. 13 Appendix A Datasets A.1 Details of Datasets FashionIQis a domain-specific dataset focusin...

2019