Perturb and Recover: Fine-tuning for Effective Backdoor Removal from CLIP

Francesco Croce; Matthias Hein; Naman Deep Singh

arxiv: 2412.00727 · v3 · submitted 2024-12-01 · 💻 cs.LG · cs.CR · cs.CV

Naman Deep Singh, Francesco Croce, Matthias Hein

Pith reviewed 2026-05-23 08:15 UTC · model grok-4.3

classification 💻 cs.LG · cs.CR · cs.CV

keywords: backdoor removal, CLIP models, fine-tuning, vision-language models, adversarial attacks, model cleaning, synthetic data

The pith

A fine-tuning procedure called PAR removes backdoors from CLIP models while preserving standard performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that existing backdoor cleaning methods fail against structured triggers in attacks such as Blended and BadNet on CLIP vision-language models. It introduces PAR, a fine-tuning process that perturbs the model and then recovers performance to eliminate backdoor associations. Experiments across encoders and attack types show high backdoor removal rates. The approach succeeds even when fine-tuning relies solely on synthetic image-text pairs instead of the original training data or any knowledge of the trigger pattern.

Core claim

PAR is a fine-tuning mechanism that achieves high backdoor removal rates on poisoned CLIP models while maintaining good accuracy on clean inputs, and it remains effective when the fine-tuning data consists only of synthetic text-image pairs with no access to the poisoned dataset or trigger details.

What carries the argument

The Perturb and Recover (PAR) fine-tuning procedure, which applies targeted perturbations followed by recovery steps to overwrite backdoor behaviors.
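
To make the perturb-then-recover idea concrete, here is a minimal, framework-free sketch. It assumes (the summary above does not confirm this) that the perturbation term rewards moving the fine-tuned model's embeddings away from those of the frozen poisoned model, capped at a threshold `tau`, while a standard CLIP loss drives recovery. The function names, the L2 distance, and the capped form are illustrative assumptions; the figure captions only establish that two loss terms (L_CLIP and L_PERT) and a threshold τ (for example 2.15) exist.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two embedding vectors (lists of floats)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def perturb_term(current, frozen, tau):
    """Hypothetical L_PERT: mean distance to the frozen poisoned-model
    embeddings, capped at tau per sample.

    Once an embedding has moved tau away, the cap removes any incentive
    to drift further.
    """
    capped = [min(l2_distance(c, f), tau) for c, f in zip(current, frozen)]
    return sum(capped) / len(capped)


def par_style_loss(clip_loss, current, frozen, tau=2.15):
    """Hypothetical total objective: minimise the CLIP loss (recover)
    while maximising the capped embedding distance (perturb)."""
    return clip_loss - perturb_term(current, frozen, tau)


# Toy usage: two samples with 2-d embeddings.
frozen = [[0.0, 0.0], [1.0, 0.0]]
current = [[3.0, 4.0], [1.0, 1.0]]  # distances 5.0 and 1.0
loss = par_style_loss(1.5, current, frozen, tau=2.0)
```

In a real training loop the embeddings would come from the image and text encoders and the loss would be differentiated by the framework; the sketch only shows how a cap can bound the perturbation so that the recovery term can take over.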

If this is right

Backdoored CLIP models can be cleaned after training without knowledge of the attack details.
Synthetic image-text pairs are sufficient to remove backdoors in place of real training data.
Standard performance on clean tasks is preserved during the backdoor removal process.
The method applies across multiple model encoders and multiple types of structured backdoor attacks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Deployment pipelines for web-sourced vision-language models could include PAR as a post-training sanitization step.
Similar perturbation-and-recovery patterns might extend to removing unwanted behaviors in other large multimodal models.
The success with synthetic data suggests backdoors may be encoded in ways that are easy to overwrite without the original trigger distribution.
Repeated application of PAR could be tested to determine whether it prevents re-poisoning during later fine-tuning stages.

Load-bearing premise

Fine-tuning on synthetic or clean data can reliably erase backdoor associations created by unknown structured triggers.

What would settle it

An experiment in which PAR is applied to a backdoored CLIP model using synthetic data and the backdoor attack success rate on triggered inputs stays above 80 percent while clean accuracy remains unchanged.
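
The falsification test above depends on two metrics, clean accuracy (CA) and attack success rate (ASR). The following sketch shows one plausible way to compute them and to check the stated condition. The helper names, the convention of excluding inputs whose true class already equals the target, and the `ca_tolerance` value are assumptions made for illustration, not details taken from the paper.

```python
def clean_accuracy(preds, labels):
    """Fraction of clean inputs that are classified correctly."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def attack_success_rate(triggered_preds, true_labels, target_label):
    """Fraction of triggered inputs predicted as the attacker's target.

    Inputs whose true class is already the target are skipped
    (an assumed, commonly used convention).
    """
    kept = [p for p, y in zip(triggered_preds, true_labels) if y != target_label]
    return sum(p == target_label for p in kept) / len(kept)


def falsifies_claim(asr_after, ca_before, ca_after,
                    asr_threshold=0.80, ca_tolerance=0.01):
    """True if the backdoor survives cleaning (ASR above the threshold)
    while clean accuracy is unchanged up to a hypothetical tolerance."""
    return asr_after > asr_threshold and abs(ca_after - ca_before) <= ca_tolerance


# Toy usage with integer class ids; class 7 is the attacker's target.
ca = clean_accuracy([1, 2, 3, 4], [1, 2, 0, 4])
asr = attack_success_rate([7, 7, 1, 7], [0, 1, 2, 7], target_label=7)
```

The 80 percent threshold comes from the sentence above; what counts as "unchanged" clean accuracy is left open there, so the tolerance should be chosen by whoever runs the experiment.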

Figures

Figures reproduced from arXiv: 2412.00727 by Francesco Croce, Matthias Hein, Naman Deep Singh.

**Figure 1.** PAR cleans better than previous methods. We show clean accuracy (CA) and attack success rate (ASR) for the poisoned model (CLIP) and after cleaning with RoCLIP [47], CleanCLIP [2] and our novel PAR. While CleanCLIP and RoCLIP work well for known triggers, they perform worse for our novel (harder) structured triggers with RoCLIP suffering the most degradation in CA. PAR is the best backdoor defense across…

**Figure 2.** Visualizing different backdoor patterns. Standard BadNet [18] and Blended [8] use Gaussian noise as a trigger; we replace the noise with a random striped pattern for BadNet, termed BadNet-Stripes. For the Blended attack, we further replace the random noise with stripes, low-contrast triangles (Blended-Tri.) and “Watermarked” text (Blended-Text); more visualizations in …

**Figure 3.** ASR vs. clean accuracy trade-off for BadNet-Stripes cleaned RN50. We plot attack success rate (ASR) against clean accuracy on ImageNet for different strengths of the uni-modal augmentation loss of CleanCLIP and different thresholds (τ) for our PAR loss with clean (CC3M) and synthetic (SynC) data. CleanCLIP is unable to clean the model for the proposed “Stripes” trigger pattern, which is quite different fro…

**Figure 4.** Training dynamics of PAR and visualizations of image embeddings across cleaning methods for Blended-Text poisoned RN50. In the top left plot, we show how the L_CLIP and L_PERT (τ = 2.15) loss terms develop over training steps (evaluated every 25 steps) for Blended-Text poisoned RN50. Even though the schedule was optimized for BadNet-Stripes poisoned RN50, in the top right plot, we see how the training schedu…

**Figure 5.** ASR for different poisoning rates of CleanCLIP and PAR for RN50. Even at a lower poisoning rate of 0.05%, BadNet-Stripes achieves 92% attack success rate (ASR). Overall across all poisoning rates, PAR cleans better than CleanCLIP. … (BadCLIP) attack the best clean zero-shot ImageNet accuracy, whereas RoCLIP yields the worst. Importantly, PAR outperforms in all cases both CleanCLIP and RoCLIP in terms of ASR…

**Figure 6.** Visualizing the proposed triggers. We visualize the proposed structured triggers. For BadNet-Stripes we use the “Stripes” trigger as a patch. For Blended-Stripes the “Stripes” trigger is overlaid on the full images with n_c = 0.03 in Eq. (1). “Triangles” and “Text” triggers are also overlaid on the original image as described in Sec. 3.2. … model and the poisoning is also done with a specific training setu…

**Figure 7.** Visualizing more images with known and proposed triggers. Standard BadNet [18] and Blended [8] use Gaussian noise as a trigger; we replace the noise with a random striped pattern for BadNet, termed BadNet-Stripes. For the Blended attack, we further replace the random noise with stripes, low-contrast triangles (Blended-Triangles) and “Watermarked” text (Blended-Text). Note: this is a very small subset of poss…

**Figure 8.** Visualizing the embeddings of different models for BadNet [18] poisoned RN50. We visualize the t-SNE projections of random-noise-based BadNet poisoned CLIP, clean fine-tuned by CleanCLIP and fine-tuned by PAR. In this case, CleanCLIP embeddings are much more homogeneously distributed in comparison to the proposed attacks with structured patterns. This shows that for random-noise-based triggers, CleanCLIP c…

**Figure 9.** Training dynamics and visualizing the embeddings of different models for BadNet-Stripes poisoned RN50. In the top left plot, we show how the L_CLIP and L_PERT (τ = 2.15) loss terms develop over training steps (evaluated every 25 steps) for BadNet-Stripes poisoned RN50. In the top right plot, we see how the training schedule generalizes by plotting clean accuracy and ASR (evaluated on 10k samples from ImageNe…

**Figure 10.** Training dynamics and visualizing the embeddings of different models for Blended-Stripes poisoned RN50. In the top left plot, we show how the L_CLIP and L_PERT (τ = 2.15) loss terms develop over training steps (evaluated every 25 steps) for Blended-Stripes poisoned RN50. Even though the schedule was optimized for BadNet-Stripes poisoned RN50, in the top right plot, we see how the training schedule generaliz…
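
The captions describe two ways a trigger reaches an image: pasted as a small patch (BadNet-style) or overlaid on the whole image with a small blending coefficient (Blended-style, 0.03 in the Figure 6 caption). The sketch below illustrates both on images stored as nested lists of grayscale values in [0, 1]. Eq. (1) is not reproduced on this page, so the convex-combination form used in `blend` is an assumption based on the standard Blended attack; the stripe orientation, the `period` parameter, and all function names are hypothetical.

```python
def stripes_trigger(height, width, period=4):
    """Hypothetical striped pattern: horizontal bands alternating
    between 1.0 and 0.0 every `period` rows."""
    return [[1.0 if (r // period) % 2 == 0 else 0.0 for _ in range(width)]
            for r in range(height)]


def blend(image, trigger, n_c=0.03):
    """Blended-style overlay (assumed form):
    (1 - n_c) * image + n_c * trigger, applied pixel by pixel."""
    return [[(1.0 - n_c) * px + n_c * tr for px, tr in zip(img_row, trg_row)]
            for img_row, trg_row in zip(image, trigger)]


def paste_patch(image, patch, top=0, left=0):
    """BadNet-style trigger: overwrite a small region with the patch.
    Returns a copy; the input image is left unchanged."""
    out = [row[:] for row in image]
    for r, patch_row in enumerate(patch):
        out[top + r][left:left + len(patch_row)] = patch_row
    return out


# Toy usage on a uniform 8x8 gray image.
image = [[0.5] * 8 for _ in range(8)]
blended = blend(image, stripes_trigger(8, 8, period=2), n_c=0.03)
patched = paste_patch(image, stripes_trigger(2, 2, period=1), top=1, left=1)
```

With a coefficient as small as 0.03 each pixel moves by at most 0.03, which is consistent with the captions' description of low-contrast overlays, whereas the patch variant replaces pixels outright in a small region.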

Original abstract

Vision-Language models like CLIP have been shown to be highly effective at linking visual perception and natural language understanding, enabling sophisticated image-text capabilities, including strong retrieval and zero-shot classification performance. Their widespread use, as well as the fact that CLIP models are trained on image-text pairs from the web, make them both a worthwhile and relatively easy target for backdoor attacks. As training foundational models, such as CLIP, from scratch is very expensive, this paper focuses on cleaning potentially poisoned models via fine-tuning. We first show that existing cleaning techniques are not effective against simple structured triggers used in Blended or BadNet backdoor attacks, exposing a critical vulnerability for potential real-world deployment of these models. Then, we introduce PAR, Perturb and Recover, a surprisingly simple yet effective mechanism to remove backdoors from CLIP models. Through extensive experiments across different encoders and types of backdoor attacks, we show that PAR achieves high backdoor removal rate while preserving good standard performance. Finally, we illustrate that our approach is effective even only with synthetic text-image pairs, i.e. without access to real training data. The code and models are available on GitHub: https://github.com/nmndeep/PerturbAndRecover

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that existing backdoor cleaning methods fail against structured triggers (Blended, BadNet) in CLIP models, and introduces Perturb and Recover (PAR), a fine-tuning procedure that achieves high backdoor removal rates while preserving standard performance. It further claims effectiveness even when fine-tuning uses only synthetic image-text pairs, without access to the original poisoned dataset or trigger knowledge. These claims are supported by experiments across multiple encoders and attack types.

Significance. If the results hold, PAR provides a practical post-training defense for widely deployed CLIP models whose training data cannot be audited, addressing a real deployment risk at far lower cost than retraining. The public release of code and models on GitHub is a clear strength that supports reproducibility and follow-up work.

major comments (2)

[Abstract] Abstract (paragraph on PAR and synthetic data experiments): the central claim that PAR severs backdoor associations for structured triggers using only synthetic data rests on the unexamined assumption that the perturb-and-recover process breaks the trigger mapping rather than merely suppressing it on the evaluated test triggers; no mechanistic analysis or ablation is supplied to distinguish these outcomes, even though the paper itself shows prior methods fail precisely on these triggers.
[Experiments] Experiments section (synthetic-data results): the reported high removal rates on synthetic pairs do not include controls that would rule out the possibility that the synthetic distribution simply avoids the trigger manifold, leaving open whether the method generalizes when the backdoor is encoded in a manner that survives clean fine-tuning.

minor comments (2)

[Abstract] The abstract describes PAR as 'surprisingly simple' without indicating the precise form of the perturbation operator or the recover objective; a short equation or pseudocode in the method section would improve clarity.
Table or figure captions for the main results should explicitly state the number of runs and any statistical significance tests performed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and agree that revisions are warranted to strengthen the manuscript.

Point-by-point responses

Referee: [Abstract] Abstract (paragraph on PAR and synthetic data experiments): the central claim that PAR severs backdoor associations for structured triggers using only synthetic data rests on the unexamined assumption that the perturb-and-recover process breaks the trigger mapping rather than merely suppressing it on the evaluated test triggers; no mechanistic analysis or ablation is supplied to distinguish these outcomes, even though the paper itself shows prior methods fail precisely on these triggers.

Authors: We agree that a mechanistic distinction between severing the trigger association versus suppressing it on the specific evaluated triggers would strengthen the central claim. While our results show PAR succeeding where prior methods fail on structured triggers, we did not include ablations on trigger variations or embedding analyses. In revision we will add such an ablation (testing modified trigger patterns and comparing embedding shifts) to provide direct evidence that the mapping is disrupted.

Revision: yes

Referee: [Experiments] Experiments section (synthetic-data results): the reported high removal rates on synthetic pairs do not include controls that would rule out the possibility that the synthetic distribution simply avoids the trigger manifold, leaving open whether the method generalizes when the backdoor is encoded in a manner that survives clean fine-tuning.

Authors: This is a valid concern. The synthetic pairs were generated to approximate the diversity of the original data, yet we did not explicitly compare against standard fine-tuning on the identical synthetic distribution. In the revised manuscript we will add this control experiment to demonstrate that standard fine-tuning on the synthetic pairs leaves the backdoor largely intact while PAR removes it, thereby showing the removal is attributable to the perturb-and-recover procedure.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method validated by direct experiments

Full rationale

The paper introduces PAR as a fine-tuning procedure and reports its performance via experiments on multiple encoders, attack types, and data regimes (including synthetic pairs). No equations, derivations, or first-principles predictions are claimed; the central results are measured outcomes on held-out test sets rather than quantities forced by construction from fitted parameters or self-referential definitions. Self-citations, if present, are not load-bearing for any uniqueness theorem or ansatz. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions of fine-tuning efficacy and the existence of synthetic data proxies. No free parameters or invented entities are introduced beyond typical ML training choices, and the only axiom is the domain assumption listed here.

axioms (1)

Domain assumption: Fine-tuning on clean or synthetic data can unlearn backdoor triggers without access to the original poisoned dataset.
Invoked in the description of PAR effectiveness with synthetic pairs (abstract).

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 5 internal anchors

[1] Eugene Bagdasaryan, Andreas Veit, Yiqing Hua, Deborah Estrin, and Vitaly Shmatikov. How to backdoor federated learning. In AISTATS, 2020.
[2] Hritik Bansal, Nishad Singhi, Yu Yang, Fan Yin, Aditya Grover, and Kai-Wei Chang. Cleanclip: Mitigating data poisoning attacks in multimodal contrastive learning. In ICCV,
[3] Mauro Barni, Kassem Kallas, and Benedetta Tondi. A new backdoor attack in cnns by training set corruption without label poisoning. In 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019.
[4] Eitan Borgnia, Valeriia Cherepanova, Liam Fowl, Amin Ghiasi, Jonas Geiping, Micah Goldblum, Tom Goldstein, and Arjun Gupta. Strong data augmentation sanitizes poisoning and backdoor attacks without an accuracy tradeoff. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE,
[5] Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022.
[6] Nicholas Carlini and Andreas Terzis. Poisoning and backdooring contrastive learning. In ICLR, 2022.
[7] Nicholas Carlini, Matthew Jagielski, Christopher A Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, and Florian Tramèr. Poisoning web-scale training datasets is practical. In 2024 IEEE Symposium on Security and Privacy (SP). IEEE Computer Society, 2024.
[8] Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526,
[9] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023.
[10] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation strategies from data. In CVPR, 2019.
[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
[12] Terrance DeVries. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
[13] Khoa D Doan, Yingjie Lao, Peng Yang, and Ping Li. Defending backdoor attacks on vision transformer via patch processing. In AAAI, 2023.
[14] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,
[15] Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. In NeurIPS, 2024.
[16] Kuofeng Gao, Yang Bai, Jindong Gu, Yong Yang, and Shu-Tao Xia. Backdoor defense via adaptively splitting poisoned dataset. In CVPR, 2023.
[17] Chenxi Gu, Chengsong Huang, Xiaoqing Zheng, Kai-Wei Chang, and Cho-Jui Hsieh. Watermarking pre-trained language models with backdooring. arXiv preprint arXiv:2210.07543, 2022.
[18] Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733, 2017.
[19] Tianyu Gu, Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Evaluating backdooring attacks on deep neural networks. IEEE Access, 7, 2019.
[20] Hasan Abed Al Kader Hammoud, Hani Itani, Fabio Pizzati, Adel Bibi, and Bernard Ghanem. Synthclip: Are we ready for a fully synthetic clip training? In Synthetic Data for Computer Vision Workshop @ CVPR, 2024.
[21] Dominik Hintersdorf, Lukas Struppek, Daniel Neider, and Kristian Kersting. Defending our privacy with backdoors. In NeurIPS 2023 Workshop on Backdoors in Deep Learning - The Good, the Bad, and the Ugly, 2024.
[22] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip, 2021.
[23] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.
[24] Jinyuan Jia, Yupei Liu, and Neil Zhenqiang Gong. Badencoder: Backdoor attacks to pre-trained encoders in self-supervised learning. In 2022 IEEE Symposium on Security and Privacy (SP). IEEE, 2022.
[25] Junhao Kuang, Siyuan Liang, Jiawei Liang, Kuanrong Liu, and Xiaochun Cao. Adversarial backdoor defense in clip. arXiv preprint arXiv:2409.15968, 2024.
[26] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML,
[27] Shaofeng Li, Minhui Xue, Benjamin Zi Hao Zhao, Haojin Zhu, and Xinpeng Zhang. Invisible backdoor attacks on deep neural networks via steganography and regularization. IEEE Transactions on Dependable and Secure Computing, 18(5),
[29] Yige Li, Xixiang Lyu, Nodens Koren, Lingjuan Lyu, Bo Li, and Xingjun Ma. Neural attention distillation: Erasing backdoor triggers from deep neural networks. In ICLR, 2021.
[30] Siyuan Liang, Mingli Zhu, Aishan Liu, Baoyuan Wu, Xiaochun Cao, and Ee-Chien Chang. Badclip: Dual-embedding guided backdoor attack on multimodal contrastive learning. In CVPR, 2024.
[31] Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In CVPR, 2024.
[32] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
[33] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR,
[34] Yunfei Liu, Xingjun Ma, James Bailey, and Feng Lu. Reflection backdoor: A natural backdoor attack on deep neural networks. In ECCV, 2020.
[35] Tuan Anh Nguyen and Anh Tuan Tran. Wanet - imperceptible warping-based backdoor attack. In ICLR, 2021.
[36] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
[37] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
[38] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. In NeurIPS, 2022.
[39] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL,
[40] Qwen Team. Introducing qwen1.5, 2024.
[41] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9(86), 2008.
[42] Haonan Wang, Qianli Shen, Yao Tong, Yang Zhang, and Kenji Kawaguchi. The stronger the diffusion model, the easier the backdoor: Data poisoning to induce copyright breaches without adjusting finetuning pipeline. In ICML, 2024.
[43] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023.
[44] Yifei Wang, Wenhan Ma, Stefanie Jegelka, and Yisen Wang. How to craft backdoors with unlabeled data alone? In ICLR 2024 Workshop on Navigating and Addressing Data Problems for Foundation Models, 2024.
[45] Jason Wei and Kai Zou. Eda: Easy data augmentation techniques for boosting performance on text classification tasks. In ACL, 2019.
[46] Dongxian Wu and Yisen Wang. Adversarial neuron pruning purifies backdoored deep models. In NeurIPS, 2021.
[47] Wenhan Yang, Jingdong Gao, and Baharan Mirzasoleiman. Robust contrastive language-image pretraining against data poisoning and backdoor attacks. In NeurIPS, 2023.
[48] Wenhan Yang, Jingdong Gao, and Baharan Mirzasoleiman. Better safe than sorry: Pre-training clip against targeted data poisoning and backdoor attacks. In ICML, 2024.
[49] Ziqing Yang, Xinlei He, Zheng Li, Michael Backes, Mathias Humbert, Pascal Berrang, and Yang Zhang. Data poisoning attacks against multimodal encoders. In ICML, 2023.
[50] Mingli Zhu, Shaokui Wei, Li Shen, Yanbo Fan, and Baoyuan Wu. Enhancing fine-tuning based backdoor defense with sharpness-aware minimization. In ICCV, 2023.
[51] Mingli Zhu, Shaokui Wei, Hongyuan Zha, and Baoyuan Wu. Neural polarizer: A lightweight and effective backdoor defense via purifying poisoned features. In NeurIPS, 2024.

Supplementary Material

Contents

App. A: Experimental details and discussions
App. B: Additional experiments
App. C: More visualizations

A. Experimental Details and Discussions

In this section we detail the setup related to all the experiments conducted in this work. We detail how we select training hyperparameters like batch size (BS), learning rate (LR), datasets used, optimizer, etc., for poisoning and cleaning across methods and models. All experime...

[1] [1]

How to backdoor federated learning

Eugene Bagdasaryan, Andreas Veit, Yiqing Hua, Deborah Estrin, and Vitaly Shmatikov. How to backdoor federated learning. In AISTATS, 2020. 2

work page 2020

[2] [2]

Cleanclip: Mitigating data poi- soning attacks in multimodal contrastive learning

Hritik Bansal, Nishad Singhi, Yu Yang, Fan Yin, Aditya Grover, and Kai-Wei Chang. Cleanclip: Mitigating data poi- soning attacks in multimodal contrastive learning. In ICCV,

work page

[3] [3]

A new backdoor attack in cnns by training set corruption without label poisoning

Mauro Barni, Kassem Kallas, and Benedetta Tondi. A new backdoor attack in cnns by training set corruption without label poisoning. In 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019. 2, 1

work page 2019

[4] [4]

Strong data augmentation sanitizes poi- soning and backdoor attacks without an accuracy tradeoff

Eitan Borgnia, Valeriia Cherepanova, Liam Fowl, Amin Ghiasi, Jonas Geiping, Micah Goldblum, Tom Goldstein, and Arjun Gupta. Strong data augmentation sanitizes poi- soning and backdoor attacks without an accuracy tradeoff. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) . IEEE,

work page 2021

[5] [5]

Coyo-700m: Image-text pair dataset

Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset. https://github.com/kak aobrain/coyo-dataset, 2022. 3

work page 2022

[6] [6]

Poisoning and back- dooring contrastive learning

Nicholas Carlini and Andreas Terzis. Poisoning and back- dooring contrastive learning. In ICLR, 2022. 2

work page 2022

[7] [7]

Poisoning web-scale training datasets is practical

Nicholas Carlini, Matthew Jagielski, Christopher A Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum An- derson, Andreas Terzis, Kurt Thomas, and Florian Tram `er. Poisoning web-scale training datasets is practical. In 2024 IEEE Symposium on Security and Privacy (SP). IEEE Com- puter Society, 2024. 1, 2, 3

work page 2024

[8] Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526,

[9] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023.

[10] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation strategies from data. In CVPR, 2019.

[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.

[12] Terrance DeVries. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.

[13] Khoa D Doan, Yingjie Lao, Peng Yang, and Ping Li. Defending backdoor attacks on vision transformer via patch processing. In AAAI, 2023.

[14] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,

[15] Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. In NeurIPS, 2024.

[16] Kuofeng Gao, Yang Bai, Jindong Gu, Yong Yang, and Shu-Tao Xia. Backdoor defense via adaptively splitting poisoned dataset. In CVPR, 2023.

[17] Chenxi Gu, Chengsong Huang, Xiaoqing Zheng, Kai-Wei Chang, and Cho-Jui Hsieh. Watermarking pre-trained language models with backdooring. arXiv preprint arXiv:2210.07543, 2022.

[18] Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733, 2017.

[19] Tianyu Gu, Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Evaluating backdooring attacks on deep neural networks. IEEE Access, 7, 2019.

[20] Hasan Abed Al Kader Hammoud, Hani Itani, Fabio Pizzati, Adel Bibi, and Bernard Ghanem. Synthclip: Are we ready for a fully synthetic clip training? In Synthetic Data for Computer Vision Workshop @ CVPR, 2024.

[21] Dominik Hintersdorf, Lukas Struppek, Daniel Neider, and Kristian Kersting. Defending our privacy with backdoors. In NeurIPS 2023 Workshop on Backdoors in Deep Learning - The Good, the Bad, and the Ugly, 2024.

[22] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip, 2021.

[23] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.

[24] Jinyuan Jia, Yupei Liu, and Neil Zhenqiang Gong. Badencoder: Backdoor attacks to pre-trained encoders in self-supervised learning. In 2022 IEEE Symposium on Security and Privacy (SP). IEEE, 2022.

[25] Junhao Kuang, Siyuan Liang, Jiawei Liang, Kuanrong Liu, and Xiaochun Cao. Adversarial backdoor defense in clip. arXiv preprint arXiv:2409.15968, 2024.

[26] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML,

[27] Shaofeng Li, Minhui Xue, Benjamin Zi Hao Zhao, Haojin Zhu, and Xinpeng Zhang. Invisible backdoor attacks on deep neural networks via steganography and regularization. IEEE Transactions on Dependable and Secure Computing, 18(5),

[29] Yige Li, Xixiang Lyu, Nodens Koren, Lingjuan Lyu, Bo Li, and Xingjun Ma. Neural attention distillation: Erasing backdoor triggers from deep neural networks. In ICLR, 2021.

[30] Siyuan Liang, Mingli Zhu, Aishan Liu, Baoyuan Wu, Xiaochun Cao, and Ee-Chien Chang. Badclip: Dual-embedding guided backdoor attack on multimodal contrastive learning. In CVPR, 2024.

[31] Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In CVPR, 2024.

[32] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.

[33] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR,

[34] Yunfei Liu, Xingjun Ma, James Bailey, and Feng Lu. Reflection backdoor: A natural backdoor attack on deep neural networks. In ECCV, 2020.

[35] Tuan Anh Nguyen and Anh Tuan Tran. Wanet - imperceptible warping-based backdoor attack. In ICLR, 2021.

[36] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.

[37] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.

[38] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. In NeurIPS, 2022.

[39] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL,

[40] Qwen Team. Introducing qwen1.5, 2024.

[41] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9(86), 2008.

[42] Haonan Wang, Qianli Shen, Yao Tong, Yang Zhang, and Kenji Kawaguchi. The stronger the diffusion model, the easier the backdoor: Data poisoning to induce copyright breaches without adjusting finetuning pipeline. In ICML, 2024.

[43] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023.

[44] Yifei Wang, Wenhan Ma, Stefanie Jegelka, and Yisen Wang. How to craft backdoors with unlabeled data alone? In ICLR 2024 Workshop on Navigating and Addressing Data Problems for Foundation Models, 2024.

[45] Jason Wei and Kai Zou. Eda: Easy data augmentation techniques for boosting performance on text classification tasks. In ACL, 2019.

[46] Dongxian Wu and Yisen Wang. Adversarial neuron pruning purifies backdoored deep models. In NeurIPS, 2021.

[47] Wenhan Yang, Jingdong Gao, and Baharan Mirzasoleiman. Robust contrastive language-image pretraining against data poisoning and backdoor attacks. In NeurIPS, 2023.

[48] Wenhan Yang, Jingdong Gao, and Baharan Mirzasoleiman. Better safe than sorry: Pre-training clip against targeted data poisoning and backdoor attacks. In ICML, 2024.

[49] Ziqing Yang, Xinlei He, Zheng Li, Michael Backes, Mathias Humbert, Pascal Berrang, and Yang Zhang. Data poisoning attacks against multimodal encoders. In ICML, 2023.

[50] Mingli Zhu, Shaokui Wei, Li Shen, Yanbo Fan, and Baoyuan Wu. Enhancing fine-tuning based backdoor defense with sharpness-aware minimization. In ICCV, 2023.

[51] Mingli Zhu, Shaokui Wei, Hongyuan Zha, and Baoyuan Wu. Neural polarizer: A lightweight and effective backdoor defense via purifying poisoned features. In NeurIPS, 2024.

Perturb and Recover: Fine-tuning for Effective Backdoor Removal from CLIP

Supplementary Material

Contents

App. A . . . Experimental details and discussions
App. B . . . Additional experiments
App. C . . . More visualizations

A. Experimental Details and Discussions

In this section we detail the setup related to all the experiments conducted in this work. We detail how we select training hyperparameters like batch size (BS), learning rate (LR), datasets used, optimizer, etc., for poisoning and cleaning across methods and models. All experime...