CLIP-Inspector: Model-Level Backdoor Detection for Prompt-Tuned CLIP via OOD Trigger Inversion
Pith reviewed 2026-05-10 17:47 UTC · model grok-4.3
The pith
CLIP-Inspector detects backdoors in prompt-tuned CLIP models by reconstructing triggers from out-of-distribution images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CLIP-Inspector reconstructs effective triggers for each class in a single epoch from only 1,000 unlabeled OOD images, identifies backdoored prompt-tuned CLIP models at 94 percent accuracy across 50 models, and yields an AUROC of 0.973 compared with 0.495 and 0.687 for adapted baselines.
What carries the argument
OOD trigger inversion, which optimizes an input pattern on out-of-distribution images so that the prompt-tuned model outputs a chosen target class, thereby exposing any implanted backdoor association.
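The inversion step described above can be sketched on a toy frozen classifier. This is an illustrative reconstruction only, not the paper's objective: the linear "model", OOD pool, learning rate, and step count are invented for the example, and real CLIP inversion would optimize a bounded image-space pattern through the vision encoder.

```python
import numpy as np

# Hypothetical sketch of OOD trigger inversion: optimise an additive
# pattern `delta` so a frozen toy linear classifier maps arbitrary
# OOD inputs to a chosen target class.
rng = np.random.default_rng(0)
n_classes, dim = 5, 16
W = rng.normal(size=(n_classes, dim))   # frozen "model" weights
ood = rng.normal(size=(1000, dim))      # unlabeled OOD pool (stand-in)
target = 2                              # attacker-chosen class to probe
delta = np.zeros(dim)                   # candidate trigger pattern

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

lr = 0.5
for _ in range(300):
    probs = softmax((ood + delta) @ W.T)
    # gradient of mean cross-entropy toward `target` w.r.t. delta
    grad = ((probs - np.eye(n_classes)[target]) @ W).mean(axis=0)
    delta -= lr * grad

# attack success rate of the recovered pattern on the OOD pool
asr = (np.argmax((ood + delta) @ W.T, axis=1) == target).mean()
```

A low final loss together with a high `asr` for some class is the kind of signal a model-level detector can then threshold on.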
If this is right
- Organizations can verify whether a delivered prompt-tuned model is backdoored before deployment.
- Reconstructed triggers enable post-hoc repair by fine-tuning the model on correctly labeled triggered inputs.
- Detection succeeds with only 1,000 OOD images and one training epoch across ten datasets and four attack types.
- The method outperforms adapted trigger-inversion baselines on AUROC for this class of models.
Where Pith is reading between the lines
- The same reconstruction approach could be tested on other vision-language models that rely on prompt tuning rather than full fine-tuning.
- Procuring a diverse, reusable OOD image pool might become a standard practice for organizations that outsource model adaptation.
- If attackers design backdoors specifically to resist OOD-based inversion, the current detection rates would likely decline.
Load-bearing premise
The method requires white-box access to the model together with a sufficient pool of unlabeled out-of-distribution images that allow reliable reconstruction of any backdoor trigger.
What would settle it
Applying CLIP-Inspector to a verified clean prompt-tuned CLIP model and observing a false-positive rate well above the reported level, or applying it to a known backdoored model and finding that trigger reconstruction fails to expose the backdoor.
Figures
Original abstract
Organisations with limited data and computational resources increasingly outsource model training to Machine Learning as a Service (MLaaS) providers, who adapt vision-language models (VLMs) such as CLIP to downstream tasks via prompt tuning rather than training from scratch. This semi-honest setting creates a security risk where a malicious provider can follow the prompt-tuning protocol yet implant a backdoor, forcing triggered inputs to be classified into an attacker-chosen class, even for out-of-distribution (OOD) data. Such backdoors leave encoders untouched, making them undetectable to existing methods that focus on encoder corruption. Other data-level methods, which sanitize data before training or during inference, also fail to answer the critical question, "Is the delivered model backdoored or not?" To address this model-level verification problem, we introduce CLIP-Inspector (CI), a backdoor detection method designed for prompt-tuned CLIP models. Assuming white-box access to the delivered model and a pool of unlabeled OOD images, CI reconstructs possible triggers for each class to determine if the model exhibits backdoor behaviour or not. Additionally, we demonstrate that using CI's reconstructed trigger for fine-tuning on correctly labeled triggered inputs enables us to re-align the model and reduce backdoor effectiveness. Through extensive experiments across ten datasets and four backdoor attacks, we demonstrate that CI can reconstruct effective triggers in a single epoch using only 1,000 OOD images, achieving a 94% detection accuracy (47/50 models). Compared to adapted trigger-inversion baselines, CI yields a markedly higher AUROC score (0.973 vs 0.495/0.687), thus enabling the vetting and post-hoc repair of prompt-tuned CLIP models to ensure safe deployment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CLIP-Inspector (CI), a model-level backdoor detection method for prompt-tuned CLIP models in a semi-honest MLaaS setting. Assuming white-box access and a pool of unlabeled OOD images, CI reconstructs possible triggers for each class to identify backdoored models. Experiments across ten datasets and four attacks report that CI reconstructs effective triggers in one epoch using only 1,000 OOD images, achieving 94% detection accuracy (47/50 models) and AUROC 0.973, outperforming adapted baselines (0.495/0.687). A secondary contribution is using the reconstructed trigger for fine-tuning to reduce backdoor effectiveness.
Significance. If the results hold, the work is significant because it targets backdoors implanted via prompt tuning that leave encoders intact, a threat model not addressed by existing encoder-focused or data-sanitization defenses. The reported efficiency with limited OOD data and the post-hoc repair capability are practical strengths for vetting outsourced VLMs.
major comments (3)
- The abstract reports strong quantitative results (94% accuracy, AUROC 0.973) across ten datasets and four attacks, but supplies no details on trigger-inversion objective, convergence criteria, false-positive rates, or how baselines were adapted, leaving the central claim difficult to evaluate without the full method description.
- The comparison to adapted trigger-inversion baselines reports markedly higher AUROC (0.973 vs 0.495/0.687), but without specifying adaptation details or the exact metric/threshold for declaring a reconstructed trigger 'effective', it is unclear whether the performance gap is attributable to CI or to differences in experimental protocol.
- The method assumes a sufficient pool of unlabeled OOD images enables reliable per-class trigger reconstruction; no sensitivity analysis is provided on the minimum number of images, choice of OOD distribution, or false-positive behavior on clean models, which is load-bearing for the 94% accuracy and practical-applicability claims.
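Since two of the comments turn on how AUROC and the detection threshold are computed, here is a minimal sketch of both from per-model anomaly scores (e.g. the best reconstructed trigger's ASR). The scores below are synthetic placeholders, not the paper's numbers; the paper cites Youden's index, but this exact thresholding form is an assumption.

```python
import numpy as np

# Synthetic anomaly scores for five clean (label 0) and five
# backdoored (label 1) models -- placeholders for illustration.
clean  = np.array([0.05, 0.10, 0.08, 0.20, 0.12])
poison = np.array([0.95, 0.88, 0.97, 0.60, 0.91])
scores = np.concatenate([clean, poison])
labels = np.concatenate([np.zeros(len(clean)), np.ones(len(poison))])

def auroc(labels, scores):
    # probability a random positive outscores a random negative,
    # with ties counted as half a win
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return wins + 0.5 * ties

def youden_threshold(labels, scores):
    # operating threshold maximizing J = TPR - FPR (Youden's index)
    best_t, best_j = None, -1.0
    for t in np.unique(scores):
        tpr = (scores[labels == 1] >= t).mean()
        fpr = (scores[labels == 0] >= t).mean()
        if tpr - fpr > best_j:
            best_j, best_t = tpr - fpr, t
    return best_t

print(auroc(labels, scores))  # 1.0 for these fully separable scores
```

Under this metric, an AUROC near 0.5 (as reported for one baseline) means the baseline's anomaly score barely separates clean from backdoored models at all.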
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of our method and results. We address each major comment below with references to the full manuscript and indicate planned revisions.
Point-by-point responses
-
Referee: The abstract reports strong quantitative results (94% accuracy, AUROC 0.973) across ten datasets and four attacks, but supplies no details on trigger-inversion objective, convergence criteria, false-positive rates, or how baselines were adapted, leaving the central claim difficult to evaluate without the full method description.
Authors: We agree the abstract is concise and omits key details. The trigger-inversion objective (maximizing target-class confidence via OOD images) and one-epoch convergence are specified in Section 3.2; false-positive behavior on clean models appears in Section 4.3; baseline adaptations are described in Section 4.2. To improve standalone readability we will expand the abstract with one sentence on the objective and convergence criterion while retaining the quantitative claims. revision: partial
-
Referee: The comparison to adapted trigger-inversion baselines reports markedly higher AUROC (0.973 vs 0.495/0.687), but without specifying adaptation details or the exact metric/threshold for declaring a reconstructed trigger 'effective', it is unclear whether the performance gap is attributable to CI or to differences in experimental protocol.
Authors: We acknowledge the need for explicit protocol details. Section 4.2 states that baselines were re-implemented with the identical 1,000-image OOD pool and per-class reconstruction loop; an effective trigger is defined as one yielding attack success rate >90% on held-out triggered samples, with detection threshold set by AUROC on reconstruction loss. We will insert a new paragraph in Section 4.2 that tabulates the exact adaptation steps and the >90% ASR threshold used for all methods. revision: yes
-
Referee: The method assumes a sufficient pool of unlabeled OOD images enables reliable per-class trigger reconstruction; no sensitivity analysis is provided on the minimum number of images, choice of OOD distribution, or false-positive behavior on clean models, which is load-bearing for the 94% accuracy and practical-applicability claims.
Authors: This observation is correct; the current manuscript reports results only for the 1,000-image setting across ten datasets without a dedicated sensitivity study. We will add a new subsection (4.4) containing (i) ablation curves for 100/500/1,000/2,000 OOD images, (ii) results using two additional OOD distributions (ImageNet subsets and synthetic noise), and (iii) explicit false-positive rates on the three clean models. These additions directly address the load-bearing assumptions. revision: yes
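The per-class decision rule the responses describe (high ASR for the recovered trigger plus unusually low reconstruction loss) can be sketched as follows. The >90% ASR threshold comes from the response above; the standard-deviation loss margin and its value are hypothetical choices for this illustration, not the authors' exact rule.

```python
import statistics

ASR_THRESHOLD = 0.90  # from the stated >90% ASR criterion

def flag_backdoor(per_class_asr, per_class_loss, loss_margin=2.0):
    """Return indices of suspicious classes.

    A class is flagged when its reconstructed trigger attains an ASR
    above the threshold AND its reconstruction loss sits more than
    `loss_margin` standard deviations below the mean loss across
    classes. The margin is an assumed value for this sketch.
    """
    mu = statistics.mean(per_class_loss)
    sd = statistics.pstdev(per_class_loss) or 1.0  # guard zero spread
    return [i for i, (a, l) in enumerate(zip(per_class_asr, per_class_loss))
            if a > ASR_THRESHOLD and (mu - l) / sd > loss_margin]
```

Requiring both signals is what keeps false positives down on clean models, where no class should show a simultaneously low loss and high ASR.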
Circularity Check
No significant circularity detected
Full rationale
The paper describes CLIP-Inspector as an empirical trigger-reconstruction procedure that operates on white-box prompt-tuned CLIP models and a pool of unlabeled OOD images. Detection is performed by attempting to invert class-specific triggers and observing whether the reconstructed triggers induce backdoor behavior; success is measured directly against held-out data and external baselines across ten datasets and four attacks. No equations, parameters, or decision rules are defined in terms of the same data used for final evaluation, and no load-bearing steps reduce to self-citation chains or fitted inputs renamed as predictions. The central claim therefore remains an independent empirical verification method rather than a definitional or self-referential loop.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: white-box access to the delivered prompt-tuned model is available.
- Domain assumption: a pool of unlabeled OOD images exists and is representative enough for trigger search.
Reference graph
Works this paper leans on
-
[1]
BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP
Jiawang Bai, Kuofeng Gao, Shaobo Min, Shu-Tao Xia, Zhifeng Li, and Wei Liu. BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24239–24250, 2024.
2024
-
[2]
Food-101–mining discriminative components with random forests
Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – Mining discriminative components with random forests. In Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI, pages 446–461. Springer, 2014.
2014
-
[3]
Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning
Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526, 2017.
arXiv 2017
-
[4]
Describing textures in the wild
Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3606–3613, 2014.
2014
-
[5]
Imagenet: A large-scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
2009
-
[6]
Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories
Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In 2004 Conference on Computer Vision and Pattern Recognition Workshop, pages 178–178. IEEE, 2004.
2004
-
[7]
Detecting Backdoors in Pre-trained Encoders
Shiwei Feng, Guanhong Tao, Siyuan Cheng, Guangyu Shen, Xiangzhe Xu, Yingqi Liu, Kaiyuan Zhang, Shiqing Ma, and Xiangyu Zhang. Detecting Backdoors in Pre-trained Encoders. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16352–16362, 2023.
2023
-
[8]
STRIP: A defence against trojan attacks on deep neural networks
Yansong Gao, Change Xu, Derui Wang, Shiping Chen, Damith C Ranasinghe, and Surya Nepal. STRIP: A defence against trojan attacks on deep neural networks. In Proceedings of the 35th Annual Computer Security Applications Conference, pages 113–125, 2019.
2019
-
[9]
Backdoor attack with sparse and invisible trigger
Yinghua Gao, Yiming Li, Xueluan Gong, Zhifeng Li, Shu-Tao Xia, and Qian Wang. Backdoor attack with sparse and invisible trigger. IEEE Transactions on Information Forensics and Security, 19:6364–6376, 2024.
2024
-
[10]
Backdoor smoothing: Demystifying backdoor attacks on deep neural networks
Kathrin Grosse, Taesung Lee, Battista Biggio, Youngja Park, Michael Backes, and Ian Molloy. Backdoor smoothing: Demystifying backdoor attacks on deep neural networks. Computers & Security, 120:102814, 2022.
2022
-
[11]
BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain
Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain. arXiv preprint arXiv:1708.06733, 2019.
arXiv 2019
-
[12]
Scale-up: An efficient black-box input-level backdoor detection via analyzing scaled prediction consistency
Junfeng Guo, Yiming Li, Xun Chen, Hanqing Guo, Lichao Sun, and Cong Liu. Scale-up: An efficient black-box input-level backdoor detection via analyzing scaled prediction consistency. arXiv preprint arXiv:2302.03251, 2023.
-
[13]
Tabor: A highly accurate approach to inspecting and restoring trojan backdoors in AI systems
Wenbo Guo, Lun Wang, Xinyu Xing, Min Du, and Dawn Song. Tabor: A highly accurate approach to inspecting and restoring trojan backdoors in AI systems, 2019.
2019
-
[14]
EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification
Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019.
2019
-
[15]
DeDe: Detecting Backdoor Samples for SSL Encoders via Decoders
Sizai Hou, Songze Li, and Duanyi Yao. DeDe: Detecting Backdoor Samples for SSL Encoders via Decoders. arXiv preprint arXiv:2411.16154, 2024.
-
[16]
Detecting Backdoor Samples in Contrastive Language Image Pretraining
Hanxun Huang, Sarah Erfani, Yige Li, Xingjun Ma, and James Bailey. Detecting Backdoor Samples in Contrastive Language Image Pretraining. arXiv preprint arXiv:2502.01385, 2025.
-
[17]
Universal litmus patterns: Revealing backdoor attacks in CNNs
Soheil Kolouri, Aniruddha Saha, Hamed Pirsiavash, and Heiko Hoffmann. Universal litmus patterns: Revealing backdoor attacks in CNNs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 301–310, 2020.
2020
-
[18]
The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale
Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision, 128(7):1956–1981, 2020.
2020
-
[19]
A double-edged sword: The power of two in defending against dnn backdoor attacks
Quentin Le Roux, Kassem Kallas, and Teddy Furon. A double-edged sword: The power of two in defending against DNN backdoor attacks. In 2024 32nd European Signal Processing Conference (EUSIPCO), pages 2007–2011. IEEE, 2024.
2024
-
[20]
Invisible Backdoor Attack with Sample-Specific Triggers
Yuezun Li, Yiming Li, Baoyuan Wu, Longkang Li, Ran He, and Siwei Lyu. Invisible Backdoor Attack with Sample-Specific Triggers. arXiv preprint arXiv:2012.03816, 2021.
-
[21]
BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning
Siyuan Liang, Mingli Zhu, Aishan Liu, Baoyuan Wu, Xiaochun Cao, and Ee-Chien Chang. BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24645–24654, Seattle, WA, USA, 2024. IEEE.
2024
-
[22]
Detecting backdoors during the inference stage based on corruption robustness consistency
Xiaogeng Liu, Minghui Li, Haoyu Wang, Shengshan Hu, Dengpan Ye, Hai Jin, Libing Wu, and Chaowei Xiao. Detecting backdoors during the inference stage based on corruption robustness consistency. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16363–16372, 2023.
2023
-
[23]
Trojaning attack on neural networks
Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and X. Zhang. Trojaning attack on neural networks. In Network and Distributed System Security Symposium, 2018.
2018
-
[24]
Complex backdoor detection by symmetric feature differencing
Yingqi Liu, Guangyu Shen, Guanhong Tao, Zhenting Wang, Shiqing Ma, and Xiangyu Zhang. Complex backdoor detection by symmetric feature differencing. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14983–14993, 2022.
2022
-
[25]
Fine-Grained Visual Classification of Aircraft
Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
arXiv 2013
-
[26]
WaNet – Imperceptible Warping-based Backdoor Attack
Anh Nguyen and Anh Tran. WaNet – Imperceptible Warping-based Backdoor Attack. arXiv preprint arXiv:2102.10369, 2021.
-
[27]
Input-Aware Dynamic Backdoor Attack
Tuan Anh Nguyen and Anh Tran. Input-Aware Dynamic Backdoor Attack. In Advances in Neural Information Processing Systems, pages 3454–3464. Curran Associates, Inc., 2020.
-
[28]
Automated flower classification over a large number of classes
Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE, 2008.
2008
-
[29]
BDetCLIP: Multimodal Prompting Contrastive Test-Time Backdoor Detection
Yuwei Niu, Shuo He, Qi Wei, Zongyu Wu, Feng Liu, and Lei Feng. BDetCLIP: Multimodal Prompting Contrastive Test-Time Backdoor Detection. arXiv preprint arXiv:2405.15269, 2024.
-
[30]
Cats and dogs
Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3498–3505. IEEE, 2012.
2012
-
[31]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. 2021.
2021
-
[32]
Hidden Trigger Backdoor Attacks
Aniruddha Saha, Akshayvarun Subramanya, and Hamed Pirsiavash. Hidden Trigger Backdoor Attacks. arXiv preprint arXiv:1910.00033, 2019.
-
[33]
Backdoor scanning for deep neural networks through k-arm optimization
Guangyu Shen, Yingqi Liu, Guanhong Tao, Shengwei An, Qiuling Xu, Siyuan Cheng, Shiqing Ma, and Xiangyu Zhang. Backdoor scanning for deep neural networks through k-arm optimization, 2021.
2021
-
[34]
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
arXiv 2012
-
[35]
Mask and Restore: Blind Backdoor Defense at Test Time with Masked Autoencoder
Tao Sun, Lu Pang, Chao Chen, and Haibin Ling. Mask and Restore: Blind Backdoor Defense at Test Time with Masked Autoencoder. arXiv preprint arXiv:2303.15564, 2023.
-
[36]
TIJO: Trigger Inversion with Joint Optimization for Defending Multimodal Backdoored Models
Indranil Sur, Karan Sikka, Matthew Walmer, Kaushik Koneripalli, Anirban Roy, Xiao Lin, Ajay Divakaran, and Susmit Jha. TIJO: Trigger Inversion with Joint Optimization for Defending Multimodal Backdoored Models. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 165–175, 2023.
2023
-
[37]
Better trigger inversion optimization in backdoor scanning
Guanhong Tao, Guangyu Shen, Yingqi Liu, Shengwei An, Qiuling Xu, Shiqing Ma, Pan Li, and Xiangyu Zhang. Better trigger inversion optimization in backdoor scanning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13368–13378, 2022.
2022
-
[38]
Defending Against Patch-based Backdoor Attacks on Self-Supervised Learning
Ajinkya Tejankar, Maziar Sanjabi, Qifan Wang, Sinong Wang, Hamed Firooz, Hamed Pirsiavash, and Liang Tan. Defending Against Patch-based Backdoor Attacks on Self-Supervised Learning. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12239–12249, Vancouver, BC, Canada, 2023. IEEE.
2023
-
[39]
Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks
Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y. Zhao. Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks. In 2019 IEEE Symposium on Security and Privacy (SP), pages 707–723, 2019.
2019
-
[40]
Practical detection of trojan neural networks: Data-limited and data-free cases
Ren Wang, Gaoyuan Zhang, Sijia Liu, Pin-Yu Chen, Jinjun Xiong, and Meng Wang. Practical detection of trojan neural networks: Data-limited and data-free cases. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIII, pages 222–238. Springer-Verlag, Berlin, Heidelberg, 2020.
2020
-
[41]
Rethinking the reverse-engineering of trojan triggers
Zhenting Wang, Kai Mei, Hailun Ding, Juan Zhai, and Shiqing Ma. Rethinking the reverse-engineering of trojan triggers. Advances in Neural Information Processing Systems, 35:9738–9753, 2022.
2022
-
[42]
Unicorn: A unified backdoor trigger inversion framework
Zhenting Wang, Kai Mei, Juan Zhai, and Shiqing Ma. Unicorn: A unified backdoor trigger inversion framework, 2023.
2023
-
[43]
SUN database: Large-scale scene recognition from abbey to zoo
Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3485–3492. IEEE, 2010.
2010
-
[44]
Index for rating diagnostic tests
W. J. Youden. Index for rating diagnostic tests. Cancer, 3(1):32–35, 1950.
1950
-
[45]
Conditional prompt learning for vision-language models
Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16816–16825, 2022.
-
[46]
Liuwan Zhu, Rui Ning, Jiang Li, Chunsheng Xin, and Hongyi Wu. Seer: Backdoor detection for vision-language models through searching target text and image trigger jointly. Proceedings of the AAAI Conference on Artificial Intelligence, 38(7):7766–7774, 2024.
2024
-
[47]
Clean model training We prompt-tuned the CLIP model with CoCoOp on 10 image-classification datasets
ACC and ASR for all attack types. 10.1. Clean model training: We prompt-tuned the CLIP model with CoCoOp on 10 image-classification datasets. Accuracy (ACC) values for the seen and unseen subsets of classes for each dataset are shown in Table 6. DTD, EuroSAT, and FGVC datasets achieve the lowest cross-domain (unseen) accuracy, highlighting that prompt tunin...
-
[48]
basin of attraction
Adaptive Attack Against CLIP-Inspector: BadCLIP Adaptive. In the main paper, we introduce BadCLIP Adaptive, a two-phase variant of BadCLIP designed to make the backdoor highly specific, such that only a single, exact trigger pattern should activate the target class, whereas small perturbations around this trigger should revert to the clean label. Here, we ...
-
[49]
Anomaly score discrimination metrics. In this section, we highlight the metrics used by each detection method to mark anomalous classes. 12.1. CI: Our approach flags a class as anomalous when, during a single-epoch optimisation, it exhibits both (i) an unusually low average reconstruction loss and (ii) a high attack-success rate (ASR) for the recovered tr...
-
[50]
Varying number of samples in OOD dataset We vary the number of samples used for backdoor detection in BadCLIP models from 100 to 500 and 1000 samples
Ablation study. 13.1. Varying number of samples in OOD dataset: We vary the number of samples used for backdoor detection in BadCLIP models from 100 to 500 and 1000 samples. The ASR of the reconstructed trigger for the backdoor class is shown in Table 15. ASR drops sharply when sample count falls from 500 to 100. However, the drop observed when Clean BadCLI...
-
[51]
CI is able to separate clean from poisoned models and identify the target class without altering the inversion process or hyperparameters
Generalization to Encoder-Level Backdoors (No Meta-Net) We use the Blended poisoning method to poison the image encoder with three patterns (Gaussian noise, triangle pattern, written text). CI is able to separate clean from poisoned models and identify the target class without altering the inversion process or hyperparameters. Results are given in Table 1...
-
[52]
Clean models are not considered in this ablation; therefore, the average values may differ from those presented in the main paper
Repair Study: Controls and Hyperparameters. We compare CI-trigger repair against controls: Clean-only FT, Random-δ, Wrong-class δ; and report ACC/ASR values averaged over all datasets for each attack type. Clean models are not considered in this ablation; therefore, the average values may differ from those presented in the main paper. Per-attack metrics ...
-
[53]
Structural Similarity (SSIM) scores for original and reconstructed triggers. We show the Structural Similarity or SSIM values for the original and reconstructed triggers (for the backdoor target class) in this section. Recall that the BadCLIP trigger is imperceptible and pervasive, the Blended trigger is pervasive but not imperceptible, SIBA is sparse an...
discussion (0)