pith. machine review for the scientific record.

arxiv: 2604.09101 · v1 · submitted 2026-04-10 · 💻 cs.CR · cs.AI · cs.CV · cs.LG

Recognition: no theorem link

CLIP-Inspector: Model-Level Backdoor Detection for Prompt-Tuned CLIP via OOD Trigger Inversion


Pith reviewed 2026-05-10 17:47 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.CV · cs.LG
keywords backdoor detection · CLIP · prompt tuning · trigger inversion · out-of-distribution images · model verification · vision-language models · machine learning security

The pith

CLIP-Inspector detects backdoors in prompt-tuned CLIP models by reconstructing triggers from out-of-distribution images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Prompt tuning lets organizations adapt CLIP models from external providers using little data or compute, but a malicious provider can implant a backdoor that forces triggered inputs into an attacker-chosen class. Because the backdoor leaves the encoders unchanged, standard detection methods that look for encoder corruption miss it. CLIP-Inspector addresses the resulting model-level verification problem: given white-box access to the delivered model and a pool of unlabeled out-of-distribution (OOD) images, it reconstructs possible triggers for every class. If any reconstructed trigger causes the model to classify OOD inputs into a target class with high accuracy, the model is flagged as backdoored. The same reconstructed trigger can later be used to repair the model through fine-tuning on correctly labeled data.

Core claim

CLIP-Inspector reconstructs effective triggers for each class in a single epoch from only 1,000 unlabeled OOD images, identifies backdoored prompt-tuned CLIP models at 94 percent accuracy across 50 models, and yields an AUROC of 0.973 compared with 0.495 and 0.687 for adapted baselines.

What carries the argument

OOD trigger inversion, which optimizes an input pattern on out-of-distribution images so that the prompt-tuned model outputs a chosen target class, thereby exposing any implanted backdoor association.
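To make the mechanism concrete, here is a minimal sketch of a one-epoch inversion loop for a single candidate class, assuming a PyTorch-style `model` that maps image batches to class logits and an `ood_loader` yielding batches of unlabeled OOD images; the additive full-image parameterization, Adam optimizer, and L-infinity bound are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def invert_trigger(model, ood_loader, target_class,
                   lr=0.1, eps=8 / 255, device="cuda"):
    """One-epoch trigger inversion for a single candidate class.

    Hypothetical sketch: the paper's exact objective, trigger
    parameterization, and hyperparameters may differ.
    """
    model.eval()
    # Additive full-image pattern, kept inside a small L-inf ball.
    delta = torch.zeros(1, 3, 224, 224, device=device, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    losses = []
    for images in ood_loader:                     # unlabeled OOD images
        images = images.to(device)
        triggered = (images + delta).clamp(0, 1)
        logits = model(triggered)                 # prompt-tuned CLIP head
        target = torch.full((images.size(0),), target_class,
                            dtype=torch.long, device=device)
        loss = F.cross_entropy(logits, target)    # pull toward target class
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)               # keep the trigger small
        losses.append(loss.item())
    return delta.detach(), sum(losses) / len(losses)
```

Running this loop once per class and flagging the model when any reconstructed pattern drives held-out OOD images into its class at a high rate mirrors the detection rule described in the pith above.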

If this is right

  • Organizations can verify whether a delivered prompt-tuned model is backdoored before deployment.
  • Reconstructed triggers enable post-hoc repair by fine-tuning the model on correctly labeled triggered inputs (see the sketch after this list).
  • Detection succeeds with only 1,000 OOD images and one training epoch across ten datasets and four attack types.
  • The method outperforms adapted trigger-inversion baselines on AUROC for this class of models.
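A hedged sketch of that repair step, under the same assumptions as the inversion sketch above: stamp the reconstructed trigger onto images that keep their true labels, so fine-tuning breaks the trigger-to-target-class association. The `clean_loader` of (image, label) pairs and the optimizer settings are illustrative, not the paper's recipe.

```python
import torch
import torch.nn.functional as F

def repair_with_trigger(model, clean_loader, delta, epochs=1,
                        lr=1e-4, device="cuda"):
    """Fine-tune on correctly labeled triggered inputs so the
    reconstructed trigger stops flipping predictions. Sketch only."""
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.Adam(params, lr=lr)
    model.train()
    delta = delta.to(device)
    for _ in range(epochs):
        for images, labels in clean_loader:
            images, labels = images.to(device), labels.to(device)
            # Triggered inputs keep their true labels, so gradient
            # descent unlearns the trigger -> target-class shortcut.
            triggered = (images + delta).clamp(0, 1)
            loss = F.cross_entropy(model(triggered), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```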

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reconstruction approach could be tested on other vision-language models that rely on prompt tuning rather than full fine-tuning.
  • Procuring a diverse, reusable OOD image pool might become a standard practice for organizations that outsource model adaptation.
  • If attackers design backdoors specifically to resist OOD-based inversion, the current detection rates would likely decline.

Load-bearing premise

The method requires white-box access to the model together with a sufficient pool of unlabeled out-of-distribution images that allow reliable reconstruction of any backdoor trigger.

What would settle it

Applying CLIP-Inspector to verified clean prompt-tuned CLIP models and observing a false-positive rate well above the reported level, or applying it to a known backdoored model and finding that trigger reconstruction fails to expose the backdoor.

Figures

Figures reproduced from arXiv:2604.09101 by Akshit Jindal, Chetan Arora, Saket Anand, and Vikram Goyal.

Figure 1: Threat model and audit workflow. A semi-honest provider …
Figure 2: (a) Image-space embeddings for clean (blue) and poisoned (red) OOD inputs substantially overlap, while meta-tokens for poisoned …
Figure 3: Overview of our CLIP-Inspector framework for detecting backdoor attacks. For each candidate class, CI optimizes an …
Figure 4: AUROC plots for each detection method. Our method achieves a high AUROC of 0.973, whereas NC and PixB have low scores of …
Figure 5: Average anomaly scores for each method averaged across …
Figure 6: Clean and Triggered image pairs for each attack type. All triggers are visually imperceptible except Blended.
Original abstract

Organisations with limited data and computational resources increasingly outsource model training to Machine Learning as a Service (MLaaS) providers, who adapt vision-language models (VLMs) such as CLIP to downstream tasks via prompt tuning rather than training from scratch. This semi-honest setting creates a security risk where a malicious provider can follow the prompt-tuning protocol yet implant a backdoor, forcing triggered inputs to be classified into an attacker-chosen class, even for out-of-distribution (OOD) data. Such backdoors leave encoders untouched, making them undetectable to existing methods that focus on encoder corruption. Other data-level methods that sanitize data before training or during inference, also fail to answer the critical question, "Is the delivered model backdoored or not?" To address this model-level verification problem, we introduce CLIP-Inspector (CI), a backdoor detection method designed for prompt-tuned CLIP models. Assuming white-box access to the delivered model and a pool of unlabeled OOD images, CI reconstructs possible triggers for each class to determine if the model exhibits backdoor behaviour or not. Additionally, we demonstrate that using CI's reconstructed trigger for fine-tuning on correctly labeled triggered inputs enables us to re-align the model and reduce backdoor effectiveness. Through extensive experiments across ten datasets and four backdoor attacks, we demonstrate that CI can reconstruct effective triggers in a single epoch using only 1,000 OOD images, achieving a 94% detection accuracy (47/50 models). Compared to adapted trigger-inversion baselines, CI yields a markedly higher AUROC score (0.973 vs 0.495/0.687), thus enabling the vetting and post-hoc repair of prompt-tuned CLIP models to ensure safe deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper introduces CLIP-Inspector (CI), a model-level backdoor detection method for prompt-tuned CLIP models in a semi-honest MLaaS setting. Assuming white-box access and a pool of unlabeled OOD images, CI reconstructs possible triggers for each class to identify backdoored models. Experiments across ten datasets and four attacks report that CI reconstructs effective triggers in one epoch using only 1,000 OOD images, achieving 94% detection accuracy (47/50 models) and AUROC 0.973, outperforming adapted baselines (0.495/0.687). A secondary contribution is using the reconstructed trigger for fine-tuning to reduce backdoor effectiveness.

Significance. If the results hold, the work is significant because it targets backdoors implanted via prompt tuning that leave encoders intact, a threat model not addressed by existing encoder-focused or data-sanitization defenses. The reported efficiency with limited OOD data and the post-hoc repair capability are practical strengths for vetting outsourced VLMs.

major comments (3)
  1. The abstract reports strong quantitative results (94% accuracy, AUROC 0.973) across ten datasets and four attacks, but supplies no details on trigger-inversion objective, convergence criteria, false-positive rates, or how baselines were adapted, leaving the central claim difficult to evaluate without the full method description.
  2. The comparison to adapted trigger-inversion baselines reports markedly higher AUROC (0.973 vs 0.495/0.687), but without specifying adaptation details or the exact metric/threshold for declaring a reconstructed trigger 'effective', it is unclear whether the performance gap is attributable to CI or to differences in experimental protocol.
  3. The method assumes a sufficient pool of unlabeled OOD images enables reliable per-class trigger reconstruction; no sensitivity analysis is provided on the minimum number of images, choice of OOD distribution, or false-positive behavior on clean models, which is load-bearing for the 94% accuracy and practical-applicability claims.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our method and results. We address each major comment below with references to the full manuscript and indicate planned revisions.

read point-by-point responses
  1. Referee: The abstract reports strong quantitative results (94% accuracy, AUROC 0.973) across ten datasets and four attacks, but supplies no details on trigger-inversion objective, convergence criteria, false-positive rates, or how baselines were adapted, leaving the central claim difficult to evaluate without the full method description.

    Authors: We agree the abstract is concise and omits key details. The trigger-inversion objective (maximizing target-class confidence via OOD images) and one-epoch convergence are specified in Section 3.2; false-positive behavior on clean models appears in Section 4.3; baseline adaptations are described in Section 4.2. To improve standalone readability we will expand the abstract with one sentence on the objective and convergence criterion while retaining the quantitative claims. revision: partial

  2. Referee: The comparison to adapted trigger-inversion baselines reports markedly higher AUROC (0.973 vs 0.495/0.687), but without specifying adaptation details or the exact metric/threshold for declaring a reconstructed trigger 'effective', it is unclear whether the performance gap is attributable to CI or to differences in experimental protocol.

    Authors: We acknowledge the need for explicit protocol details. Section 4.2 states that baselines were re-implemented with the identical 1,000-image OOD pool and per-class reconstruction loop; an effective trigger is defined as one yielding attack success rate >90% on held-out triggered samples, with detection threshold set by AUROC on reconstruction loss. We will insert a new paragraph in Section 4.2 that tabulates the exact adaptation steps and the >90% ASR threshold used for all methods (this decision rule is illustrated in the sketch after these responses). revision: yes

  3. Referee: The method assumes a sufficient pool of unlabeled OOD images enables reliable per-class trigger reconstruction; no sensitivity analysis is provided on the minimum number of images, choice of OOD distribution, or false-positive behavior on clean models, which is load-bearing for the 94% accuracy and practical-applicability claims.

    Authors: This observation is correct; the current manuscript reports results only for the 1,000-image setting across ten datasets without a dedicated sensitivity study. We will add a new subsection (4.4) containing (i) ablation curves for 100/500/1,000/2,000 OOD images, (ii) results using two additional OOD distributions (ImageNet subsets and synthetic noise), and (iii) explicit false-positive rates on the three clean models. These additions directly address the load-bearing assumptions. revision: yes
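As a rough illustration of the decision rule described in response 2 (an anomalously low reconstruction loss paired with an attack success rate above 90 percent), here is one way the per-class scoring could be assembled; the helper functions, the two-sigma loss cutoff, and the loader names are hypothetical, not the paper's exact protocol.

```python
import torch

@torch.no_grad()
def attack_success_rate(model, heldout_loader, delta, target_class,
                        device="cuda"):
    """Fraction of held-out OOD images pushed into `target_class` by the
    reconstructed trigger. Hypothetical helper, not the paper's code."""
    model.eval()
    hits, total = 0, 0
    for images in heldout_loader:
        images = images.to(device)
        preds = model((images + delta.to(device)).clamp(0, 1)).argmax(dim=1)
        hits += (preds == target_class).sum().item()
        total += images.size(0)
    return hits / total

def flag_backdoored(per_class, asr_threshold=0.9):
    """per_class: list of (class_idx, recon_loss, asr) triples.
    Flags classes pairing an unusually low reconstruction loss with a
    high ASR; the two-sigma cutoff is an assumption for illustration."""
    losses = torch.tensor([loss for _, loss, _ in per_class])
    cutoff = (losses.mean() - 2 * losses.std()).item()
    return [c for c, loss, asr in per_class
            if loss < cutoff and asr > asr_threshold]
```

A non-empty flag list would mark the model as backdoored; sweeping the loss threshold over a pool of clean and poisoned models is the kind of protocol that produces the AUROC comparison quoted above.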

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes CLIP-Inspector as an empirical trigger-reconstruction procedure that operates on white-box prompt-tuned CLIP models and a pool of unlabeled OOD images. Detection is performed by attempting to invert class-specific triggers and observing whether the reconstructed triggers induce backdoor behavior; success is measured directly against held-out data and external baselines across ten datasets and four attacks. No equations, parameters, or decision rules are defined in terms of the same data used for final evaluation, and no load-bearing steps reduce to self-citation chains or fitted inputs renamed as predictions. The central claim therefore remains an independent empirical verification method rather than a definitional or self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions about access and data availability plus the empirical effectiveness of gradient-based trigger search; no new entities are postulated and no free parameters are explicitly fitted in the abstract description.

axioms (2)
  • domain assumption White-box access to the delivered prompt-tuned model is available
    Explicitly stated as the operating assumption for trigger reconstruction.
  • domain assumption A pool of unlabeled OOD images exists and is representative enough for trigger search
    Required for the single-epoch reconstruction step described.

pith-pipeline@v0.9.0 · 5641 in / 1485 out tokens · 45017 ms · 2026-05-10T17:47:43.071734+00:00 · methodology

