Closed-Loop Bidirectional Prompting for Adversarial Robustness of Vision Language Models

Boci Peng; Boren Hu; Jiaxiang Liu; Liming Zhang; Mingkun Xu; Prayag Tiwari; Xiao Liu; Xiwen Chen; Yusong Wang

arxiv: 2605.25922 · v1 · pith:X2TI5GDCnew · submitted 2026-05-25 · 💻 cs.CV

Pith reviewed 2026-06-29 22:37 UTC · model grok-4.3

classification 💻 cs.CV

keywords adversarial robustness · vision language models · bidirectional prompting · semantic anchor · cross-modal alignment · closed-loop adaptation · prompt tuning · frozen encoders

The pith

Closed-Loop Bidirectional Prompting recovers robust cross-modal consensus in vision-language models by using a semantic anchor in a dynamic feedback loop.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision language models lose alignment between images and text under adversarial perturbations that target their cross-modal understanding. Existing defenses remain mostly one-way or fixed and miss the chance to let each modality clean up the other in an adaptive way. The paper presents Closed-Loop Bidirectional Prompting that runs a feedback process anchored by a stable semantic prior: text semantics refine visual features and the cleaned visuals then update the text prompts on the fly. This occurs on frozen encoders without retraining. A reader would care because the resulting consensus improves defense and task adaptation while controlling extra computation.

Core claim

By introducing a Semantic Anchor as a stable prior, the closed-loop bidirectional prompting allows textual semantics to denoise visual representations and refined visuals to enable instance-adaptive prompt updating on frozen encoders, resulting in a rectified and robust consensus that recovers cross-modal agreement under adversarial perturbations.

What carries the argument

The Semantic Anchor, a stable prior that constrains cyclic updates in the bidirectional feedback loop to mitigate perturbation-induced feature corruption.
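
The loop described above can be pictured with a toy sketch. Everything in it is an assumption made for illustration: the function names (`t2v_denoise`, `v2t_refine`, `closed_loop_predict`), the mixing weights `alpha` and `beta`, the temperature, and the update rules themselves are hypothetical, since the abstract does not state the paper's actual updates. The sketch only shows the shape of the idea: a fixed text anchor pulls a perturbed visual feature back toward class semantics, and the cleaned visual feature then nudges the per-instance text prompts while the anchor limits how far they move.

```python
import numpy as np

def normalize(x, axis=-1):
    """Scale vectors to unit L2 norm along `axis`."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def t2v_denoise(z_img, anchor, alpha=0.5, temp=0.1):
    """Text-to-vision step (hypothetical rule): mix the image feature with
    an attention-weighted summary of the fixed class anchors."""
    weights = softmax(anchor @ z_img / temp)   # shape (C,)
    semantic = weights @ anchor                # shape (D,)
    return normalize((1 - alpha) * z_img + alpha * semantic)

def v2t_refine(z_clean, anchor, beta=0.2):
    """Vision-to-text step (hypothetical rule): shift each class prompt
    toward the cleaned image feature as a convex mix with the anchor."""
    return normalize((1 - beta) * anchor + beta * z_clean[None, :])

def closed_loop_predict(z_img, anchor, alpha=0.5, beta=0.2):
    """One pass of the loop: denoise the image feature, refine the prompts,
    then classify by cosine similarity."""
    z_clean = t2v_denoise(normalize(z_img), anchor, alpha)
    prompts = v2t_refine(z_clean, anchor, beta)
    logits = prompts @ z_clean
    return int(np.argmax(logits)), logits

# Toy usage: three classes with orthonormal anchors in a 4-d feature space.
anchor = np.eye(3, 4)
z = np.array([0.3, 1.0, 0.2, 0.1])   # noisy feature closest to class 1
pred, logits = closed_loop_predict(z, anchor)
```

Because the prompts are a convex combination with weight `1 - beta` on the anchor, they cannot drift arbitrarily far from it in this toy version; whether the paper's mechanism gives a comparable guarantee is not something the abstract establishes.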

If this is right

State-of-the-art adversarial robustness is achieved across 11 datasets.
Strong base-to-new generalization holds in the robust setting.
A favorable trade-off between computational cost and accuracy is preserved.
Robust adaptation is realized as cross-modal agreement recovery via dynamic feedback on frozen encoders.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The instance-adaptive loop may extend to other multimodal models that suffer alignment attacks.
Reduced reliance on large adversarial training sets could lower defense costs in practice.
Further tests on larger-scale models would show whether the anchor remains sufficient to prevent loop divergence.
Combination with existing prompt-tuning techniques might improve the efficiency of the feedback process.

Load-bearing premise

Bidirectional cross-modal complementarity exists and can be safely exploited in a dynamic feedback loop with a Semantic Anchor to achieve robust consensus without introducing instability or new vulnerabilities.

What would settle it

Experiments showing that the closed-loop updates reduce robustness on standard adversarial benchmarks, such as ImageNet under PGD attacks, or cause a performance drop on clean inputs would disprove the claim of stable robust consensus.
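
For readers unfamiliar with the attack named here, the following is a minimal sketch of ℓ∞ projected gradient descent (PGD) against a toy linear softmax classifier. It is not the paper's evaluation code: the linear model `W`, the function names, and the step size are assumptions chosen so the example stays self-contained, while the default budget of 1/255 and 100 steps mirrors the PGD-100 setting quoted in the figure captions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ce_grad_wrt_input(W, x, y):
    """Gradient of cross-entropy(softmax(W @ x), y) with respect to x."""
    p = softmax(W @ x)
    p[y] -= 1.0            # d(loss)/d(logits) = p - onehot(y)
    return W.T @ p

def pgd_linf(W, x, y, eps=1/255, step=0.25/255, n_steps=100, rng=None):
    """L-infinity PGD: random start, signed-gradient ascent on the loss,
    then projection into the eps-ball and the pixel range [0, 1]."""
    rng = np.random.default_rng(0) if rng is None else rng
    x_adv = np.clip(x + rng.uniform(-eps, eps, size=x.shape), 0.0, 1.0)
    for _ in range(n_steps):
        g = ce_grad_wrt_input(W, x_adv, y)
        x_adv = x_adv + step * np.sign(g)
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project to the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)           # keep a valid image
    return x_adv
```

A falsification run in the sense described above would compare accuracy on `pgd_linf` outputs with and without the closed-loop updates, and separately check clean accuracy.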

Figures

Figures reproduced from arXiv: 2605.25922 by Boci Peng, Boren Hu, Jiaxiang Liu, Liming Zhang, Mingkun Xu, Prayag Tiwari, Xiao Liu, Xiwen Chen, Yusong Wang.

**Figure 1.** Comparison of prompt-learning paradigms. (a) AdvVLP inserts independent prompts in each modality with no cross-modal interaction (prone to overfitting). (b) AdvMaPLe adds a unidirectional text-to-vision projection but remains static, lacking visual feedback for instance-specific adaptation. (c) Our CLBP introduces a dynamic cross-modal feedback loop: lightweight networks align features across modalities to…

**Figure 2.** The CLBP pipeline. Step 0 (Semantic Bootstrapping) initializes a fixed text anchor; Step 1 (Active Visual Denoising, T2V) filters image noise using the anchor; Step 2 (Instance-Adaptive Refinement, V2T) injects visual bias back into the text prompt, closing the loop. Final predictions are obtained by Multi-View Aggregation across augmented views; the model is trained with a tri-component loss.

**Figure 3.** Few-shot accuracy averaged over 11 datasets for n ∈ {1, 2, 4, 8, 16} shots/class. Training: 10 epochs, 2-step PGD at ϵ_train = 1/255. Top: Top-1 clean; Bottom: Top-1 PGD-100 robust accuracy at ϵ_a = 1/255.

… 3.06% above the strongest test-time defence SCC (49.71%), and roughly twice the strongest training-based baselines (AdvMaPLe 24.37%, FAP 23.96%, PMG-AFT 22.97%). Clean accuracy is 59.99%, 0.39% above vanilla CLI…

**Figure 4.** Ablation on attack strength, training losses and view budget. All panels use the 16-shot CLBP model (10 epochs, 2-step PGD at ϵ_train = 1/255). (a) Top-1 PGD-100 robust accuracy at ϵ_a = 1/255 on OxfordPets and DTD as the number of PGD attack episodes (independent restarts) varies in {1, 2, 4, 8}. (b) Impact of loss components L_ce, L_kl, L_mar (Section 3.2) on Top-1 clean / PGD-100 robust accuracy (ϵ_a = 1/255) on D…

**Figure 5.** Representation geometry drift by layer. Layer-wise feature consistency measured by Linear Centered Kernel Alignment (CKA) between the pre-trained CLIP encoder and adversarially fine-tuned variants on ImageNet. The drop in similarity in middle layers (blocks 4–8) indicates geometric distortion introduced during robust fine-tuning, more pronounced under stronger training budgets (ϵ_train = 4/255).

**Figure 6.** Internal feature stability under adversarial perturbations. We compare ℓ2 distances Δ_M = ‖z_M(x_adv) − z_M(x_clean)‖₂ between clean and adversarial visual features within the same model M ∈ {CLIP, CLBP}, with x_adv generated by PGD-100 at ϵ_a = 1/255 in ℓ∞ on DTD. Most samples satisfy Δ_CLBP < Δ_CLIP (points below the diagonal in (a)), and CLBP reduces both the median and variance of feature shifts (b), demonstrating…

**Figure 7.** Semantic rectification analysis on DTD. 16-shot CLBP (10 epochs, 2-step PGD at ϵ_train = 1/255) evaluated under PGD-100 at ϵ_a = 1/255 in ℓ∞. (a) Logit margin distributions ℓ_y − max over c ≠ y of ℓ_c over the test set: clean (CLIP), adversarial (CLIP), and adversarial after the closed-loop denoising (CLBP). (b) Cosine-similarity decomposition into the correct-class similarity cos(z, w_y) and the strongest competitor similar…

**Figure 8.** Robust accuracy and per-image latency of CLBP on DTD as a function of the augmented-view count V. Trained on 16-shot data with 2-step PGD at ϵ_train = 1/255, α = 1/255 for 10 epochs; evaluated under PGD-100 at the stronger ϵ_a = 4/255 in ℓ∞. All six points are measured on the same trained model.

E Multi-View Aggregation: View-Count Ablation. We expand each input image into V random augmentations {x^(v)} (v = 1, ..., V) and …
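
The diagnostics named in the captions (linear CKA in Figure 5, the ℓ2 feature shift in Figure 6, the logit margin in Figure 7, and multi-view aggregation in Figure 8) can be sketched in a few lines of NumPy. The function names are hypothetical. The linear CKA and margin formulas are the standard ones; the aggregation rule (averaging softmax probabilities over views) is an assumption, since the captions only say predictions are aggregated across augmented views.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two feature matrices with one row per sample."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

def feature_shift(z_adv, z_clean):
    """Per-sample L2 distance between adversarial and clean features."""
    return np.linalg.norm(z_adv - z_clean, axis=1)

def logit_margin(logits, y):
    """True-class logit minus the strongest competing logit, per sample."""
    idx = np.arange(len(y))
    rest = np.array(logits, dtype=float)   # copy so the input is untouched
    true = rest[idx, y].copy()
    rest[idx, y] = -np.inf                 # mask out the true class
    return true - rest.max(axis=1)

def aggregate_views(view_logits):
    """Average softmax probabilities over V views (assumed rule).
    `view_logits` has shape (V, C); returns a length-C probability vector."""
    z = view_logits - view_logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return p.mean(axis=0)
```

A positive margin means the sample is classified correctly, so a margin distribution that shifts back toward positive values after denoising is the kind of evidence Figure 7 describes.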

Original abstract

Vision Language Models adapt well to downstream tasks but are highly vulnerable to adversarial perturbations that disrupt cross-modal semantic alignment. Existing defenses are largely unidirectional or structural, failing to exploit bidirectional cross-modal complementarity and instance-wise adaptive protection. To overcome the limitations of unidirectional and static defenses in adversarial settings, we propose Closed-Loop Bidirectional Prompting, casting robust adaptation as cross-modal agreement recovery via a dynamic feedback loop on frozen encoders. A Semantic Anchor is introduced as a stable prior to constrain cyclic updates and mitigate perturbation-induced feature corruption. Through anchor-based bootstrapping, textual semantics denoise visual representations, while the refined visuals enable instance-adaptive prompt updating, yielding a rectified and robust consensus. Extensive evaluations across 11 datasets validate state-of-the-art robustness and strong base-to-new generalization, while maintaining a favorable trade-off between computational cost and accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Closed-loop bidirectional prompting with a semantic anchor is a distinct angle on VLM robustness, but the abstract supplies no evidence to back the SOTA claims or the stability of the loop.

The main point is a method that sets up a dynamic feedback loop between text and vision prompts on frozen encoders, using a semantic anchor to recover cross-modal agreement under adversarial perturbations. This differs from the unidirectional or static defenses the abstract references.

The paper does a reasonable job identifying the limitation of existing approaches and framing robustness as instance-adaptive cross-modal bootstrapping. The direction makes sense in principle: textual semantics could denoise visuals, and cleaner visuals could refine prompts without retraining the whole model.

The soft spot is the complete absence of supporting data. The abstract asserts extensive evaluations across 11 datasets, state-of-the-art robustness, and good base-to-new generalization, yet shows no numbers, controls, error bars, or even a sketch of the update rules. The stress-test concern about possible divergence or self-reinforcing adversarial features in the loop is not resolved by anything visible here.

Without the actual experiments or a bound showing the anchor dominates the updates, the central claim stays unverified. The soundness score from the abstract-only read is low for exactly this reason.

This is for people working on practical adversarial defenses for vision-language models. A reader already following prompt-based or cross-modal robustness work would see the most direct value if the full results and implementation hold up.

It deserves peer review because the problem is timely and the bidirectional loop is a clear departure from prior defenses, even if heavy revision is likely needed once the experiments are examined.

Referee Report

2 major / 1 minor

Summary. The paper proposes Closed-Loop Bidirectional Prompting (CLBP) to enhance adversarial robustness in Vision-Language Models. It casts robust adaptation as cross-modal agreement recovery via a dynamic feedback loop on frozen encoders, introducing a Semantic Anchor as a stable prior to constrain cyclic updates. Through anchor-based bootstrapping, textual semantics denoise visual representations while refined visuals enable instance-adaptive prompt updating, yielding a rectified robust consensus. The work claims state-of-the-art robustness and strong base-to-new generalization across 11 datasets with a favorable computational cost-accuracy trade-off.

Significance. If the closed-loop bidirectional mechanism with the Semantic Anchor delivers stable cross-modal denoising without introducing instability or new vulnerabilities, the result would be significant for the field. It offers a novel paradigm for exploiting bidirectional complementarity in an instance-adaptive manner on frozen encoders, moving beyond unidirectional or static defenses, and could influence prompt-based robustness techniques if the empirical claims hold.

major comments (2)

[Abstract] The central claim that the Semantic Anchor constrains cyclic updates to mitigate perturbation-induced feature corruption and yield a rectified consensus relies on the unverified stability of the closed loop. No derivation, bound, or analysis is supplied showing that corruption in one modality cannot amplify through the feedback (text denoising visuals → visual-driven prompt update), which is load-bearing for the robustness guarantee.
[Abstract] The assertion of 'extensive evaluations across 11 datasets' validating SOTA robustness and generalization supplies no data, tables, controls, error bars, baselines, or implementation details, preventing verification of whether the empirical results support the claims or the cross-modal complementarity assumption.

minor comments (1)

[Abstract] The abstract refers to a 'favorable trade-off between computational cost and accuracy' without specifying the exact metrics, baselines, or datasets used for this assessment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below.

Referee: [Abstract] The central claim that the Semantic Anchor constrains cyclic updates to mitigate perturbation-induced feature corruption and yield a rectified consensus relies on the unverified stability of the closed loop. No derivation, bound, or analysis is supplied showing that corruption in one modality cannot amplify through the feedback (text denoising visuals → visual-driven prompt update), which is load-bearing for the robustness guarantee.

Authors: We agree that a formal stability analysis or bound is not present in the manuscript. The Semantic Anchor is constructed from unperturbed textual features to serve as a fixed reference, and the closed loop is implemented with frozen encoders to limit drift, but these design choices are justified empirically rather than theoretically. In revision we will add a dedicated paragraph in the method section analyzing stability via iteration ablations and a simple contraction argument based on the anchor constraint.
revision: partial

Referee: [Abstract] The assertion of 'extensive evaluations across 11 datasets' validating SOTA robustness and generalization supplies no data, tables, controls, error bars, baselines, or implementation details, preventing verification of whether the empirical results support the claims or the cross-modal complementarity assumption.

Authors: The abstract is a concise summary and therefore omits tables and raw data by design. The full manuscript contains all requested elements in Sections 4 and 5: tables reporting accuracy under multiple attacks on 11 datasets, error bars from repeated runs, baseline comparisons, ablation controls, and full implementation details. These results directly support the SOTA and generalization claims as well as the benefit of bidirectional complementarity.
revision: no
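
A contraction argument of the kind promised in the first response could take roughly the following form. This is a sketch under stated assumptions, not the authors' analysis: the joint state s, the anchor weight λ, and the Lipschitz constant L of the refinement map F are all hypothetical quantities introduced here, and nothing in the abstract says the actual update has this convex-combination structure.

```latex
% Hypothetical loop update: a is the fixed Semantic Anchor, \lambda \in (0,1] an
% assumed anchor weight, F an assumed L-Lipschitz refinement map.
s_{t+1} = (1-\lambda)\,F(s_t) + \lambda\,a

% Two trajectories s_t and s'_t driven by the same map:
\|s_{t+1} - s'_{t+1}\| = (1-\lambda)\,\|F(s_t) - F(s'_t)\|
                       \le (1-\lambda)\,L\,\|s_t - s'_t\|

% If (1-\lambda)L < 1, the update is a contraction with a unique fixed point
% s^\star (Banach fixed-point theorem), and
\|s_t - s^\star\| \le \big((1-\lambda)L\big)^{t}\,\|s_0 - s^\star\|
```

Under these assumptions a larger anchor weight tolerates a less well-behaved refinement map. Whether the paper's update satisfies any such condition is exactly what the referee's first comment asks the authors to show.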

Circularity Check

0 steps flagged

No significant circularity detected

The paper's derivation chain, as described in the abstract, introduces a novel Closed-Loop Bidirectional Prompting method with a Semantic Anchor as a stable prior for constraining cyclic updates on frozen encoders. No equations or self-citations are quoted that reduce any claimed prediction, bootstrapping result, or consensus to fitted inputs or prior author work by construction. The central claim of rectified robust consensus via anchor-based textual-visual denoising is presented as an independent construction without tautological reduction to its own definitions or data fits. Evaluations across 11 datasets serve as external validation, rendering the approach self-contained against benchmarks rather than circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Review based on abstract only; no explicit free parameters, axioms, or invented entities detailed beyond the introduced Semantic Anchor concept.

invented entities (1)

Semantic Anchor · no independent evidence
purpose: Stable prior to constrain cyclic updates and mitigate perturbation-induced feature corruption
Introduced in the proposal as a key component for the feedback loop.


2020

[9] [9]

Imagenet: A large- scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

2009

[10] [10]

One-shot learning of object categories.IEEE transactions on pattern analysis and machine intelligence, 28(4):594–611, 2006

Li Fei-Fei, Robert Fergus, and Pietro Perona. One-shot learning of object categories.IEEE transactions on pattern analysis and machine intelligence, 28(4):594–611, 2006

2006

[11] [11]

Explaining and Harnessing Adversarial Examples

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversar- ial examples.arXiv preprint arXiv:1412.6572, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[12] [12]

Bootstrap your own latent-a new approach to self-supervised learning.Advances in neural information processing systems, 33:21271–21284, 2020

Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning.Advances in neural information processing systems, 33:21271–21284, 2020

2020

[13] [13]

Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019

2019

[14] [14]

Visual prompt tuning

Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. InEuropean conference on computer vision, pages 709–727. Springer, 2022

2022

[15] [15]

Maple: Multi-modal prompt learning

Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fa- had Shahbaz Khan. Maple: Multi-modal prompt learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19113–19122, 2023

2023

[16] [16]

Vilt: Vision-and-language transformer without convolution or region supervision

Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-and-language transformer without convolution or region supervision. InInternational conference on machine learning, pages 5583–5594. PMLR, 2021. 10

2021

[17] [17]

Patch is enough: naturalistic adversarial patch against vision-language pre-training models.Visual Intelligence, 2(1):33, 2024

Dehong Kong, Siyuan Liang, Xiaopeng Zhu, Yuansheng Zhong, and Wenqi Ren. Patch is enough: naturalistic adversarial patch against vision-language pre-training models.Visual Intelligence, 2(1):33, 2024

2024

[18] [18]

3d object representations for fine- grained categorization

Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine- grained categorization. InProceedings of the IEEE international conference on computer vision workshops, pages 554–561, 2013

2013

[19] [19]

Blip: Bootstrapping language- image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language- image pre-training for unified vision-language understanding and generation. InInternational conference on machine learning, pages 12888–12900. PMLR, 2022

2022

[20] [20]

One prompt word is enough to boost adversarial robustness for pre-trained vision-language models

Lin Li, Haoyan Guan, Jianing Qiu, and Michael Spratling. One prompt word is enough to boost adversarial robustness for pre-trained vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24408–24419, 2024

2024

[21] [21]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

2023

[22] [22]

Self-calibrated consis- tency can fight back for adversarial robustness in vision-language models.arXiv preprint arXiv:2510.22785, 2025

Jiaxiang Liu, Jiawei Du, Xiao Liu, Prayag Tiwari, and Mingkun Xu. Self-calibrated consis- tency can fight back for adversarial robustness in vision-language models.arXiv preprint arXiv:2510.22785, 2025

work page arXiv 2025

[23] [23]

Kpl: Training-free medical knowledge mining of vision-language models

Jiaxiang Liu, Tianxiang Hu, Jiawei Du, Ruiyuan Zhang, Joey Tianyi Zhou, and Zuozhu Liu. Kpl: Training-free medical knowledge mining of vision-language models. InProceedings of the AAAI Conference on Artificial Intelligence, pages 18852–18860, 2025

2025

[24] [24]

Towards Deep Learning Models Resistant to Adversarial Attacks

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks.arXiv preprint arXiv:1706.06083, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[25] [25]

Fine-Grained Visual Classification of Aircraft

Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine- grained visual classification of aircraft.arXiv preprint arXiv:1306.5151, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013

[26] [26]

Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models

Hashmat Shadab Malik, Fahad Shamshad, Muzammal Naseer, Karthik Nandakumar, Fahad Khan, and Salman Khan. Robust-llava: On the effectiveness of large-scale robust image encoders for multi-modal large language models.arXiv preprint arXiv:2502.01576, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[27] [27]

Understanding zero-shot adversarial robustness for large-scale models.arXiv preprint arXiv:2212.07016, 2022

Chengzhi Mao, Scott Geng, Junfeng Yang, Xin Wang, and Carl V ondrick. Understanding zero-shot adversarial robustness for large-scale models.arXiv preprint arXiv:2212.07016, 2022

work page arXiv 2022

[28] [28]

Automated flower classification over a large number of classes

Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In2008 Sixth Indian conference on computer vision, graphics & image processing, pages 722–729. IEEE, 2008

2008

[29] [29]

Cats and dogs

Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In2012 IEEE conference on computer vision and pattern recognition, pages 3498–3505. IEEE, 2012

2012

[30] [30]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

2021

[31] [31]

Robust clip: Unsupervised adversarial fine-tuning of vision embeddings for robust large vision-language models.arXiv preprint arXiv:2402.12336, 2024

Christian Schlarmann, Naman Deep Singh, Francesco Croce, and Matthias Hein. Robust clip: Unsupervised adversarial fine-tuning of vision embeddings for robust large vision-language models.arXiv preprint arXiv:2402.12336, 2024

work page arXiv 2024

[32] [32]

R-tpt: Improving adversarial robustness of vision-language models through test-time prompt tuning

Lijun Sheng, Jian Liang, Zilei Wang, and Ran He. R-tpt: Improving adversarial robustness of vision-language models through test-time prompt tuning. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 29958–29967, 2025

2025

[33] [33]

Test-time prompt tuning for zero-shot generalization in vision-language models

Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, and Chaowei Xiao. Test-time prompt tuning for zero-shot generalization in vision-language models. Advances in Neural Information Processing Systems, 35:14274–14289, 2022. 11

2022

[34] [34]

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild.arXiv preprint arXiv:1212.0402, 2012

work page internal anchor Pith review Pith/arXiv arXiv 2012

[35] [35]

Pre-trained model guided fine-tuning for zero-shot adversarial robustness

Sibo Wang, Jie Zhang, Zheng Yuan, and Shiguang Shan. Pre-trained model guided fine-tuning for zero-shot adversarial robustness. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24502–24511, 2024

2024

[36] [36]

Improving zero-shot generalization for clip with synthesized prompts.arXiv preprint arXiv:2307.07397, 2023

Zhengbo Wang, Jian Liang, Ran He, Nan Xu, Zilei Wang, and Tieniu Tan. Improving zero-shot generalization for clip with synthesized prompts.arXiv preprint arXiv:2307.07397, 2023

work page arXiv 2023

[37] [37]

Sun database: Large-scale scene recognition from abbey to zoo

Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In2010 IEEE computer society conference on computer vision and pattern recognition, pages 3485–3492. IEEE, 2010

2010

[38] [38]

Clip is strong enough to fight back: Test-time counterattacks towards zero-shot adversarial robustness of clip

Songlong Xing, Zhengyu Zhao, and Nicu Sebe. Clip is strong enough to fight back: Test-time counterattacks towards zero-shot adversarial robustness of clip. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 15172–15182, 2025

2025

[39] [39]

Filip: Fine-grained interactive language-image pre-training.arXiv preprint arXiv:2111.07783, 2021

Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. Filip: Fine-grained interactive language-image pre-training.arXiv preprint arXiv:2111.07783, 2021

work page arXiv 2021

[40] [40]

C-tpt: Calibrated test-time prompt tuning for vision-language models via text feature dispersion.arXiv preprint arXiv:2403.14119, 2024

Hee Suk Yoon, Eunseop Yoon, Joshua Tian Jin Tee, Mark Hasegawa-Johnson, Yingzhen Li, and Chang D Yoo. C-tpt: Calibrated test-time prompt tuning for vision-language models via text feature dispersion.arXiv preprint arXiv:2403.14119, 2024

work page arXiv 2024

[41] [41]

Maxime Zanella and Ismail Ben Ayed. On the test-time zero-shot generalization of vision- language models: Do we really need prompt learning? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23783–23793, 2024

2024

[42] [42]

A simple framework for open-vocabulary segmentation and detection

Hao Zhang, Feng Li, Xueyan Zou, Shilong Liu, Chunyuan Li, Jianwei Yang, and Lei Zhang. A simple framework for open-vocabulary segmentation and detection. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 1020–1031, 2023

2023

[43] [43]

Adversarial prompt tuning for vision-language models

Jiaming Zhang, Xingjun Ma, Xin Wang, Lingyu Qiu, Jiaqi Wang, Yu-Gang Jiang, and Jitao Sang. Adversarial prompt tuning for vision-language models. InEuropean conference on computer vision, pages 56–72. Springer, 2024

2024

[44] [44]

A multimodal biomedical foundation model trained from fifteen million image–text pairs.NEJM AI, 2(1):AIoa2400640, 2025

Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, et al. A multimodal biomedical foundation model trained from fifteen million image–text pairs.NEJM AI, 2(1):AIoa2400640, 2025

2025

[45] [45]

Extract free dense labels from clip

Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from clip. InEuropean conference on computer vision, pages 696–712. Springer, 2022

2022

[46] [46]

Conditional prompt learning for vision-language models

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16816–16825, 2022

2022

[47] [47]

Learning to prompt for vision-language models.International Journal of Computer Vision, 130(9):2337–2348, 2022

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models.International Journal of Computer Vision, 130(9):2337–2348, 2022

2022

[48] [48]

Few-shot adversarial prompt learning on vision-language models.Advances in Neural Information Processing Systems, 37: 3122–3156, 2024

Yiwei Zhou, Xiaobo Xia, Zhiwei Lin, Bo Han, and Tongliang Liu. Few-shot adversarial prompt learning on vision-language models.Advances in Neural Information Processing Systems, 37: 3122–3156, 2024. 12 A Inference Pipeline A.1 Closed-Loop Inference with Multi-View Aggregation Algorithm 1CLBP Inference with Closed Loop and Multi-View Aggregation Require: Da...

work page arXiv 2024