arxiv: 2605.25922 · v1 · pith:X2TI5GDCnew · submitted 2026-05-25 · 💻 cs.CV

Closed-Loop Bidirectional Prompting for Adversarial Robustness of Vision Language Models

Pith reviewed 2026-06-29 22:37 UTC · model grok-4.3

classification 💻 cs.CV
keywords adversarial robustness · vision language models · bidirectional prompting · semantic anchor · cross-modal alignment · closed-loop adaptation · prompt tuning · frozen encoders

The pith

Closed-Loop Bidirectional Prompting recovers robust cross-modal consensus in vision-language models by using a semantic anchor in a dynamic feedback loop.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision language models lose alignment between images and text under adversarial perturbations that target their cross-modal understanding. Existing defenses remain mostly one-way or fixed and miss the chance to let each modality clean up the other in an adaptive way. The paper presents Closed-Loop Bidirectional Prompting that runs a feedback process anchored by a stable semantic prior: text semantics refine visual features and the cleaned visuals then update the text prompts on the fly. This occurs on frozen encoders without retraining. A reader would care because the resulting consensus improves defense and task adaptation while controlling extra computation.

Core claim

By introducing a Semantic Anchor as a stable prior, the closed-loop bidirectional prompting allows textual semantics to denoise visual representations and refined visuals to enable instance-adaptive prompt updating on frozen encoders, resulting in a rectified and robust consensus that recovers cross-modal agreement under adversarial perturbations.
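
The loop described in the core claim can be illustrated with a toy sketch. This is not the paper's implementation: the function name `closed_loop_predict`, the mixing weights `alpha` and `beta`, the temperature `temp`, the softmax-weighted form of both updates, and the number of passes are all assumptions chosen for illustration. Only the overall pattern (a fixed text anchor, a text-to-vision denoising step, then a vision-to-text prompt update that is always rebuilt from the anchor) comes from the summary.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def closed_loop_predict(img_feat, anchor, steps=2, alpha=0.5, beta=0.1, temp=0.07):
    """Toy closed loop (hypothetical, not the authors' method).

    img_feat: (d,) visual feature, possibly adversarially perturbed.
    anchor:   (C, d) fixed class text features acting as the stable prior.
    """
    anchor = l2_normalize(anchor)
    text = anchor.copy()
    v = l2_normalize(img_feat)
    for _ in range(steps):
        # Text-to-vision: pull the visual feature toward a
        # similarity-weighted mixture of the current text features.
        w = softmax(text @ v / temp)                      # (C,)
        v = l2_normalize((1 - alpha) * v + alpha * (w @ text))
        # Vision-to-text: rebuild each prompt feature from the anchor plus
        # a small instance-specific shift, so drift stays bounded by beta.
        w = softmax(text @ v / temp)
        text = l2_normalize(anchor + beta * w[:, None] * v[None, :])
    probs = softmax(text @ v / temp)
    return probs, v, text
```

Because the text features are recomputed from `anchor` on every pass rather than accumulated, the prompt update in this sketch can never move a class feature further than `beta` away from its anchor before renormalization, which is one simple way to read "the anchor constrains cyclic updates".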

What carries the argument

The Semantic Anchor, a stable prior that constrains cyclic updates in the bidirectional feedback loop to mitigate perturbation-induced feature corruption.

If this is right

  • State-of-the-art adversarial robustness is achieved across 11 datasets.
  • Strong base-to-new generalization holds in the robust setting.
  • A favorable trade-off between computational cost and accuracy is preserved.
  • Robust adaptation is realized as cross-modal agreement recovery via dynamic feedback on frozen encoders.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The instance-adaptive loop may extend to other multimodal models that suffer alignment attacks.
  • Reduced reliance on large adversarial training sets could lower defense costs in practice.
  • Further tests on larger-scale models would show whether the anchor remains sufficient to prevent loop divergence.
  • Combination with existing prompt-tuning techniques might improve the efficiency of the feedback process.

Load-bearing premise

Bidirectional cross-modal complementarity exists and can be safely exploited in a dynamic feedback loop with a Semantic Anchor to achieve robust consensus without introducing instability or new vulnerabilities.

What would settle it

Experiments showing that the closed-loop updates reduce robustness on standard adversarial benchmarks (for example, ImageNet under PGD attacks) or cause a performance drop on clean inputs would disprove the claim of a stable, robust consensus.
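
For readers unfamiliar with the attack named here, the following is a minimal sketch of an ℓ∞ PGD attack against a toy linear softmax classifier. The toy model, the function names, the default step size of `eps / 4`, and the zero-perturbation start are assumptions made for illustration; the figure captions only state that evaluation uses PGD-100 at ϵ_a = 1/255, so those two values are used as defaults. A real evaluation would attack the full vision-language model through automatic differentiation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ce_grad_wrt_input(W, b, x, y):
    """Gradient of the cross-entropy loss w.r.t. x for logits = W @ x + b."""
    p = softmax(W @ x + b)
    p[y] -= 1.0            # d(loss)/d(logits) = softmax - one_hot(y)
    return W.T @ p

def pgd_linf(W, b, x, y, eps=1 / 255, step=None, n_steps=100):
    """Projected gradient ascent on the loss inside an l_inf ball of radius eps."""
    step = eps / 4 if step is None else step   # assumed step size
    x_adv = x.copy()
    for _ in range(n_steps):
        g = ce_grad_wrt_input(W, b, x_adv, y)
        x_adv = x_adv + step * np.sign(g)            # signed ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)     # project onto the l_inf ball
        x_adv = np.clip(x_adv, 0.0, 1.0)             # keep a valid pixel range
    return x_adv
```

Robust accuracy is then the fraction of test inputs still classified correctly at `x_adv`; the settling experiment compares that number with and without the closed-loop updates.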

Figures

Figures reproduced from arXiv: 2605.25922 by Boci Peng, Boren Hu, Jiaxiang Liu, Liming Zhang, Mingkun Xu, Prayag Tiwari, Xiao Liu, Xiwen Chen, Yusong Wang.

Figure 1
Figure 1: Comparison of prompt-learning paradigms. (a) AdvVLP inserts independent prompts in each modality with no cross-modal interaction (prone to overfitting). (b) AdvMaPLe adds a unidirectional text-to-vision projection but remains static, lacking visual feedback for instance-specific adaptation. (c) Our CLBP introduces a dynamic cross-modal feedback loop: lightweight networks align features across modalities to…
Figure 2
Figure 2: The CLBP pipeline. Step 0 (Semantic Bootstrapping) initializes a fixed text anchor; Step 1 (Active Visual Denoising, T2V) filters image noise using the anchor; Step 2 (Instance-Adaptive Refinement, V2T) injects visual bias back into the text prompt, closing the loop. Final predictions are obtained by Multi-View Aggregation across augmented views; the model is trained with a tri-component loss. adversarial …
Figure 3
Figure 3: Few-shot accuracy averaged over 11 datasets for n ∈ {1, 2, 4, 8, 16} shots/class. Training: 10 epochs, 2-step PGD at ϵ_train = 1/255. Top: Top-1 clean; Bottom: Top-1 PGD-100 robust accuracy at ϵ_a = 1/255. 3.06% above the strongest test-time defence SCC (49.71%), and roughly twice the strongest training-based baselines (AdvMaPLe 24.37%, FAP 23.96%, PMG-AFT 22.97%). Clean accuracy is 59.99%, 0.39% above vanilla CLI…
Figure 4
Figure 4: Ablation on attack strength, training losses and view budget. All panels use the 16-shot CLBP model (10 epochs, 2-step PGD at ϵ_train = 1/255). (a) Top-1 PGD-100 robust accuracy at ϵ_a = 1/255 on Oxford-Pets and DTD as the number of PGD attack episodes (independent restarts) varies in {1, 2, 4, 8}. (b) Impact of loss components L_ce, L_kl, L_mar (Section 3.2) on Top-1 clean / PGD-100 robust accuracy (ϵ_a = 1/255) on D…
Figure 5
Figure 5: Representation geometry drift by layer. Layer-wise feature consistency measured by Linear Centered Kernel Alignment (CKA) between the pre-trained CLIP encoder and adversarially fine-tuned variants on ImageNet. The drop in similarity in middle layers (blocks 4–8) indicates geometric distortion introduced during robust fine-tuning, more pronounced under stronger training budgets (ϵ_train = 4/255). short prompts…
Figure 6
Figure 6: Internal feature stability under adversarial perturbations. We compare ℓ2 distances Δ_M = ∥z_M(x^adv) − z_M(x^clean)∥_2 between clean and adversarial visual features within the same model M ∈ {CLIP, CLBP}, with x^adv generated by PGD-100 at ϵ_a = 1/255 in ℓ∞ on DTD. Most samples satisfy Δ_CLBP < Δ_CLIP (points below the diagonal in (a)), and CLBP reduces both the median and variance of feature shifts (b), demonstrating…
Figure 7
Figure 7: Semantic rectification analysis on DTD. 16-shot CLBP (10 epochs, 2-step PGD at ϵ_train = 1/255) evaluated under PGD-100 at ϵ_a = 1/255 in ℓ∞. (a) Logit margin distributions ℓ_y − max_{c≠y} ℓ_c over the test set: clean (CLIP), adversarial (CLIP), and adversarial after the closed-loop denoising (CLBP). (b) Cosine-similarity decomposition into the correct-class similarity cos(z, w_y) and the strongest competitor similar…
Figure 8
Figure 8: Robust accuracy and per-image latency of CLBP on DTD as a function of the augmented-view count V. Trained on 16-shot data with 2-step PGD at ϵ_train = 1/255, α = 1/255 for 10 epochs; evaluated under PGD-100 at the stronger ϵ_a = 4/255 in ℓ∞. All six points are measured on the same trained model. E Multi-View Aggregation: View-Count Ablation We expand each input image into V random augmentations {x^(v)}, v = 1, …, V, and …
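
The captions for Figures 5 to 7 name three diagnostics precisely enough to sketch: linear CKA between two feature matrices, the ℓ2 feature shift between clean and adversarial features, and the logit margin. The following minimal NumPy sketch uses the standard definitions of these quantities. The function names and the array layout (one row per sample) are assumptions, and this is not the authors' evaluation code.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between feature matrices X (n, d1) and Y (n, d2).

    Both matrices are column-centered first; the score is 1.0 when the two
    representations match up to rotation and isotropic scaling.
    """
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

def feature_shift(z_adv, z_clean):
    """Per-sample l2 distance between adversarial and clean features."""
    return np.linalg.norm(np.asarray(z_adv) - np.asarray(z_clean), axis=-1)

def logit_margin(logits, y):
    """True-class logit minus the largest competing logit, per sample."""
    logits = np.asarray(logits, dtype=float)
    idx = np.arange(logits.shape[0])
    true = logits[idx, y]
    others = logits.copy()
    others[idx, y] = -np.inf      # exclude the true class from the max
    return true - others.max(axis=1)
```

A positive margin means the sample is classified correctly; a smaller `feature_shift` for one model than another on the same inputs corresponds to points below the diagonal in Figure 6(a).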
Original abstract

Vision Language Models adapt well to downstream tasks but are highly vulnerable to adversarial perturbations that disrupt cross-modal semantic alignment. Existing defenses are largely unidirectional or structural, failing to exploit bidirectional cross-modal complementarity and instance-wise adaptive protection. To overcome the limitations of unidirectional and static defenses in adversarial settings, we propose Closed-Loop Bidirectional Prompting, casting robust adaptation as cross-modal agreement recovery via a dynamic feedback loop on frozen encoders. A Semantic Anchor is introduced as a stable prior to constrain cyclic updates and mitigate perturbation-induced feature corruption. Through anchor-based bootstrapping, textual semantics denoise visual representations, while the refined visuals enable instance-adaptive prompt updating, yielding a rectified and robust consensus. Extensive evaluations across 11 datasets validate state-of-the-art robustness and strong base-to-new generalization, while maintaining a favorable trade-off between computational cost and accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Closed-Loop Bidirectional Prompting (CLBP) to enhance adversarial robustness in Vision-Language Models. It casts robust adaptation as cross-modal agreement recovery via a dynamic feedback loop on frozen encoders, introducing a Semantic Anchor as a stable prior to constrain cyclic updates. Through anchor-based bootstrapping, textual semantics denoise visual representations while refined visuals enable instance-adaptive prompt updating, yielding a rectified robust consensus. The work claims state-of-the-art robustness and strong base-to-new generalization across 11 datasets with a favorable computational cost-accuracy trade-off.

Significance. If the closed-loop bidirectional mechanism with the Semantic Anchor delivers stable cross-modal denoising without introducing instability or new vulnerabilities, the result would be significant for the field. It offers a novel paradigm for exploiting bidirectional complementarity in an instance-adaptive manner on frozen encoders, moving beyond unidirectional or static defenses, and could influence prompt-based robustness techniques if the empirical claims hold.

major comments (2)
  1. [Abstract] The central claim that the Semantic Anchor constrains cyclic updates to mitigate perturbation-induced feature corruption and yield a rectified consensus relies on the unverified stability of the closed loop. No derivation, bound, or analysis is supplied showing that corruption in one modality cannot amplify through the feedback (text denoising visuals → visual-driven prompt update), which is load-bearing for the robustness guarantee.
  2. [Abstract] The assertion of 'extensive evaluations across 11 datasets' validating SOTA robustness and generalization supplies no data, tables, controls, error bars, baselines, or implementation details, preventing verification of whether the empirical results support the claims or the cross-modal complementarity assumption.
minor comments (1)
  1. [Abstract] The abstract refers to a 'favorable trade-off between computational cost and accuracy' without specifying the exact metrics, baselines, or datasets used for this assessment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below.

Point-by-point responses
  1. Referee: [Abstract] The central claim that the Semantic Anchor constrains cyclic updates to mitigate perturbation-induced feature corruption and yield a rectified consensus relies on the unverified stability of the closed loop. No derivation, bound, or analysis is supplied showing that corruption in one modality cannot amplify through the feedback (text denoising visuals → visual-driven prompt update), which is load-bearing for the robustness guarantee.

    Authors: We agree that a formal stability analysis or bound is not present in the manuscript. The Semantic Anchor is constructed from unperturbed textual features to serve as a fixed reference, and the closed loop is implemented with frozen encoders to limit drift, but these design choices are justified empirically rather than theoretically. In revision we will add a dedicated paragraph in the method section analyzing stability via iteration ablations and a simple contraction argument based on the anchor constraint. revision: partial

  2. Referee: [Abstract] The assertion of 'extensive evaluations across 11 datasets' validating SOTA robustness and generalization supplies no data, tables, controls, error bars, baselines, or implementation details, preventing verification of whether the empirical results support the claims or the cross-modal complementarity assumption.

    Authors: The abstract is a concise summary and therefore omits tables and raw data by design. The full manuscript contains all requested elements in Sections 4 and 5: tables reporting accuracy under multiple attacks on 11 datasets, error bars from repeated runs, baseline comparisons, ablation controls, and full implementation details. These results directly support the SOTA and generalization claims as well as the benefit of bidirectional complementarity. revision: no
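
The first response promises "a simple contraction argument based on the anchor constraint" without stating it. One form such an argument could take is sketched here, purely as an assumption: model a single pass of the loop as a map $T$ that mixes an update $F$ with the fixed anchor $a$, where $F$ is $L$-Lipschitz and $\lambda$ is a hypothetical anchor weight. None of these symbols appear in the source, and the paper's actual update rule may not have this form.

```latex
% Assumed update rule: anchor-weighted mixing of a loop map F with the fixed anchor a
z_{t+1} = T(z_t) := (1-\lambda)\,F(z_t) + \lambda\,a, \qquad \lambda \in (0,1].

% The anchor term cancels when two trajectories are compared
\|T(z) - T(z')\| = (1-\lambda)\,\|F(z) - F(z')\| \le (1-\lambda)\,L\,\|z - z'\|.

% If (1-\lambda)L < 1, T is a contraction, so a state perturbation decays geometrically
\|z_t - z'_t\| \le \big((1-\lambda)L\big)^{t}\,\|z_0 - z'_0\|.
```

Under these assumptions the Banach fixed-point theorem gives a unique fixed point, and a larger anchor weight $\lambda$ tolerates a larger Lipschitz constant $L$. The bound covers only perturbations of the loop state; an input perturbation that also changes $F$ itself would need a separate argument, which is the gap the referee's first comment points at.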

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's derivation chain, as described in the abstract, introduces a novel Closed-Loop Bidirectional Prompting method with a Semantic Anchor as a stable prior for constraining cyclic updates on frozen encoders. No equations or self-citations are quoted that reduce any claimed prediction, bootstrapping result, or consensus to fitted inputs or prior author work by construction. The central claim of rectified robust consensus via anchor-based textual-visual denoising is presented as an independent construction without tautological reduction to its own definitions or data fits. Evaluations across 11 datasets serve as external validation, rendering the approach self-contained against benchmarks rather than circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Review based on abstract only; no explicit free parameters, axioms, or invented entities detailed beyond the introduced Semantic Anchor concept.

invented entities (1)
  • Semantic Anchor (no independent evidence)
    purpose: Stable prior to constrain cyclic updates and mitigate perturbation-induced feature corruption
    Introduced in the proposal as a key component for the feedback loop.

pith-pipeline@v0.9.1-grok · 5697 in / 1028 out tokens · 26425 ms · 2026-06-29T22:37:28.852927+00:00 · methodology

A Inference Pipeline

A.1 Closed-Loop Inference with Multi-View Aggregation

Algorithm 1 CLBP Inference with Closed Loop and Multi-View Aggregation

Require: Da...