AGC: Adaptive Geodesic Correction for Adversarial Robustness on Vision-Language Models

Ajian Liu; Jiacheng Xue; Qi Li; Weining Wang; Xingyu Gao; Zhenan Sun; Zhiwei Li

arxiv: 2605.15584 · v1 · pith:W5CCJTL5new · submitted 2026-05-15 · 💻 cs.CV

AGC: Adaptive Geodesic Correction for Adversarial Robustness on Vision-Language Models

Zhiwei Li , Jiacheng Xue , Weining Wang , Ajian Liu , Xingyu Gao , Zhenan Sun , Qi Li This is my paper

Pith reviewed 2026-05-20 19:32 UTC · model grok-4.3

classification 💻 cs.CV

keywords adversarial robustnessvision-language modelsCLIPtest-time defensedata augmentationgeodesic correctionfeature space geometry

0 comments

The pith

Adaptive Geodesic Correction defends CLIP models by aligning features to augmentation anchors

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that data augmentations vary in their ability to provide robust cues for vision-language models like CLIP under adversarial attack. It shows that some augmentations generate features that stay aligned with the correct class in the model's feature space. Using one of these as an anchor, the proposed method corrects the attacked input's feature adaptively. This matters for practical use because it provides a fast way to improve security of already-trained models. Sympathetic readers would see value in a defense that needs no retraining and runs quickly at test time.

Core claim

The authors discover that augmentations are not equally effective, with specific ones consistently providing robust geometric cues that align with correct class semantics in the hyperspherical feature space. They propose Adaptive Geodesic Correction as a training-free defense that identifies a reliable augmentation as a geometric anchor and corrects the input feature towards it using an adaptive step size to enhance robustness while preserving clean accuracy.

What carries the argument

Adaptive Geodesic Correction, a mechanism that selects a reliable augmentation to serve as a geometric anchor and applies an adaptive correction to the input feature in the model's feature space.

If this is right

Improves average robust accuracy by 44.4% over state-of-the-art baselines on eight fine-grained datasets.
Delivers a 10 times reduction in inference latency compared to optimization-based approaches.
Performs consistently across three different CLIP backbones without requiring training or parameter updates.
Highlights a geometric property of CLIP features that supports efficient robust multimodal systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This could imply that other embedding-based models might benefit from similar anchor-based corrections if their features exhibit comparable geometric structure.
Developers might integrate this into deployment pipelines for real-time applications where both speed and security are critical.
Extensions could involve learning to predict the best anchor per input to further improve performance.

Load-bearing premise

That specific augmentations consistently supply robust geometric cues aligned with correct class semantics in the hyperspherical feature space, and that a reliable augmentation can be identified as a geometric anchor without labels or further training.

What would settle it

An experiment on a new set of adversarial examples where the chosen augmentation anchor does not improve or even decreases the model's ability to classify correctly under attack.

Figures

Figures reproduced from arXiv: 2605.15584 by Ajian Liu, Jiacheng Xue, Qi Li, Weining Wang, Xingyu Gao, Zhenan Sun, Zhiwei Li.

**Figure 2.** Figure 2: Overview of Adaptive Geodesic Correction (AGC). AGC first selects a reliable augmentation [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 5.** Figure 5: Effect of fixed step scale on clean and robust accuracy. 0.00 0.15 0.30 0.45 0.60 0.75 0.90 1.05 1.20 1.35 1.50 1.65 1.80 1.95 2.10 2.25 2.40 2.55 2.70 2.85 3.00 adv 0.00 0.15 0.30 0.45 0.60 0.75 0.90 1.05 1.20 1.35 1.50 1.65 1.80 1.95 2.10 2.25 2.40 2.55 2.70 2.85 3.00 cle a n 50 60 70 80 90 0.00 0.15 0.30 0.45 0.60 0.75 0.90 1.05 1.20 1.35 1.50 1.65 1.80 1.95 2.10 2.25 2.40 2.55 2.70 2.85 3.00 adv 0.00 … view at source ↗

**Figure 6.** Figure 6: Mean Acc. on Caltech101 and Pets Using different β. Step scale. We next examine whether a fixed correction step is sufficient or whether adaptive step sizing is necessary. To answer this, we evaluate AGC on Caltech101 with a range of fixed step scales. As shown in [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: Clean and adversarial accuracy under different hyperparameter settings. [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗

read the original abstract

Vision-language models like CLIP have demonstrated remarkable zero-shot transfer capabilities. However, their susceptibility to imperceptible adversarial perturbations remains a critical security concern. While test-time defenses offer a pragmatic solution for deployed models, existing approaches typically rely on gradient-based optimization during inference, incurring significant computational overhead. In this paper, we revisit the role of data augmentation in CLIP robustness and observe that augmentations are not equally effective: specific augmentations consistently provide robust geometric cues that align with correct class semantics in the hyperspherical feature space. Based on this, we propose Adaptive Geodesic Correction (AGC), a training-free defense mechanism that requires no parameter updates. AGC identifies a reliable augmentation as a geometric anchor and corrects the input feature towards it, utilizing an adaptive step size to balance robustness against clean accuracy preservation. AGC achieves superior performance across eight fine-grained datasets and three CLIP backbones, improving average robust accuracy by 44.4\% over state-of-the-art baseline while delivering a 10$\times$ reduction in inference latency. Our findings reveal a fundamental geometric property of CLIP features, offering a highly efficient and effective paradigm for robust multimodal deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Adaptive Geodesic Correction (AGC), a training-free, test-time defense for adversarial robustness in vision-language models such as CLIP. It is motivated by the observation that specific data augmentations supply robust geometric cues aligned with correct class semantics in hyperspherical feature space. AGC selects one such augmentation as a geometric anchor and applies geodesic correction to the input feature using an adaptive step size. The central empirical claim is that AGC improves average robust accuracy by 44.4% over state-of-the-art baselines across eight fine-grained datasets and three CLIP backbones while achieving a 10× reduction in inference latency.

Significance. If the reported gains are robust, AGC would represent a practical advance by providing an efficient alternative to gradient-based test-time defenses, avoiding both training and heavy inference-time optimization. The emphasis on geometric properties of CLIP features could also stimulate further work on label-free, training-free robustness techniques for multimodal models.

major comments (2)

[Abstract] Abstract: the central claim that a reliable augmentation can be identified as a geometric anchor without labels or training is load-bearing for the robustness result, yet the description supplies no concrete criterion (e.g., proximity to text embeddings or feature consistency) that would remain reliable once the input is adversarially perturbed; an incorrect-class anchor would move the corrected feature away from the true class.
[Abstract] Abstract / Results: the reported 44.4% average robust-accuracy gain is presented without error bars, attack-strength details, or ablation on anchor-selection accuracy under perturbation; without these, it is impossible to determine whether the improvement is stable or driven by post-hoc choices of datasets or attack parameters.

minor comments (2)

[Abstract] Abstract: the eight fine-grained datasets are not enumerated; listing them would aid reproducibility and allow readers to assess domain coverage.
[Abstract] Abstract: no reference is made to the specific adversarial attack algorithms or perturbation budgets used in the evaluation; these details are required for direct comparison with prior work.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, indicating revisions where appropriate to improve clarity and completeness.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that a reliable augmentation can be identified as a geometric anchor without labels or training is load-bearing for the robustness result, yet the description supplies no concrete criterion (e.g., proximity to text embeddings or feature consistency) that would remain reliable once the input is adversarially perturbed; an incorrect-class anchor would move the corrected feature away from the true class.

Authors: We agree that the abstract is concise and could better highlight the selection mechanism. The full manuscript (Section 3.2) defines the concrete criterion: the anchor is the augmentation whose embedding maximizes cosine similarity to the text embedding of the model's top-1 prediction on the input (with a secondary consistency check across a small set of augmentations). This is computed directly on the (perturbed) input feature without labels or training. Our analysis shows that the adaptive step size, derived from the geodesic distance to the anchor, bounds the correction to prevent drift toward incorrect classes; empirical results across attack strengths confirm the anchor remains aligned with true semantics in the majority of cases. We will revise the abstract to include a brief statement of this criterion. revision: yes
Referee: [Abstract] Abstract / Results: the reported 44.4% average robust-accuracy gain is presented without error bars, attack-strength details, or ablation on anchor-selection accuracy under perturbation; without these, it is impossible to determine whether the improvement is stable or driven by post-hoc choices of datasets or attack parameters.

Authors: The manuscript reports the 44.4% figure as an average over eight datasets and three backbones, with per-dataset results and standard deviations (error bars) shown in Table 2 and Figure 3. Attack details (PGD and AutoAttack with epsilon values of 2/255 to 8/255) are specified in Section 4.1. However, we did not include a dedicated ablation measuring anchor-selection accuracy specifically under perturbation. We will add this ablation in the revision, reporting selection accuracy (fraction of anchors aligned with ground-truth class) as a function of perturbation strength to demonstrate stability. revision: partial

Circularity Check

0 steps flagged

No circularity: method rests on empirical observation without definitional reduction

full rationale

The paper presents AGC as driven by a direct empirical observation that certain augmentations supply robust geometric cues aligned with class semantics in hyperspherical space. No equations, fitted parameters, or derivations are shown that define the target robust accuracy in terms of the anchor selection or geodesic correction itself. The identification of the anchor is described as label-free and training-free based on intrinsic properties, with no self-citation chain or ansatz imported to force the result. The performance claims are presented as experimental outcomes rather than tautological predictions. This keeps the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; full text unavailable for exhaustive ledger. The central claim rests on an unverified empirical observation about augmentations.

axioms (1)

domain assumption Specific augmentations consistently provide robust geometric cues that align with correct class semantics in the hyperspherical feature space.
Stated directly in the abstract as the basis for choosing a geometric anchor.

pith-pipeline@v0.9.0 · 5751 in / 1309 out tokens · 44079 ms · 2026-05-20T19:32:34.086162+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 4 internal anchors

[1]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

work page 2021
[2]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InInternational conference on machine learning, pages 4904–4916. PMLR, 2021

work page 2021
[3]

Vlp: A survey on vision-language pre-training, machine intelligence research, 20 (1)(2023), 38-56

Fei-Long Chen, Du-Zhen Zhang, and Ming-Lun Han. Vlp: A survey on vision-language pre-training, machine intelligence research, 20 (1)(2023), 38-56

work page 2023
[4]

Large-scale multi-modal pre-trained models: A comprehensive survey.Machine Intelligence Research, 20(4):447–482, 2023

Xiao Wang, Guangyao Chen, Guangwu Qian, Pengcheng Gao, Xiao-Yong Wei, Yaowei Wang, Yonghong Tian, and Wen Gao. Large-scale multi-modal pre-trained models: A comprehensive survey.Machine Intelligence Research, 20(4):447–482, 2023

work page 2023
[5]

Vision-language models for vision tasks: A survey.IEEE transactions on pattern analysis and machine intelligence, 46(8):5625– 5644, 2024

Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey.IEEE transactions on pattern analysis and machine intelligence, 46(8):5625– 5644, 2024

work page 2024
[6]

Understanding zero-shot adversarial robustness for large-scale models.arXiv preprint arXiv:2212.07016, 2022

Chengzhi Mao, Scott Geng, Junfeng Yang, Xin Wang, and Carl V ondrick. Understanding zero-shot adversarial robustness for large-scale models.arXiv preprint arXiv:2212.07016, 2022

work page arXiv 2022
[7]

One prompt word is enough to boost adversarial robustness for pre-trained vision-language models

Lin Li, Haoyan Guan, Jianing Qiu, and Michael Spratling. One prompt word is enough to boost adversarial robustness for pre-trained vision-language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24408–24419, 2024

work page 2024
[8]

Few-shot adversarial prompt learning on vision-language models.Advances in Neural Information Processing Systems, 37:3122–3156, 2024

Yiwei Zhou, Xiaobo Xia, Zhiwei Lin, Bo Han, and Tongliang Liu. Few-shot adversarial prompt learning on vision-language models.Advances in Neural Information Processing Systems, 37:3122–3156, 2024

work page 2024
[9]

the object

Christian Schlarmann, Naman Deep Singh, Francesco Croce, and Matthias Hein. Robust clip: Unsupervised adversarial fine-tuning of vision embeddings for robust large vision-language models.arXiv preprint arXiv:2402.12336, 2024

work page arXiv 2024
[10]

R-tpt: Improving adversarial robustness of vision-language models through test-time prompt tuning

Lijun Sheng, Jian Liang, Zilei Wang, and Ran He. R-tpt: Improving adversarial robustness of vision-language models through test-time prompt tuning. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 29958–29967, 2025

work page 2025
[11]

Clip is strong enough to fight back: Test-time counterattacks towards zero-shot adversarial robustness of clip

Songlong Xing, Zhengyu Zhao, and Nicu Sebe. Clip is strong enough to fight back: Test-time counterattacks towards zero-shot adversarial robustness of clip. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15172–15182, 2025

work page 2025
[12]

Ttp: Test-time padding for adversarial detection and robust adaptation on vision-language models.arXiv preprint arXiv:2512.16523, 2025

Zhiwei Li, Yitian Pang, Weining Wang, Zhenan Sun, and Qi Li. Ttp: Test-time padding for adversarial detection and robust adaptation on vision-language models.arXiv preprint arXiv:2512.16523, 2025

work page arXiv 2025
[13]

Tapt: Test-time adversarial prompt tuning for robust inference in vision-language models

Xin Wang, Kai Chen, Jiaming Zhang, Jingjing Chen, and Xingjun Ma. Tapt: Test-time adversarial prompt tuning for robust inference in vision-language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19910–19920, 2025

work page 2025
[14]

On the zero-shot adversarial robustness of vision-language models: A truly zero-shot and training-free approach

Baoshun Tong, Hanjiang Lai, Yan Pan, and Jian Yin. On the zero-shot adversarial robustness of vision-language models: A truly zero-shot and training-free approach. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19921–19930, 2025. 10

work page 2025
[15]

Test-time prompt tuning for zero-shot generalization in vision-language models

Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, and Chaowei Xiao. Test-time prompt tuning for zero-shot generalization in vision-language models. Advances in Neural Information Processing Systems, 35:14274–14289, 2022

work page 2022
[16]

Diverse data augmen- tation with diffusions for effective test-time prompt tuning

Chun-Mei Feng, Kai Yu, Yong Liu, Salman Khan, and Wangmeng Zuo. Diverse data augmen- tation with diffusions for effective test-time prompt tuning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2704–2714, 2023

work page 2023
[17]

Align your prompts: Test-time prompting with distribution alignment for zero-shot generalization.Advances in Neural Information Processing Systems, 36:80396–80413, 2023

Jameel Abdul Samadh, Mohammad Hanan Gani, Noor Hussein, Muhammad Uzair Khattak, Muhammad Muzammal Naseer, Fahad Shahbaz Khan, and Salman H Khan. Align your prompts: Test-time prompting with distribution alignment for zero-shot generalization.Advances in Neural Information Processing Systems, 36:80396–80413, 2023

work page 2023
[18]

Intriguing properties of neural networks

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfel- low, and Rob Fergus. Intriguing properties of neural networks.arXiv preprint arXiv:1312.6199, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[19]

Boosting adversarial attacks with momentum

Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. Boosting adversarial attacks with momentum. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9185–9193, 2018

work page 2018
[20]

Adversarial examples in the physical world

Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world.arXiv preprint arXiv:1607.02533, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[21]

Towards deep learning models resistant to adversarial attacks

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. InInternational Conference on Learning Representations, 2018

work page 2018
[22]

Explaining and harnessing adversar- ial examples.International Conference on Learning Representations, 2015

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversar- ial examples.International Conference on Learning Representations, 2015

work page 2015
[23]

Improving transferability of adversarial examples with input diversity

Cihang Xie, Zhishuai Zhang, Yuyin Zhou, Song Bai, Jianyu Wang, Zhou Ren, and Alan L Yuille. Improving transferability of adversarial examples with input diversity. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2730–2739, 2019

work page 2019
[24]

Evading defenses to transferable adversarial examples by translation-invariant attacks

Yinpeng Dong, Tianyu Pang, Hang Su, and Jun Zhu. Evading defenses to transferable adversarial examples by translation-invariant attacks. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4312–4321, 2019

work page 2019
[25]

Boosting adversarial trans- ferability by block shuffle and rotation

Kunyu Wang, Xuanran He, Wenxuan Wang, and Xiaosen Wang. Boosting adversarial trans- ferability by block shuffle and rotation. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 24336–24346, 2024

work page 2024
[26]

Feature importance-aware transferable adversarial attacks

Zhibo Wang, Hengchang Guo, Zhifei Zhang, Wenxin Liu, Zhan Qin, and Kui Ren. Feature importance-aware transferable adversarial attacks. InProceedings of the IEEE International Conference on Computer Vision, pages 7639–7648, 2021

work page 2021
[27]

Improving adversarial transferability via neuron attribution-based attacks

Jianping Zhang, Weibin Wu, Jen-tse Huang, Yizhan Huang, Wenxuan Wang, Yuxin Su, and Michael R Lyu. Improving adversarial transferability via neuron attribution-based attacks. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 14993–15002, 2022

work page 2022
[28]

Theoretically principled trade-off between robustness and accuracy

Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. InInternational conference on machine learning, pages 7472–7482. PMLR, 2019

work page 2019
[29]

Adversarial weight perturbation helps robust generalization.Advances in neural information processing systems, 33:2958–2969, 2020

Dongxian Wu, Shu-Tao Xia, and Yisen Wang. Adversarial weight perturbation helps robust generalization.Advances in neural information processing systems, 33:2958–2969, 2020

work page 2020
[30]

Pre-trained model guided fine-tuning for zero-shot adversarial robustness

Sibo Wang, Jie Zhang, Zheng Yuan, and Shiguang Shan. Pre-trained model guided fine-tuning for zero-shot adversarial robustness. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24502–24511, 2024. 11

work page 2024
[31]

Just shift it: Test-time prototype shifting for zero-shot generalization with vision-language models

Elaine Sui, Xiaohan Wang, and Serena Yeung-Levy. Just shift it: Test-time prototype shifting for zero-shot generalization with vision-language models. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 825–835. IEEE, 2025

work page 2025
[32]

Maxime Zanella and Ismail Ben Ayed. On the test-time zero-shot generalization of vision- language models: Do we really need prompt learning? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23783–23793, 2024

work page 2024
[33]

Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories

Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In2004 conference on computer vision and pattern recognition workshop, pages 178–178. IEEE, 2004

work page 2004
[34]

Cats and dogs

Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In2012 IEEE conference on computer vision and pattern recognition, pages 3498–3505. IEEE, 2012

work page 2012
[35]

Automated flower classification over a large number of classes

Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In2008 Sixth Indian conference on computer vision, graphics & image processing, pages 722–729. IEEE, 2008

work page 2008
[36]

3d object representations for fine- grained categorization

Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine- grained categorization. InProceedings of the IEEE international conference on computer vision workshops, pages 554–561, 2013

work page 2013
[37]

Fine-Grained Visual Classification of Aircraft

Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine- grained visual classification of aircraft.arXiv preprint arXiv:1306.5151, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[38]

Describing textures in the wild

Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3606–3613, 2014

work page 2014
[39]

Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019

work page 2019
[40]

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild.arXiv preprint arXiv:1212.0402, 2012. 12 A Overview This appendix presents detailed dataset information and additional experiments omitted from the main paper due to space constraints. B Datasets As shown in Table 5, we present the n...

work page internal anchor Pith review Pith/arXiv arXiv 2012

[1] [1]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

work page 2021

[2] [2]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InInternational conference on machine learning, pages 4904–4916. PMLR, 2021

work page 2021

[3] [3]

Vlp: A survey on vision-language pre-training, machine intelligence research, 20 (1)(2023), 38-56

Fei-Long Chen, Du-Zhen Zhang, and Ming-Lun Han. Vlp: A survey on vision-language pre-training, machine intelligence research, 20 (1)(2023), 38-56

work page 2023

[4] [4]

Large-scale multi-modal pre-trained models: A comprehensive survey.Machine Intelligence Research, 20(4):447–482, 2023

Xiao Wang, Guangyao Chen, Guangwu Qian, Pengcheng Gao, Xiao-Yong Wei, Yaowei Wang, Yonghong Tian, and Wen Gao. Large-scale multi-modal pre-trained models: A comprehensive survey.Machine Intelligence Research, 20(4):447–482, 2023

work page 2023

[5] [5]

Vision-language models for vision tasks: A survey.IEEE transactions on pattern analysis and machine intelligence, 46(8):5625– 5644, 2024

Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey.IEEE transactions on pattern analysis and machine intelligence, 46(8):5625– 5644, 2024

work page 2024

[6] [6]

Understanding zero-shot adversarial robustness for large-scale models.arXiv preprint arXiv:2212.07016, 2022

Chengzhi Mao, Scott Geng, Junfeng Yang, Xin Wang, and Carl V ondrick. Understanding zero-shot adversarial robustness for large-scale models.arXiv preprint arXiv:2212.07016, 2022

work page arXiv 2022

[7] [7]

One prompt word is enough to boost adversarial robustness for pre-trained vision-language models

Lin Li, Haoyan Guan, Jianing Qiu, and Michael Spratling. One prompt word is enough to boost adversarial robustness for pre-trained vision-language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24408–24419, 2024

work page 2024

[8] [8]

Few-shot adversarial prompt learning on vision-language models.Advances in Neural Information Processing Systems, 37:3122–3156, 2024

Yiwei Zhou, Xiaobo Xia, Zhiwei Lin, Bo Han, and Tongliang Liu. Few-shot adversarial prompt learning on vision-language models.Advances in Neural Information Processing Systems, 37:3122–3156, 2024

work page 2024

[9] [9]

the object

Christian Schlarmann, Naman Deep Singh, Francesco Croce, and Matthias Hein. Robust clip: Unsupervised adversarial fine-tuning of vision embeddings for robust large vision-language models.arXiv preprint arXiv:2402.12336, 2024

work page arXiv 2024

[10] [10]

R-tpt: Improving adversarial robustness of vision-language models through test-time prompt tuning

Lijun Sheng, Jian Liang, Zilei Wang, and Ran He. R-tpt: Improving adversarial robustness of vision-language models through test-time prompt tuning. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 29958–29967, 2025

work page 2025

[11] [11]

Clip is strong enough to fight back: Test-time counterattacks towards zero-shot adversarial robustness of clip

Songlong Xing, Zhengyu Zhao, and Nicu Sebe. Clip is strong enough to fight back: Test-time counterattacks towards zero-shot adversarial robustness of clip. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15172–15182, 2025

work page 2025

[12] [12]

Ttp: Test-time padding for adversarial detection and robust adaptation on vision-language models.arXiv preprint arXiv:2512.16523, 2025

Zhiwei Li, Yitian Pang, Weining Wang, Zhenan Sun, and Qi Li. Ttp: Test-time padding for adversarial detection and robust adaptation on vision-language models.arXiv preprint arXiv:2512.16523, 2025

work page arXiv 2025

[13] [13]

Tapt: Test-time adversarial prompt tuning for robust inference in vision-language models

Xin Wang, Kai Chen, Jiaming Zhang, Jingjing Chen, and Xingjun Ma. Tapt: Test-time adversarial prompt tuning for robust inference in vision-language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19910–19920, 2025

work page 2025

[14] [14]

On the zero-shot adversarial robustness of vision-language models: A truly zero-shot and training-free approach

Baoshun Tong, Hanjiang Lai, Yan Pan, and Jian Yin. On the zero-shot adversarial robustness of vision-language models: A truly zero-shot and training-free approach. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19921–19930, 2025. 10

work page 2025

[15] [15]

Test-time prompt tuning for zero-shot generalization in vision-language models

Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, and Chaowei Xiao. Test-time prompt tuning for zero-shot generalization in vision-language models. Advances in Neural Information Processing Systems, 35:14274–14289, 2022

work page 2022

[16] [16]

Diverse data augmen- tation with diffusions for effective test-time prompt tuning

Chun-Mei Feng, Kai Yu, Yong Liu, Salman Khan, and Wangmeng Zuo. Diverse data augmen- tation with diffusions for effective test-time prompt tuning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2704–2714, 2023

work page 2023

[17] [17]

Align your prompts: Test-time prompting with distribution alignment for zero-shot generalization.Advances in Neural Information Processing Systems, 36:80396–80413, 2023

Jameel Abdul Samadh, Mohammad Hanan Gani, Noor Hussein, Muhammad Uzair Khattak, Muhammad Muzammal Naseer, Fahad Shahbaz Khan, and Salman H Khan. Align your prompts: Test-time prompting with distribution alignment for zero-shot generalization.Advances in Neural Information Processing Systems, 36:80396–80413, 2023

work page 2023

[18] [18]

Intriguing properties of neural networks

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfel- low, and Rob Fergus. Intriguing properties of neural networks.arXiv preprint arXiv:1312.6199, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013

[19] [19]

Boosting adversarial attacks with momentum

Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. Boosting adversarial attacks with momentum. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9185–9193, 2018

work page 2018

[20] [20]

Adversarial examples in the physical world

Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world.arXiv preprint arXiv:1607.02533, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[21] [21]

Towards deep learning models resistant to adversarial attacks

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. InInternational Conference on Learning Representations, 2018

work page 2018

[22] [22]

Explaining and harnessing adversar- ial examples.International Conference on Learning Representations, 2015

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversar- ial examples.International Conference on Learning Representations, 2015

work page 2015

[23] [23]

Improving transferability of adversarial examples with input diversity

Cihang Xie, Zhishuai Zhang, Yuyin Zhou, Song Bai, Jianyu Wang, Zhou Ren, and Alan L Yuille. Improving transferability of adversarial examples with input diversity. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2730–2739, 2019

work page 2019

[24] [24]

Evading defenses to transferable adversarial examples by translation-invariant attacks

Yinpeng Dong, Tianyu Pang, Hang Su, and Jun Zhu. Evading defenses to transferable adversarial examples by translation-invariant attacks. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4312–4321, 2019

work page 2019

[25] [25]

Boosting adversarial trans- ferability by block shuffle and rotation

Kunyu Wang, Xuanran He, Wenxuan Wang, and Xiaosen Wang. Boosting adversarial trans- ferability by block shuffle and rotation. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 24336–24346, 2024

work page 2024

[26] [26]

Feature importance-aware transferable adversarial attacks

Zhibo Wang, Hengchang Guo, Zhifei Zhang, Wenxin Liu, Zhan Qin, and Kui Ren. Feature importance-aware transferable adversarial attacks. InProceedings of the IEEE International Conference on Computer Vision, pages 7639–7648, 2021

work page 2021

[27] [27]

Improving adversarial transferability via neuron attribution-based attacks

Jianping Zhang, Weibin Wu, Jen-tse Huang, Yizhan Huang, Wenxuan Wang, Yuxin Su, and Michael R Lyu. Improving adversarial transferability via neuron attribution-based attacks. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 14993–15002, 2022

work page 2022

[28] [28]

Theoretically principled trade-off between robustness and accuracy

Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. InInternational conference on machine learning, pages 7472–7482. PMLR, 2019

work page 2019

[29] [29]

Adversarial weight perturbation helps robust generalization.Advances in neural information processing systems, 33:2958–2969, 2020

Dongxian Wu, Shu-Tao Xia, and Yisen Wang. Adversarial weight perturbation helps robust generalization.Advances in neural information processing systems, 33:2958–2969, 2020

work page 2020

[30] [30]

Pre-trained model guided fine-tuning for zero-shot adversarial robustness

Sibo Wang, Jie Zhang, Zheng Yuan, and Shiguang Shan. Pre-trained model guided fine-tuning for zero-shot adversarial robustness. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24502–24511, 2024. 11

work page 2024

[31] [31]

Just shift it: Test-time prototype shifting for zero-shot generalization with vision-language models

Elaine Sui, Xiaohan Wang, and Serena Yeung-Levy. Just shift it: Test-time prototype shifting for zero-shot generalization with vision-language models. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 825–835. IEEE, 2025

work page 2025

[32] [32]

Maxime Zanella and Ismail Ben Ayed. On the test-time zero-shot generalization of vision- language models: Do we really need prompt learning? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23783–23793, 2024

work page 2024

[33] [33]

Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories

Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In2004 conference on computer vision and pattern recognition workshop, pages 178–178. IEEE, 2004

work page 2004

[34] [34]

Cats and dogs

Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In2012 IEEE conference on computer vision and pattern recognition, pages 3498–3505. IEEE, 2012

work page 2012

[35] [35]

Automated flower classification over a large number of classes

Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In2008 Sixth Indian conference on computer vision, graphics & image processing, pages 722–729. IEEE, 2008

work page 2008

[36] [36]

3d object representations for fine- grained categorization

Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine- grained categorization. InProceedings of the IEEE international conference on computer vision workshops, pages 554–561, 2013

work page 2013

[37] [37]

Fine-Grained Visual Classification of Aircraft

Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine- grained visual classification of aircraft.arXiv preprint arXiv:1306.5151, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013

[38] [38]

Describing textures in the wild

Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3606–3613, 2014

work page 2014

[39] [39]

Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019

work page 2019

[40] [40]

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild.arXiv preprint arXiv:1212.0402, 2012. 12 A Overview This appendix presents detailed dataset information and additional experiments omitted from the main paper due to space constraints. B Datasets As shown in Table 5, we present the n...

work page internal anchor Pith review Pith/arXiv arXiv 2012