TAME: Test-Time Adversarial Prompt Tuning via Mixture-of-Experts for Vision-Language Models

Jiaming Zhang; Jiaqi Yu; Jingjing Chen; Kai Chen; Ruofan Wang; Xingjun Ma; Xin Wang; Yixu Wang; Yu-Gang Jiang

arxiv: 2605.17577 · v1 · pith:GUPUS3DTnew · submitted 2026-05-17 · 💻 cs.CV

Pith reviewed 2026-05-20 14:16 UTC · model grok-4.3

keywords: adversarial robustness · test-time adaptation · mixture of experts · prompt tuning · vision-language models · zero-shot defense · unsupervised objectives · CLIP

The pith

TAME replaces single prompts with an input-routed mixture of expert prompts tuned at test time via three unsupervised objectives to defend vision-language models against adversarial attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes TAME as a test-time method to improve the robustness of pre-trained vision-language models such as CLIP to imperceptible adversarial perturbations. It keeps a collection of learnable expert prompts and routes them according to each unlabeled test input to form a tailored defense prompt. Tuning relies only on three unsupervised signals applied to the test samples themselves, eliminating any requirement for labels or model retraining. A sympathetic reader would care because the approach targets real safety risks in open-world deployment while aiming to retain the original zero-shot performance on clean data. If the method works as described, models could be made substantially safer for practical use without the cost of task-specific fine-tuning.

Core claim

TAME reformulates test-time adversarial prompt tuning by replacing a single adaptive prompt with an input-conditioned Mixture-of-Experts (MoE) framework. It maintains a bank of learnable expert prompts and uses an input-dependent routing mechanism to aggregate a customized prompt mixture for each unlabeled test sample at inference time. Adaptation is driven by three unsupervised objectives: multi-view prediction entropy minimization, layer-wise alignment of visual token statistics to precomputed clean and adversarial reference distributions, and MoE regularization for balanced expert utilization and prompt diversity.

What carries the argument

Input-conditioned Mixture-of-Experts routing that aggregates a bank of expert prompts into a sample-specific defense prompt using three unsupervised objectives.
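
The routing idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the linear dot-product router, the function names `route_weights` and `mix_prompts`, and the toy dimensions are all hypothetical, and each expert prompt is flattened to a short vector.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_weights(image_feature, router_rows):
    """Hypothetical linear router: one dot-product score per expert, then softmax."""
    scores = [sum(w * x for w, x in zip(row, image_feature)) for row in router_rows]
    return softmax(scores)

def mix_prompts(weights, expert_prompts):
    """Aggregate the expert prompt bank into one sample-specific prompt."""
    dim = len(expert_prompts[0])
    return [sum(w * p[d] for w, p in zip(weights, expert_prompts)) for d in range(dim)]

# Toy setup (assumed values): 3 experts, 2-d image feature, 4-d prompt vectors.
expert_prompts = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
router_rows = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
image_feature = [1.0, 0.0]

weights = route_weights(image_feature, router_rows)  # sums to 1, favors expert 0
prompt = mix_prompts(weights, expert_prompts)        # weighted blend of the bank
```

A dense softmax mixture is used here only because it is the simplest choice; whether the paper's router is dense or sparse (top-k) is not stated in the text above.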

If this is right

TAME raises zero-shot adversarial robustness of the original CLIP by at least 49.1 percent under AutoAttack.
Clean-sample generalization is largely preserved across the evaluated datasets.
The approach outperforms prior adversarial prompt tuning methods by an average of at least 30.2 percent across multiple prompt designs.
Consistent gains appear on ImageNet and ten additional zero-shot datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The routing mechanism could allow experts to specialize on different input distributions or attack patterns encountered after deployment.
Similar test-time expert mixtures might transfer to other multimodal models that face robustness issues in open settings.
The absence of labels at tuning time points toward potential use in streaming or continually changing environments.

Load-bearing premise

The three unsupervised objectives suffice to produce effective customized prompts from unlabeled test samples without any task labels or supervision.

What would settle it

The claim would be undermined if running TAME on ImageNet or another benchmark under AutoAttack yielded a robustness improvement below 20 percent, or a clear drop in clean-sample accuracy relative to the original CLIP model.

Figures

Figures reproduced from arXiv: 2605.17577 by Jiaming Zhang, Jiaqi Yu, Jingjing Chen, Kai Chen, Ruofan Wang, Xingjun Ma, Xin Wang, Yixu Wang, Yu-Gang Jiang.

**Figure 1.** Inference with different prompts. Top: inference with hand-crafted prompts fails to recognize the class ‘cat’. Bottom: inference with test-time adversarial mixture-of-experts optimized for each image produces accurate recognitions.

**Figure 2.** Illustration of adversarial prompt tuning designs. (a)–(d) show single-prompt designs, including Textual Prompts, Visual …

**Figure 3.** An overview of our proposed TAME method. Given an adversarial image, TAME generates multiple augmented views …

**Figure 4.** Adversarial robustness (%) of our TAME method under different test-time robustness steps (i.e., …

**Figure 6.** Effect of prompt depth and prompt length on ImageNet.

**Figure 7.** Effect of the number of experts on ImageNet.

**Figure 8.** Effect of the alignment layer range. Each cell reports …

Original abstract

Large-scale pre-trained Vision-Language models (VLMs), such as CLIP, exhibit strong zero-shot generalization, yet remain highly vulnerable to imperceptible adversarial perturbations, raising serious safety concerns for open-world deployment. To enhance robustness without requiring downstream task-specific retraining, we propose TAME, a novel test-time defense. Building upon our prior Test-Time Adversarial Prompt Tuning (TAPT), TAME introduces an architectural reformulation by replacing TAPT's single adaptive prompt with an input-conditioned Mixture-of-Experts (MoE) framework, enabling more expressive and adaptive defense. Specifically, TAME maintains a bank of learnable expert prompts and employs an input-dependent routing mechanism to aggregate a customized prompt mixture for each unlabeled test sample at inference time. This test-time defense mechanism is driven by three unsupervised objectives: (1) multi-view prediction entropy minimization, (2) layer-wise alignment of visual token statistics to precomputed clean and adversarial reference distributions, and (3) MoE regularization for balanced expert utilization and prompt diversity. We evaluated TAME on 11 benchmark datasets, including ImageNet and 10 additional zero-shot datasets. The results show that TAME improves the zero-shot adversarial robustness of the original CLIP by at least 49.1% under AutoAttack while largely preserving generalization on clean samples. TAME also consistently outperforms existing adversarial prompt tuning methods across multiple prompt designs, yielding an average robustness gain of at least 30.2%.
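
To make the three unsupervised signals concrete, the sketch below computes a toy version of each in plain Python. It is an illustration under assumptions, not the paper's formulation: tokens are scalars, alignment uses an L1 distance on per-layer mean and variance against a single reference set (how clean and adversarial references are combined is not specified in the abstract), the balance term is a squared deviation from uniform expert usage, and the prompt-diversity part of the regularizer is omitted. All function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def multiview_entropy(view_probs):
    """Entropy of the class prediction averaged over augmented views."""
    n_views = len(view_probs)
    n_classes = len(view_probs[0])
    avg = [sum(v[c] for v in view_probs) / n_views for c in range(n_classes)]
    return entropy(avg)

def mean_var(tokens):
    """Mean and (population) variance of one layer's token activations."""
    mu = sum(tokens) / len(tokens)
    var = sum((t - mu) ** 2 for t in tokens) / len(tokens)
    return mu, var

def alignment_loss(layer_tokens, ref_stats):
    """L1 gap between test-sample token statistics and references, summed over layers."""
    loss = 0.0
    for tokens, (ref_mu, ref_var) in zip(layer_tokens, ref_stats):
        mu, var = mean_var(tokens)
        loss += abs(mu - ref_mu) + abs(var - ref_var)
    return loss

def balance_loss(routing_weights):
    """Squared deviation of expert usage from the uniform distribution."""
    k = len(routing_weights)
    return sum((w - 1.0 / k) ** 2 for w in routing_weights)

# Toy inputs (assumed values, not data from the paper).
view_probs = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]]  # 2 views, 3 classes
layer_tokens = [[0.0, 2.0], [1.0, 1.0]]          # 2 layers, 2 scalar tokens each
ref_stats = [(1.0, 1.0), (0.0, 0.0)]             # (mean, variance) per layer
routing = [0.5, 0.5]                             # perfectly balanced routing

l_ent = multiview_entropy(view_probs)
l_align = alignment_loss(layer_tokens, ref_stats)
l_moe = balance_loss(routing)
```

In this toy setup the first layer matches its reference exactly, so the alignment term comes entirely from the second layer, and balanced routing gives a zero regularizer.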

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TAME swaps TAPT's single prompt for an MoE router tuned at test time with three unsupervised losses, and the reported robustness lift on CLIP is large but rests on untested assumptions about adaptive attacks.

TAME replaces the single adaptive prompt from the authors' earlier TAPT work with a bank of expert prompts and an input-dependent router. For each unlabeled test sample it assembles a custom mixture and optimizes it on the fly using entropy minimization across views, layer-wise alignment of visual token statistics to clean and adversarial references, and a regularizer that encourages expert diversity. That architectural shift is the concrete change.

Referee Report

2 major / 2 minor

Summary. The paper introduces TAME, an extension of the authors' prior Test-Time Adversarial Prompt Tuning (TAPT) that replaces a single adaptive prompt with an input-conditioned Mixture-of-Experts (MoE) bank of learnable expert prompts. For each unlabeled test sample, an input-dependent router aggregates a customized prompt mixture. Adaptation is driven by three unsupervised objectives: multi-view prediction entropy minimization, layer-wise alignment of visual token statistics to precomputed clean and adversarial reference distributions, and MoE regularization for balanced expert utilization and prompt diversity. On 11 benchmark datasets (ImageNet plus 10 additional zero-shot datasets), TAME is reported to raise CLIP's adversarial robustness by at least 49.1% under AutoAttack while largely preserving clean accuracy, and to outperform prior adversarial prompt-tuning baselines by an average of at least 30.2%.

Significance. If the reported robustness improvements are shown to survive adaptive attacks that target the test-time routing and optimization steps themselves, the work would offer a practical, training-free route to hardening zero-shot VLMs for open-world deployment. The MoE reformulation and the three unsupervised objectives constitute a concrete architectural and objective-level advance over single-prompt test-time tuning; the multi-dataset empirical evaluation is a strength.

major comments (2)

[Evaluation section] The headline 49.1% robustness gain is measured by applying standard AutoAttack to the final tuned prompt. Because prompt selection, routing, and the three unsupervised objectives are executed at inference time on each unlabeled sample, a white-box adversary can in principle differentiate through the entire adaptation pipeline. The manuscript does not report results under such adaptive attacks, leaving the central robustness claim dependent on an untested assumption that the test-time process is not itself exploitable.
[Experiments] No statistical significance, confidence intervals, or multiple random seeds are reported for the robustness gains across the 11 datasets. Without these, it is impossible to determine whether the observed improvements over baselines are reliable or could be explained by optimization variance in the test-time procedure.
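
The kind of reporting requested in the second comment can be sketched as follows. This is a minimal example using only the Python standard library; the accuracy values are made up for illustration and are not results from the paper, and `summarize_runs` is a hypothetical helper.

```python
import math
import statistics

def summarize_runs(accuracies, z=1.96):
    """Mean, sample standard deviation, and a normal-approximation 95% CI over seeds."""
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies)  # sample standard deviation (divides by n - 1)
    half_width = z * std / math.sqrt(len(accuracies))
    return mean, std, (mean - half_width, mean + half_width)

# Hypothetical robust accuracies (%) from five random seeds.
runs = [50.0, 52.0, 51.0, 49.0, 53.0]
mean, std, ci = summarize_runs(runs)
print(f"{mean:.1f} +/- {std:.2f} (95% CI {ci[0]:.2f} to {ci[1]:.2f})")
```

With only a handful of seeds, a Student-t interval would be wider than this normal approximation and is arguably the safer choice.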

minor comments (2)

[Method] The description of the three unsupervised objectives would benefit from explicit loss equations or pseudocode to clarify how the layer-wise alignment and MoE regularization terms are combined with the entropy minimization objective.
[Tables] Table captions and axis labels should explicitly state whether reported numbers are means over multiple runs or single-run results.
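
As an indication of the kind of equation the first minor comment asks for, one plausible form of the combined objective is a weighted sum of the three terms. This is an assumption for illustration only: the weights $\lambda_{\text{align}}$ and $\lambda_{\text{moe}}$, the symbol names, and the marginal-entropy form of the first term are not taken from the paper. Here $p_c^{(v)}$ denotes the predicted probability of class $c$ on augmented view $v$, with $V$ views and $C$ classes.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{ent}}
  + \lambda_{\text{align}} \, \mathcal{L}_{\text{align}}
  + \lambda_{\text{moe}} \, \mathcal{L}_{\text{moe}},
\qquad
\mathcal{L}_{\text{ent}}
  = -\sum_{c=1}^{C} \bar{p}_c \log \bar{p}_c,
\quad
\bar{p}_c = \frac{1}{V} \sum_{v=1}^{V} p_c^{(v)}
```

The exact definitions of the alignment and regularization terms would still need to come from the authors.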

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the work.

Referee: [Evaluation section] The headline 49.1% robustness gain is measured by applying standard AutoAttack to the final tuned prompt. Because prompt selection, routing, and the three unsupervised objectives are executed at inference time on each unlabeled sample, a white-box adversary can in principle differentiate through the entire adaptation pipeline. The manuscript does not report results under such adaptive attacks, leaving the central robustness claim dependent on an untested assumption that the test-time process is not itself exploitable.

Authors: We agree that a fully adaptive white-box attack capable of differentiating through the input-dependent router, prompt aggregation, and the three unsupervised objectives would constitute a stronger evaluation. Our current results follow the standard AutoAttack protocol used in prior test-time and prompt-tuning robustness papers. To address this point directly, we will add a new set of experiments in the revised manuscript that evaluate TAME under adaptive attacks targeting the full test-time pipeline. We will also discuss the computational trade-offs involved in such attacks.

Revision: yes

Referee: [Experiments] No statistical significance, confidence intervals, or multiple random seeds are reported for the robustness gains across the 11 datasets. Without these, it is impossible to determine whether the observed improvements over baselines are reliable or could be explained by optimization variance in the test-time procedure.

Authors: We acknowledge that reporting variability across multiple random seeds and providing confidence intervals would improve the reliability assessment of the reported gains. The original experiments used a fixed seed for reproducibility across the 11 datasets, but multiple independent runs were not performed or reported. In the revision we will rerun the main experiments with several random seeds, report mean performance with standard deviations, and include confidence intervals for the key robustness metrics.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks rather than self-referential derivations.

The paper introduces TAME as an architectural extension of the authors' prior TAPT method, using an MoE framework with three unsupervised objectives (entropy minimization, layer-wise token alignment, and regularization) for test-time prompt adaptation on unlabeled samples. Central performance claims, such as at least 49.1% robustness improvement under AutoAttack while preserving clean accuracy, are derived from direct empirical evaluations across 11 standard zero-shot datasets including ImageNet. No equations, predictions, or uniqueness arguments reduce these gains to quantities defined solely by internal fits, self-citations, or ansatzes; the results are measured against independent external attack methods and datasets, rendering the derivation chain self-contained.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard unsupervised learning assumptions for test-time adaptation and the effectiveness of the proposed objectives; no new physical entities are introduced.

free parameters (2)

Number of experts in MoE bank: hyperparameter controlling the size of the learnable expert prompt bank.
Routing network parameters: learned weights for the input-dependent expert selection mechanism.

axioms (1)

Domain assumption: unsupervised objectives suffice to learn robust prompts from unlabeled test samples. Invoked when stating that the three objectives drive the defense without downstream labels.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "TAME improves the zero-shot adversarial robustness of the original CLIP by at least 49.1% under AutoAttack while largely preserving generalization on clean samples."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · 6 internal anchors

[1] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021
[2] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in ICML, 2021
[3] G. Zhang, J. Ren, J. Gu, and V. Tresp, “Multi-event video-text retrieval,” in ICCV, 2023
[4] Z. Huang, F. Bianchi, M. Yuksekgonul, T. J. Montine, and J. Zou, “A visual–language foundation model for pathology image analysis using medical twitter,” Nature Medicine, 2023
[5] Z. Wang, Z. Wu, D. Agarwal, and J. Sun, “Medclip: Contrastive learning from unpaired medical images and text,” in EMNLP, 2022
[6] Q. Min, X. Wang, B. Huang, and Z. Zhou, “Lossless medical image compression based on anatomical information and deep neural networks,” Biomedical Signal Processing and Control, 2022
[7] Q. Min, X. Wang, B. Huang, and L. Xu, “Web-based technology for remote viewing of radiological images: App validation,” Journal of Medical Internet Research, 2020
[8] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausman et al., “Do as i can, not as i say: Grounding language in robotic affordances,” preprint arXiv:2204.01691, 2022
[9] M. Shridhar, L. Manuelli, and D. Fox, “Cliport: What and where pathways for robotic manipulation,” in CoRL, 2022
[10] A. Khandelwal, L. Weihs, R. Mottaghi, and A. Kembhavi, “Simple but effective: Clip embeddings for embodied ai,” in CVPR, 2022
[11] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” in ICLR, 2013
[12] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” in ICLR, 2018
[13] Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li, “Boosting adversarial attacks with momentum,” in CVPR, 2018
[14] J. Zhang, Q. Yi, and J. Sang, “Towards adversarial attack on vision-language pre-training models,” in ACM MM, 2022
[15] Y. Zhao, T. Pang, C. Du, X. Yang, C. Li, N.-M. M. Cheung, and M. Lin, “On evaluating adversarial robustness of large vision-language models,” in NeurIPS, 2024
[16] X. Wang, Y. Chen, J. Li, Y. Wang, Y. Yao, T. Gu, J. Li, Y. Teng, Y. Wang, and X. Hu, “Openrt: An open-source red teaming framework for multimodal llms,” arXiv preprint arXiv:2601.01592, 2026
[17] X. Wang, J. Li, Z. Weng, Y. Wang, Y. Gao, T. Pang, C. Du, Y. Teng, Y. Wang, Z. Wu et al., “Freezevla: Action-freezing attacks against vision-language-action models,” arXiv preprint arXiv:2509.19870, 2025
[18] X. Ma, Y. Gao, Y. Wang, R. Wang, X. Wang, Y. Sun, Y. Ding, H. Xu, Y. Chen, Y. Zhao et al., “Safety at scale: A comprehensive survey of large model and agent safety,” Foundations and Trends in Privacy and Security, 2026
[19] K. Gao, Y. Li, C. Du, X. Wang, X. Ma, S.-T. Xia, and T. Pang, “Imperceptible jailbreaking against large language models,” arXiv preprint arXiv:2510.05025, 2025
[20] Y. Chen, X. Wang, J. Li, Y. Wang, J. Li, Y. Teng, Y. Wang, and X. Ma, “Evolve the method, not the prompts: Evolutionary synthesis of jailbreak attacks on llms,” arXiv preprint arXiv:2511.12710, 2025
[21] H. Zhang, Y. Yu, J. Jiao, E. Xing, L. El Ghaoui, and M. Jordan, “Theoretically principled trade-off between robustness and accuracy,” in ICML, 2019
[22] Z. Gan, Y.-C. Chen, L. Li, C. Zhu, Y. Cheng, and J. Liu, “Large-scale adversarial training for vision-and-language representation learning,” in NeurIPS, 2020
[23] Z. Wang, X. Li, H. Zhu, and C. Xie, “Revisiting adversarial training at scale,” in CVPR, 2024
[24] D. Su, H. Zhang, H. Chen, J. Yi, P.-Y. Chen, and Y. Gao, “Is robustness the cost of accuracy?–a comprehensive study on the robustness of 18 deep image classification models,” in ECCV, 2018
[25] A. Pedraza, O. Deniz, and G. Bueno, “On the relationship between generalization and robustness to adversarial examples,” Symmetry, 2021
[26] X. Wang, K. Chen, J. Zhang, J. Chen, and X. Ma, “Tapt: Test-time adversarial prompt tuning for robust inference in vision-language models,” in CVPR, 2025
[27] J. Zhang, X. Ma, X. Wang, L. Qiu, J. Wang, Y.-G. Jiang, and J. Sang, “Adversarial prompt tuning for vision-language models,” in ECCV, 2024
[28] L. Li, H. Guan, J. Qiu, and M. Spratling, “One prompt word is enough to boost adversarial robustness for pre-trained vision-language models,” in CVPR, 2024
[29] L. Luo, X. Wang, B. Zi, S. Zhao, X. Ma, and Y.-G. Jiang, “Adversarial prompt distillation for vision-language models,” in ICASSP, 2026
[30] Y. Zhou, X. Xia, Z. Lin, B. Han, and T. Liu, “Few-shot adversarial prompt learning on vision-language models,” in NeurIPS, 2024
[31] Z. Zhou, S. Hu, M. Li, H. Zhang, Y. Zhang, and H. Jin, “Advclip: Downstream-agnostic adversarial examples in multimodal contrastive learning,” in ACM MM, 2023
[32] X. Ma, L. Jiang, H. Huang, Z. Weng, J. Bailey, and Y.-G. Jiang, “Imbalanced gradients: a subtle cause of overestimated adversarial robustness,” Machine Learning, 2024
[33] C. Xie, Z. Zhang, Y. Zhou, S. Bai, J. Wang, Z. Ren, and A. L. Yuille, “Improving transferability of adversarial examples with input diversity,” in CVPR, 2019
[34] D. Lu, Z. Wang, T. Wang, W. Guan, H. Gao, and F. Zheng, “Set-level guidance attack: Boosting adversarial transferability of vision-language pre-training models,” in ICCV, 2023
[35] B. He, X. Jia, S. Liang, T. Lou, Y. Liu, and X. Cao, “Sa-attack: Improving adversarial transferability of vision-language pre-training models via self-augmentation,” preprint arXiv:2312.04913, 2023
[36] H. Wang, K. Dong, Z. Zhu, H. Qin, A. Liu, X. Fang, J. Wang, and X. Liu, “Transferable multimodal attack on vision-language pre-training models,” in IEEE S&P, 2024
[37] Z. Yin, M. Ye, T. Zhang, T. Du, J. Zhu, H. Liu, J. Chen, T. Wang, and F. Ma, “Vlattack: Multimodal adversarial attacks on vision-language tasks via pre-trained models,” in NeurIPS, 2023
[38] H. Fang, J. Kong, W. Yu, B. Chen, J. Li, S. Xia, and K. Xu, “One perturbation is enough: On generating universal adversarial perturbations against vision-language pre-training models,” preprint arXiv:2406.05491, 2024
[39] P.-F. Zhang, Z. Huang, and G. Bai, “Universal adversarial perturbations for vision-language pre-trained models,” in ACM SIGIR, 2024, pp. 862–871
[40] F. Croce and M. Hein, “Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks,” in ICML, 2020
[41] Y. Wang, W. Hu, Y. Dong, H. Zhang, H. Su, and R. Hong, “Exploring transferability of multimodal adversarial samples for vision-language pre-training models with contrastive learning,” preprint arXiv:2308.12636, 2023
[42] C. Mao, S. Geng, J. Yang, X. Wang, and C. Vondrick, “Understanding zero-shot adversarial robustness for large-scale models,” in ICLR, 2023
[43] C. Schlarmann, N. D. Singh, F. Croce, and M. Hein, “Robust CLIP: Unsupervised adversarial fine-tuning of vision embeddings for robust large vision-language models,” in ICML, 2024
[44] S. Wang, J. Zhang, Z. Yuan, and S. Shan, “Pre-trained model guided fine-tuning for zero-shot adversarial robustness,” in CVPR, 2024
[45] W. Zhou, S. Bai, Q. Zhao, and B. Chen, “Revisiting the adversarial robustness of vision language models: a multimodal perspective,” preprint arXiv:2404.19287, 2024
[46] X. Wang, K. Chen, X. Ma, Z. Chen, J. Chen, and Y.-G. Jiang, “AdvQDet: Detecting query-based adversarial attacks with adversarial contrastive prompt tuning,” in ACM MM, 2024
[47] N. Hussein, F. Shamshad, M. Naseer, and K. Nandakumar, “Promptsmooth: Certifying robustness of medical vision-language models via prompt learning,” in MICCAI, 2024
[48] H. Fan, Z. Ma, Y. Li, R. Tian, Y. Chen, and C. Gao, “Mixprompt: Enhancing generalizability and adversarial robustness for vision-language models via prompt fusion,” in ICIC, 2024
[49] G. Chen, W. Yao, X. Song, X. Li, Y. Rao, and K. Zhang, “Plot: Prompt learning with optimal transport for vision-language models,” preprint arXiv:2210.01253, 2022
[50] A. Bulat and G. Tzimiropoulos, “Lasp: Text-to-text optimization for language-aware soft prompting of vision & language models,” in CVPR, 2023
[51] Y. Du, T. Niu, and R. Zhao, “Mixture of prompts learning for vision-language models,” Frontiers in Artificial Intelligence, 2025
[52] R. Wang, S. An, M. Cheng, T. Zhou, S. J. Hwang, and C.-J. Hsieh, “One prompt is not enough: Automated construction of a mixture-of-expert prompts,” preprint arXiv:2407.00256, 2024
[53] J.-Y. Choi, J. Kim, J.-H. Park, W.-L. Mok, and S. Lee, “Smop: Towards efficient and effective prompt tuning with sparse mixture-of-prompts,” in EMNLP, 2023
[54] S. Zhao, Q. Zhu, S. Xiong, S. Ruan, Y. Fan, R. Duan, Q. Guo, and X. Wei, “Enhancing adversarial robustness of vision language models via adversarial mixture prompt tuning,” preprint arXiv:2505.17509, 2025
[55] S. Schneider, E. Rusak, L. Eck, O. Bringmann, W. Brendel, and M. Bethge, “Improving robustness against common corruptions by covariate shift adaptation,” NeurIPS, 2020
[56] Z. Nado, S. Padhy, D. Sculley, A. D’Amour, B. Lakshminarayanan, and J. Snoek, “Evaluating prediction-time batch normalization for robustness under covariate shift,” preprint arXiv:2006.10963, 2020
[57] D. Wang, E. Shelhamer, S. Liu, B. Olshausen, and T. Darrell, “Tent: Fully test-time adaptation by entropy minimization,” in ICLR, 2021
[58] M. Zhang, S. Levine, and C. Finn, “Memo: Test time robustness via adaptation and augmentation,” in NeurIPS, 2022
[59] Q. Wang, O. Fink, L. Van Gool, and D. Dai, “Continual test-time domain adaptation,” in CVPR, 2022
[60] S. Niu, J. Wu, Y. Zhang, Y. Chen, S. Zheng, P. Zhao, and M. Tan, “Efficient test-time model adaptation without forgetting,” in ICML, 2022
[61]

Test-time prompt tuning for zero-shot generalization in vision-language models,

M. Shu, W. Nie, D.-A. Huang, Z. Yu, T. Goldstein, A. Anandkumar, and C. Xiao, “Test-time prompt tuning for zero-shot generalization in vision-language models,” inNeurIPS, 2022

[62]

J. Abdul Samadh, M. H. Gani, N. Hussein, M. U. Khattak, M. M. Naseer, F. Shahbaz Khan, and S. H. Khan, “Align your prompts: Test-time prompting with distribution alignment for zero-shot generalization,” in NeurIPS, 2024

[63]

M. Zanella and I. Ben Ayed, “On the test-time zero-shot generalization of vision-language models: Do we really need prompt learning?” in CVPR, 2024

[64]

L. Sheng, J. Liang, Z. Wang, and R. He, “R-tpt: Improving adversarial robustness of vision-language models through test-time prompt tuning,” in CVPR, 2025

[65]

S. Xing, Z. Zhao, and N. Sebe, “Clip is strong enough to fight back: Test-time counterattacks towards zero-shot adversarial robustness of clip,” in CVPR, 2025

[66]

H. Bahng, A. Jahanian, S. Sankaranarayanan, and P. Isola, “Exploring visual prompts for adapting large-scale models,” preprint arXiv:2203.17274, 2022

[67]

N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” in ICLR, 2017

[68]

W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” JMLR, 2022

[69]

R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural Computation, 1991

[70]

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” IJCV, 2015

[71]

L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories,” in CVPR Workshops, 2004

[72]

M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, “Describing textures in the wild,” in CVPR, 2014

[73]

P. Helber, B. Bischke, A. Dengel, and D. Borth, “Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification,” IEEE J-STARS, 2019

[74]

O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar, “Cats and dogs,” in CVPR, 2012

[75]

S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi, “Fine-grained visual classification of aircraft,” preprint arXiv:1306.5151, 2013

[76]

L. Bossard, M. Guillaumin, and L. Van Gool, “Food-101–mining discriminative components with random forests,” in ECCV, 2014

[77]

M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” in ICVGIP, 2008

[78]

J. Krause, M. Stark, J. Deng, and L. Fei-Fei, “3d object representations for fine-grained categorization,” in ICCV Workshops, 2013

[79]

J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “Sun database: Large-scale scene recognition from abbey to zoo,” in CVPR, 2010

[80]

K. Soomro, “Ucf101: A dataset of 101 human actions classes from videos in the wild,” preprint arXiv:1212.0402, 2012

