TAME: Test-Time Adversarial Prompt Tuning via Mixture-of-Experts for Vision-Language Models
Pith reviewed 2026-05-20 14:16 UTC · model grok-4.3
The pith
TAME replaces single prompts with an input-routed mixture of expert prompts tuned at test time via three unsupervised objectives to defend vision-language models against adversarial attacks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TAME reformulates test-time adversarial prompt tuning by replacing a single adaptive prompt with an input-conditioned Mixture-of-Experts (MoE) framework. It maintains a bank of learnable expert prompts and uses an input-dependent routing mechanism to aggregate a customized prompt mixture for each unlabeled test sample at inference time. Adaptation is driven by three unsupervised objectives: multi-view prediction entropy minimization, layer-wise alignment of visual token statistics to precomputed clean and adversarial reference distributions, and MoE regularization for balanced expert utilization and prompt diversity.
What carries the argument
Input-conditioned Mixture-of-Experts routing that aggregates a bank of expert prompts into a sample-specific defense prompt using three unsupervised objectives.
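The routing idea can be illustrated with a minimal sketch. It assumes a linear router followed by a softmax and a dense weighted sum over all experts; the source does not specify the router architecture or whether routing is sparse, so `route_prompts`, `router_weights`, and the toy values below are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_prompts(image_feature, router_weights, expert_prompts):
    """Mix expert prompts with gate weights computed from the input feature.

    image_feature:  list of D floats (hypothetical image embedding)
    router_weights: K rows of D floats (assumed linear router, one row per expert)
    expert_prompts: K prompts, each flattened to a list of P floats
    Returns (gates, mixed_prompt).
    """
    logits = [sum(w * x for w, x in zip(row, image_feature)) for row in router_weights]
    gates = softmax(logits)
    prompt_len = len(expert_prompts[0])
    mixed = [sum(g * p[i] for g, p in zip(gates, expert_prompts))
             for i in range(prompt_len)]
    return gates, mixed

# Toy example: three experts, a 2-D feature, and one-hot "prompts".
feature = [1.0, 0.0]
router = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
experts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
gates, mixed = route_prompts(feature, router, experts)
```

With one-hot prompts the mixed prompt simply equals the gate vector, which makes it easy to see that the first expert dominates for this input while the other two share the remaining weight equally.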
If this is right
- TAME raises zero-shot adversarial robustness of the original CLIP by at least 49.1 percent under AutoAttack.
- Clean-sample generalization is largely preserved across the evaluated datasets.
- The approach outperforms prior adversarial prompt tuning methods by an average of at least 30.2 percent across multiple prompt designs.
- Consistent gains appear on ImageNet and ten additional zero-shot datasets.
Where Pith is reading between the lines
- The routing mechanism could allow experts to specialize on different input distributions or attack patterns encountered after deployment.
- Similar test-time expert mixtures might transfer to other multimodal models that face robustness issues in open settings.
- The absence of labels at tuning time points toward potential use in streaming or continually changing environments.
Load-bearing premise
The three unsupervised objectives suffice to produce effective customized prompts from unlabeled test samples without any task labels or supervision.
What would settle it
The claim would be undermined if running TAME on ImageNet or another benchmark under AutoAttack yielded a robustness improvement below 20 percent, or a clear drop in clean-sample accuracy relative to the original CLIP model.
Original abstract
Large-scale pre-trained Vision-Language models (VLMs), such as CLIP, exhibit strong zero-shot generalization, yet remain highly vulnerable to imperceptible adversarial perturbations, raising serious safety concerns for open-world deployment. To enhance robustness without requiring downstream task-specific retraining, we propose TAME, a novel test-time defense. Building upon our prior Test-Time Adversarial Prompt Tuning (TAPT), TAME introduces an architectural reformulation by replacing TAPT's single adaptive prompt with an input-conditioned Mixture-of-Experts (MoE) framework, enabling more expressive and adaptive defense. Specifically, TAME maintains a bank of learnable expert prompts and employs an input-dependent routing mechanism to aggregate a customized prompt mixture for each unlabeled test sample at inference time. This test-time defense mechanism is driven by three unsupervised objectives: (1) multi-view prediction entropy minimization, (2) layer-wise alignment of visual token statistics to precomputed clean and adversarial reference distributions, and (3) MoE regularization for balanced expert utilization and prompt diversity. We evaluated TAME on 11 benchmark datasets, including ImageNet and 10 additional zero-shot datasets. The results show that TAME improves the zero-shot adversarial robustness of the original CLIP by at least 49.1% under AutoAttack while largely preserving generalization on clean samples. TAME also consistently outperforms existing adversarial prompt tuning methods across multiple prompt designs, yielding an average robustness gain of at least 30.2%.
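The first objective, multi-view prediction entropy minimization, can be sketched as follows. This assumes the common formulation in which class probabilities are averaged over augmented views before the entropy is taken; the abstract does not say whether TAME averages first, filters views by confidence, or does something else, so `marginal_entropy` and the toy logits are illustrative only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def marginal_entropy(view_logits):
    """Entropy of the class distribution averaged over augmented views.

    view_logits: one list of class logits per augmented view of the same image.
    """
    probs = [softmax(logits) for logits in view_logits]
    n_views = len(probs)
    n_classes = len(probs[0])
    avg = [sum(p[c] for p in probs) / n_views for c in range(n_classes)]
    return -sum(p * math.log(p) for p in avg if p > 0.0)

# Views that agree on class 0 give low entropy; views that disagree give high entropy.
agree = [[4.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
disagree = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0]]
low = marginal_entropy(agree)
high = marginal_entropy(disagree)
```

Minimizing this quantity with respect to the prompt parameters pushes the model toward predictions that are both confident and consistent across views, without needing a label.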
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TAME, an extension of prior Test-Time Adversarial Prompt Tuning (TAPT) that replaces a single adaptive prompt with an input-conditioned Mixture-of-Experts (MoE) bank of learnable expert prompts. For each unlabeled test sample, an input-dependent router aggregates a customized prompt mixture. Adaptation is driven by three unsupervised objectives: multi-view prediction entropy minimization, layer-wise alignment of visual token statistics to clean and adversarial reference distributions, and MoE regularization for expert utilization and diversity. On 11 benchmark datasets (ImageNet plus 10 additional zero-shot datasets), TAME is reported to raise CLIP's adversarial robustness by at least 49.1% under AutoAttack while largely preserving clean accuracy, and to outperform prior adversarial prompt-tuning baselines by an average of at least 30.2%.
Significance. If the reported robustness improvements are shown to survive adaptive attacks that target the test-time routing and optimization steps themselves, the work would offer a practical route to hardening zero-shot VLMs for open-world deployment without downstream task-specific retraining. The MoE reformulation and the three unsupervised objectives constitute a concrete architectural and objective-level advance over single-prompt test-time tuning; the multi-dataset empirical evaluation is a strength.
major comments (2)
- [Evaluation section] The headline 49.1% robustness gain is measured by applying standard AutoAttack to the final tuned prompt. Because prompt selection, routing, and the three unsupervised objectives are executed at inference time on each unlabeled sample, a white-box adversary can in principle differentiate through the entire adaptation pipeline. The manuscript does not report results under such adaptive attacks, leaving the central robustness claim dependent on an untested assumption that the test-time process is not itself exploitable.
- [Experiments] No statistical significance, confidence intervals, or multiple random seeds are reported for the robustness gains across the 11 datasets. Without these, it is impossible to determine whether the observed improvements over baselines are reliable or could be explained by optimization variance in the test-time procedure.
minor comments (2)
- [Method] The description of the three unsupervised objectives would benefit from explicit loss equations or pseudocode to clarify how the layer-wise alignment and MoE regularization terms are combined with the entropy minimization objective.
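One plausible way the three terms could be combined is sketched below. This is an assumed form, not the paper's stated loss: the weights $\lambda_{\text{align}}$ and $\lambda_{\text{moe}}$, the L1 distance on per-layer token means and variances, the mixing weights $\alpha_r$ over the clean and adversarial references, and the split of the MoE term into balance and diversity parts are all hypothetical choices.

```latex
% Assumed overall objective for one test sample x with N augmented views A_n(x)
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{ent}}
  + \lambda_{\text{align}}\,\mathcal{L}_{\text{align}}
  + \lambda_{\text{moe}}\,\mathcal{L}_{\text{moe}}

% Multi-view entropy: entropy of the view-averaged prediction
\bar{p}_c = \frac{1}{N}\sum_{n=1}^{N} p\bigl(c \mid A_n(x)\bigr),
\qquad
\mathcal{L}_{\text{ent}} = -\sum_{c} \bar{p}_c \log \bar{p}_c

% Layer-wise alignment of visual token statistics to reference statistics,
% with r ranging over the clean and adversarial references
\mathcal{L}_{\text{align}}
  = \sum_{r \in \{\text{clean},\,\text{adv}\}} \alpha_r
    \sum_{l=1}^{L}
    \Bigl( \bigl\lVert \mu_l(x) - \mu_l^{(r)} \bigr\rVert_1
         + \bigl\lVert \sigma_l^2(x) - \sigma_l^{2\,(r)} \bigr\rVert_1 \Bigr)

% MoE regularization: balanced expert utilization plus prompt diversity
\mathcal{L}_{\text{moe}} = \mathcal{L}_{\text{balance}} + \mathcal{L}_{\text{div}}
```

Here $\mu_l(x)$ and $\sigma_l^2(x)$ denote the mean and variance of the visual tokens at layer $l$ for the test sample, and $\mu_l^{(r)}$, $\sigma_l^{2\,(r)}$ the corresponding precomputed reference statistics.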
- [Tables] Table captions and axis labels should explicitly state whether reported numbers are means over multiple runs or single-run results.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the work.
- Referee: [Evaluation section] The headline 49.1% robustness gain is measured by applying standard AutoAttack to the final tuned prompt. Because prompt selection, routing, and the three unsupervised objectives are executed at inference time on each unlabeled sample, a white-box adversary can in principle differentiate through the entire adaptation pipeline. The manuscript does not report results under such adaptive attacks, leaving the central robustness claim dependent on an untested assumption that the test-time process is not itself exploitable.
  Authors: We agree that a fully adaptive white-box attack capable of differentiating through the input-dependent router, prompt aggregation, and the three unsupervised objectives would constitute a stronger evaluation. Our current results follow the standard AutoAttack protocol used in prior test-time and prompt-tuning robustness papers. To address this point directly, we will add a new set of experiments in the revised manuscript that evaluate TAME under adaptive attacks targeting the full test-time pipeline. We will also discuss the computational trade-offs involved in such attacks.
  Revision: yes
- Referee: [Experiments] No statistical significance, confidence intervals, or multiple random seeds are reported for the robustness gains across the 11 datasets. Without these, it is impossible to determine whether the observed improvements over baselines are reliable or could be explained by optimization variance in the test-time procedure.
  Authors: We acknowledge that reporting variability across multiple random seeds and providing confidence intervals would improve the reliability assessment of the reported gains. The original experiments used a fixed seed for reproducibility across the 11 datasets, but multiple independent runs were not performed or reported. In the revision we will rerun the main experiments with several random seeds, report mean performance with standard deviations, and include confidence intervals for the key robustness metrics.
  Revision: yes
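The promised multi-seed reporting could be computed along these lines. This is a minimal sketch using only the Python standard library; `summarize_runs` is a hypothetical helper, the accuracy values are made up for illustration, and the interval assumes approximately normal run-to-run variation.

```python
import math
import statistics

# Two-sided 95% Student-t critical values, keyed by degrees of freedom (n - 1).
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}

def summarize_runs(accuracies):
    """Return (mean, sample std, 95% confidence interval) over per-seed results."""
    n = len(accuracies)
    if n < 2:
        raise ValueError("need at least two runs to estimate variability")
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies)  # sample standard deviation (n - 1 denominator)
    # Fall back to the normal value 1.96 beyond the table; this is an approximation.
    t_crit = T_CRIT_95.get(n - 1, 1.96)
    half_width = t_crit * std / math.sqrt(n)
    return mean, std, (mean - half_width, mean + half_width)

# Made-up robust accuracies (percent) from three seeds, for illustration only.
runs = [50.0, 52.0, 54.0]
mean, std, ci = summarize_runs(runs)
```

With only a handful of seeds the t-based interval is considerably wider than a normal-approximation interval, which is why the small-sample critical values are tabulated explicitly.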
Circularity Check
No significant circularity; empirical claims rest on external benchmarks rather than self-referential derivations.
The paper introduces TAME as an architectural extension of the authors' prior TAPT method, using an MoE framework with three unsupervised objectives (entropy minimization, layer-wise token alignment, and regularization) for test-time prompt adaptation on unlabeled samples. Central performance claims, such as at least 49.1% robustness improvement under AutoAttack while preserving clean accuracy, are derived from direct empirical evaluations across 11 standard zero-shot datasets including ImageNet. No equations, predictions, or uniqueness arguments reduce these gains to quantities defined solely by internal fits, self-citations, or ansatzes; the results are measured against independent external attack methods and datasets, rendering the derivation chain self-contained.
Axiom & Free-Parameter Ledger
free parameters (2)
- Number of experts in MoE bank
- Routing network parameters
axioms (1)
- domain assumption: Unsupervised objectives suffice to learn robust prompts from unlabeled test samples
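To make the role of the expert count and routing parameters concrete, the sketch below shows one way an MoE regularizer for balanced utilization and prompt diversity might be written. Both penalties are assumed forms chosen for illustration (KL divergence of the mean gate distribution from uniform, and mean pairwise cosine similarity between expert prompts); the source does not give the actual regularizer, and the function names are hypothetical.

```python
import math

def load_balance_penalty(gate_batch):
    """KL divergence of the mean gate distribution from uniform.

    gate_batch: list of gate vectors (each sums to 1), e.g. one per augmented view.
    Returns 0 when all K experts are used equally and log(K) when one expert takes all.
    """
    k = len(gate_batch[0])
    mean_gate = [sum(g[i] for g in gate_batch) / len(gate_batch) for i in range(k)]
    return sum(p * math.log(p * k) for p in mean_gate if p > 0.0)

def diversity_penalty(expert_prompts):
    """Mean pairwise cosine similarity between expert prompts (lower is more diverse).

    Assumes at least two experts and that every prompt vector has nonzero norm.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    k = len(expert_prompts)
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    return sum(cosine(expert_prompts[i], expert_prompts[j]) for i, j in pairs) / len(pairs)

balanced = load_balance_penalty([[0.5, 0.5], [0.5, 0.5]])
collapsed = load_balance_penalty([[1.0, 0.0], [1.0, 0.0]])
orthogonal = diversity_penalty([[1.0, 0.0], [0.0, 1.0]])
identical = diversity_penalty([[1.0, 0.0], [1.0, 0.0]])
```

Under these assumed forms, the balance penalty grows with the number of experts when routing collapses onto a single expert, which is one reason the expert count behaves as a free parameter worth reporting.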
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: TAME improves the zero-shot adversarial robustness of the original CLIP by at least 49.1% under AutoAttack while largely preserving generalization on clean samples.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.