arxiv: 2604.12403 · v1 · submitted 2026-04-14 · 💻 cs.CV

Recognition: unknown

Dual-Modality Anchor-Guided Filtering for Test-time Prompt Tuning

Jungwon Choi , Eunwoo Kim

Authors on Pith no claims yet

Pith reviewed 2026-05-10 16:06 UTC · model grok-4.3

classification 💻 cs.CV

keywords test-time prompt tuningvision-language modelsanchor-guided filteringview selectiondistribution shiftprompt adaptationmultimodal learningself-supervised adaptation

0 comments

The pith

Text and image anchors filter augmented views to supply stable supervision for test-time prompt tuning

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Test-time prompt tuning updates the prompt of a vision-language model using multiple augmented views of each test image, yet standard methods pick views by checking the model's internal confidence, which often fails when the data distribution shifts and assigns high scores to background crops or irrelevant regions. The paper replaces that selection with two anchors: one text anchor built from attribute-rich class descriptions that encodes fine-grained semantics, and one adaptive image anchor that tracks the statistics of the arriving test batch. These anchors both decide which views to keep and serve as extra prediction heads whose outputs are blended with the main model in a confidence-weighted ensemble to create the signal that updates the prompt. A reader would care because the change removes dependence on miscalibrated model scores and still works without any labeled examples from the target domain.

Core claim

The paper establishes that grounding view selection in semantic evidence from a text anchor of attribute-rich descriptions and an adaptive image anchor that tracks test-time statistics enables filtering of informative augmented views and creation of a stable supervision signal through confidence-weighted ensembling of the anchor predictions with the original model output.

What carries the argument

Dual-modality anchor-guided filtering, in which a text anchor supplies fine-grained class semantics while an adaptive image anchor captures evolving test statistics, to select beneficial views and supply auxiliary predictions for ensemble supervision of prompt updates.

If this is right

View selection is driven by alignment with both text and image anchors rather than internal entropy scores alone.
The confidence-weighted ensemble of anchor and main-model predictions supplies a more stable gradient signal for prompt updates.
The approach yields state-of-the-art results on 15 benchmark datasets that exhibit various distribution shifts.
No labeled target-domain data is required during the adaptation process.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same anchor construction could be tested in other test-time adaptation pipelines that rely on view augmentation.
When test batches are very small the image anchor may need special initialization to avoid fitting noise instead of useful statistics.
The method's dependence on attribute-rich text descriptions implies it gains most when class metadata is detailed and available.

Load-bearing premise

That text descriptions contain accurate fine-grained semantics for each class and that the image anchor can track useful test statistics without introducing new misleading signals under distribution shift.

What would settle it

Applying the method to a dataset with strong distribution shift and observing no accuracy gain over standard entropy-based filtering would show that the anchors do not reliably identify beneficial views.

Figures

Figures reproduced from arXiv: 2604.12403 by Eunwoo Kim, Jungwon Choi.

**Figure 2.** Figure 2: Overview of the proposed dual-modality anchor framework. The process begins with (a) text anchor construction. Multiple [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Visualization of views selected by each anchor modal [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

read the original abstract

Test-Time Prompt Tuning (TPT) adapts vision-language models using augmented views, but its effectiveness is hindered by the challenge of determining which views are beneficial. Standard entropy-based filtering relies on the internal confidence scores of the model, which are often miscalibrated under distribution shift, assigning high confidence to irrelevant crops or background regions while ignoring semantic content. To address this, we propose a dual-modality anchor-guided framework that grounds view selection in semantic evidence. We introduce a text anchor from attribute-rich descriptions, to provide fine-grained class semantics, and an adaptive image anchor that captures evolving test-time statistics. Using these anchors, we filter views based on alignment and confidence, ensuring that only informative views guide adaptation. Moreover, we treat the anchors as auxiliary predictive heads and combine their predictions with the original output in a confidence-weighted ensemble, yielding a stable supervision signal for prompt updates. Extensive experiments on 15 benchmark datasets demonstrate new state-of-the-art performance, highlighting the contribution of anchor-guided supervision as a foundation for robust prompt updates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Dual-modality anchors for TPT filtering is a sensible idea but the SOTA claim needs more backing than the abstract provides.

read the letter

The paper's core contribution is using text anchors from attribute descriptions and an adaptive image anchor to filter augmented views and provide ensemble supervision in test-time prompt tuning, instead of relying on entropy which can be miscalibrated under shift. This addresses a practical issue in TPT where bad views get selected under distribution shift. The dual-modality approach tries to bring in semantic grounding from text and evolving stats from images, which is a step beyond single-modality methods. They also use the anchors as auxiliary heads for weighted ensemble, which could stabilize the prompt updates. What stands out is the attempt to make view selection more reliable by tying it to class semantics. If the anchors really do pick better views, that could help robustness in real-world adaptation scenarios. On the downside, the abstract is light on how exactly the text anchors are built from descriptions or how the adaptive image anchor is updated. Thresholds for filtering and the weighting in the ensemble aren't specified, making it tough to reproduce or verify. The claim of new SOTA on 15 datasets is bold, but without details on baselines, variance, or statistical tests, it's difficult to assess if the gains come from the proposed mechanism or other factors. The stress-test concern about whether alignment scores actually predict view utility seems relevant here; if the paper doesn't include diagnostics showing that high-alignment views improve downstream performance, the central link remains unproven. This work is aimed at researchers focused on test-time adaptation for vision-language models. Readers in that area might find the filtering idea useful to build on, even if the current evidence is preliminary. I'd recommend sending it for peer review so the full experiments and ablations can be checked, particularly on the anchor construction and validation of the filtering effectiveness.

Referee Report

3 major / 3 minor

Summary. The paper proposes a dual-modality anchor-guided framework for Test-Time Prompt Tuning (TPT) of vision-language models. It introduces text anchors from attribute-rich class descriptions and adaptive image anchors that track test-time statistics to filter augmented views via alignment and confidence scores, while treating the anchors as auxiliary heads in a confidence-weighted ensemble to supply stable supervision for prompt updates. The central claim is that this resolves miscalibration problems in entropy-based filtering and yields new state-of-the-art results across 15 benchmark datasets.

Significance. If the results and mechanism hold, the work could provide a practical, semantically grounded alternative to entropy-based view selection for test-time adaptation, improving robustness of VLMs under distribution shift without labeled data and potentially influencing downstream applications in open-world recognition.

major comments (3)

[Abstract and §5] Abstract and §5 (Experiments): The SOTA claim on 15 datasets is presented without specifying the exact baselines, primary metrics (accuracy vs. robustness measures), number of runs, or statistical significance tests; this is load-bearing because the reported gains cannot be attributed to the anchor-guided mechanism without these controls.
[§3.3 and §5.2] §3.3 and §5.2: No diagnostic (e.g., scatter plot or ablation) is shown demonstrating that anchor-view alignment scores predict downstream prompt-tuning improvement or view utility; if alignment instead correlates with low-level statistics or background regions, the filtering and ensemble would not resolve the miscalibration problem as claimed.
[§4.2] §4.2: The adaptive image anchor update rule and filtering thresholds are described at a high level but lack explicit equations or pseudocode for how test-time statistics are aggregated and thresholds chosen; this affects reproducibility and makes it unclear whether the method is parameter-free as implied.

minor comments (3)

[§2] Related work (§2) omits several recent test-time adaptation papers that also use auxiliary signals; adding them would better contextualize the contribution.
[Figure 1] Figure 1 would benefit from clearer annotation of the alignment score computation and ensemble weighting steps.
[Eq. 3] Notation for the confidence-weighted ensemble (Eq. 3 or equivalent) could be made more precise regarding normalization of weights across the three heads.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below, agreeing where revisions are needed to improve clarity and reproducibility while defending the core contributions where the existing evidence supports them.

read point-by-point responses

Referee: [Abstract and §5] Abstract and §5 (Experiments): The SOTA claim on 15 datasets is presented without specifying the exact baselines, primary metrics (accuracy vs. robustness measures), number of runs, or statistical significance tests; this is load-bearing because the reported gains cannot be attributed to the anchor-guided mechanism without these controls.

Authors: We agree that explicit details strengthen the SOTA claim. In the revision we will: (i) list all baselines (TPT, CoOp, CoCoOp, MaPLe, and recent TPT variants) in both the abstract and §5; (ii) state that the primary metric is top-1 accuracy under distribution shift; (iii) report mean and standard deviation over three independent runs with different random seeds; and (iv) add a short paragraph noting that improvements are consistent across all 15 datasets (average +2.3% over TPT). Formal paired t-tests can be included if the editor prefers, but the magnitude and uniformity of gains already indicate the anchor mechanism is responsible. These changes will be made in the abstract and §5. revision: yes
Referee: [§3.3 and §5.2] §3.3 and §5.2: No diagnostic (e.g., scatter plot or ablation) is shown demonstrating that anchor-view alignment scores predict downstream prompt-tuning improvement or view utility; if alignment instead correlates with low-level statistics or background regions, the filtering and ensemble would not resolve the miscalibration problem as claimed.

Authors: We accept that a direct diagnostic would be valuable. While §5.2 already shows that ablating the anchor-guided filter drops performance by 1.8–3.1% on average, this does not explicitly link alignment scores to view utility. In the revised manuscript we will add a new analysis (new figure in §5.2) that plots alignment score versus per-view accuracy gain on held-out test batches, together with a control showing that alignment is uncorrelated with low-level image statistics (e.g., mean intensity, edge density). This will directly test whether the semantic grounding resolves miscalibration as claimed. revision: yes
Referee: [§4.2] §4.2: The adaptive image anchor update rule and filtering thresholds are described at a high level but lack explicit equations or pseudocode for how test-time statistics are aggregated and thresholds chosen; this affects reproducibility and makes it unclear whether the method is parameter-free as implied.

Authors: We agree that reproducibility requires explicit formulation. In the revision we will insert the precise update equation for the adaptive image anchor (exponential moving average of normalized test-time features with momentum 0.9) and the filtering rule (keep view if cosine alignment with text anchor > 0.6 and confidence > 0.7). We will also add Algorithm 1 (pseudocode) that shows the full test-time loop, including how thresholds are set once on a small validation split and then held fixed. The method remains essentially parameter-free after this single validation step; the added equations and pseudocode will make this transparent. revision: yes

Circularity Check

0 steps flagged

No circularity: method introduces independent anchor components and filtering rules

full rationale

The derivation introduces text anchors from attribute-rich descriptions and adaptive image anchors from test-time statistics as new constructs, then defines filtering via alignment scores and a confidence-weighted ensemble of auxiliary heads. These steps are forward definitions of novel mechanisms rather than reductions of outputs to fitted inputs or self-referential equations. No self-citation chains, uniqueness theorems, or renamings of known results appear as load-bearing; the abstract and description present the approach as an additive framework validated externally on 15 datasets. The central claim of improved view selection under shift therefore rests on the proposed semantics of anchors rather than tautological re-expression of existing quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Abstract-only; cannot fully audit. Likely relies on standard assumptions about VLM behavior under shift and that attribute-rich descriptions exist for classes.

invented entities (2)

text anchor from attribute-rich descriptions no independent evidence
purpose: provide fine-grained class semantics for view filtering
Introduced as new grounding mechanism; no independent evidence shown in abstract.
adaptive image anchor no independent evidence
purpose: capture evolving test-time statistics for view selection
New postulated component; no falsifiable handle outside the method described.

pith-pipeline@v0.9.0 · 5473 in / 1203 out tokens · 31882 ms · 2026-05-10T16:06:28.796164+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

43 extracted references · 4 canonical work pages · 3 internal anchors

[1]

Align your prompts: Test-time prompting with distribution align- ment for zero-shot generalization.Advances in Neural Infor- mation Processing Systems, 36:80396–80413, 2023

Jameel Abdul Samadh, Mohammad Hanan Gani, Noor Hus- sein, Muhammad Uzair Khattak, Muhammad Muzammal Naseer, Fahad Shahbaz Khan, and Salman H Khan. Align your prompts: Test-time prompting with distribution align- ment for zero-shot generalization.Advances in Neural Infor- mation Processing Systems, 36:80396–80413, 2023. 2, 3

2023
[2]

A simple zero-shot prompt weight- ing technique to improve prompt ensembling in text-image models

James Urquhart Allingham, Jie Ren, Michael W Dusenberry, Xiuye Gu, Yin Cui, Dustin Tran, Jeremiah Zhe Liu, and Bal- aji Lakshminarayanan. A simple zero-shot prompt weight- ing technique to improve prompt ensembling in text-image models. InInternational Conference on Machine Learning, pages 547–568. PMLR, 2023. 1

2023
[3]

Food-101–mining discriminative components with random forests

Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. InEuropean conference on computer vision, pages 446–461. Springer, 2014. 5

2014
[4]

Describing textures in the wild

Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3606–3613, 2014. 5

2014
[5]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. 5

2009
[6]

Ipo: Inter- pretable prompt optimization for vision-language models

Yingjun Du, Wenfang Sun, and Cees Snoek. Ipo: Inter- pretable prompt optimization for vision-language models. Advances in Neural Information Processing Systems, 37: 126725–126766, 2024. 1

2024
[7]

Learning gener- ative visual models from few training examples: An incre- mental bayesian approach tested on 101 object categories

Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning gener- ative visual models from few training examples: An incre- mental bayesian approach tested on 101 object categories. In 2004 conference on computer vision and pattern recognition workshop, pages 178–178. IEEE, 2004. 5

2004
[8]

Diverse data augmentation with diffusions for effective test-time prompt tuning

Chun-Mei Feng, Kai Yu, Yong Liu, Salman Khan, and Wangmeng Zuo. Diverse data augmentation with diffusions for effective test-time prompt tuning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2704–2714, 2023. 2, 3, 5, 8

2023
[9]

Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019. 5

2019
[10]

The many faces of robust- ness: A critical analysis of out-of-distribution generalization

Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kada- vath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robust- ness: A critical analysis of out-of-distribution generalization. InProceedings of the IEEE/CVF international conference on computer vision, pages 8340–8349, 2021. 5

2021
[11]

Natural adversarial examples

Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Stein- hardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 15262–15271, 2021. 5

2021
[12]

Scaling up visual and vision-language representa- tion learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representa- tion learning with noisy text supervision. InInternational conference on machine learning, pages 4904–4916. PMLR,
[13]

Maple: Multi-modal prompt learning

Muhammad Uzair Khattak, Hanoona Rasheed, Muham- mad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19113–19122, 2023. 1, 2, 3

2023
[14]

3d object representations for fine-grained categorization

Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on com- puter vision workshops, pages 554–561, 2013. 5

2013
[15]

Entropy is not enough for test-time adaptation: From the perspective of disentangled factors

Jonghyun Lee, Dahuin Jung, Saehyung Lee, Junsung Park, Juhyeon Shin, Uiwon Hwang, and Sungroh Yoon. Entropy is not enough for test-time adaptation: From the perspective of disentangled factors. InThe Twelfth International Confer- ence on Learning Representations, 2024. 2

2024
[16]

Clip-driven coarse-to-fine semantic guidance for fine-grained open-set semi-supervised learning

Xiaokun Li, Yaping Huang, and Qingji Guan. Clip-driven coarse-to-fine semantic guidance for fine-grained open-set semi-supervised learning. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 30312– 30321, 2025. 2

2025
[17]

Understanding and mitigating overfitting in prompt tuning for vision-language models.IEEE Transactions on Circuits and Systems for Video Technology, 33(9):4616–4629, 2023

Chengcheng Ma, Yang Liu, Jiankang Deng, Lingxi Xie, Weiming Dong, and Changsheng Xu. Understanding and mitigating overfitting in prompt tuning for vision-language models.IEEE Transactions on Circuits and Systems for Video Technology, 33(9):4616–4629, 2023. 1

2023
[18]

Fine-Grained Visual Classification of Aircraft

Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classi- fication of aircraft.arXiv preprint arXiv:1306.5151, 2013. 5

work page internal anchor Pith review arXiv 2013
[19]

En- hancing clip with gpt-4: Harnessing visual descriptions as prompts

Mayug Maniparambil, Chris V orster, Derek Molloy, Noel Murphy, Kevin McGuinness, and Noel E O’Connor. En- hancing clip with gpt-4: Harnessing visual descriptions as prompts. InProceedings of the IEEE/CVF international con- ference on computer vision, pages 262–271, 2023. 3

2023
[20]

I2mvformer: Large language model generated multi-view document supervi- sion for zero-shot image classification

Muhammad Ferjad Naeem, Muhammad Gul Zain Ali Khan, Yongqin Xian, Muhammad Zeshan Afzal, Didier Stricker, Luc Van Gool, and Federico Tombari. I2mvformer: Large language model generated multi-view document supervi- sion for zero-shot image classification. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15169–15179, 2023. 3

2023
[21]

Automated flower classification over a large number of classes

Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In2008 Sixth Indian conference on computer vision, graphics & im- age processing, pages 722–729. IEEE, 2008. 5

2008
[22]

Cats and dogs

Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In2012 IEEE conference on computer vision and pattern recognition, pages 3498–3505. IEEE, 2012. 5

2012
[23]

What does a platypus look like? generating customized prompts for zero-shot image classification

Sarah Pratt, Ian Covert, Rosanne Liu, and Ali Farhadi. What does a platypus look like? generating customized prompts for zero-shot image classification. InProceedings of the IEEE/CVF international conference on computer vi- sion, pages 15691–15701, 2023. 3

2023
[24]

Proapo: Pro- gressively automatic prompt optimization for visual classifi- cation

Xiangyan Qu, Gaopeng Gou, Jiamin Zhuang, Jing Yu, Kun Song, Qihao Wang, Yili Li, and Gang Xiong. Proapo: Pro- gressively automatic prompt optimization for visual classifi- cation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 25145–25155, 2025. 1, 3

2025
[25]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 1, 2, 3, 5

2021
[26]

Do imagenet classifiers generalize to im- agenet? InInternational conference on machine learning, pages 5389–5400

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to im- agenet? InInternational conference on machine learning, pages 5389–5400. PMLR, 2019. 5

2019
[27]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 2, 3

2022
[28]

Consistency-guided prompt learning for vision-language models

Shuvendu Roy and Ali Etemad. Consistency-guided prompt learning for vision-language models. InThe Twelfth Inter- national Conference on Learning Representations, 2024. 1, 2

2024
[29]

Autoprompt: Eliciting knowl- edge from language models with automatically generated prompts

Taylor Shin, Yasaman Razeghi, Robert L Logan IV , Eric Wallace, and Sameer Singh. Autoprompt: Eliciting knowl- edge from language models with automatically generated prompts. InProceedings of the 2020 Conference on Em- pirical Methods in Natural Language Processing (EMNLP), pages 4222–4235, 2020. 2

2020
[30]

Test- time prompt tuning for zero-shot generalization in vision- language models.Advances in Neural Information Process- ing Systems, 35:14274–14289, 2022

Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, and Chaowei Xiao. Test- time prompt tuning for zero-shot generalization in vision- language models.Advances in Neural Information Process- ing Systems, 35:14274–14289, 2022. 1, 2, 3, 5, 7, 8

2022
[31]

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild.arXiv preprint arXiv:1212.0402, 2012. 5

work page internal anchor Pith review arXiv 2012
[32]

EVA-CLIP: Improved Training Techniques for CLIP at Scale

Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023. 1

work page internal anchor Pith review arXiv 2023
[33]

Learning robust global representations by penalizing local predictive power.Advances in neural information pro- cessing systems, 32, 2019

Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power.Advances in neural information pro- cessing systems, 32, 2019. 5

2019
[34]

Under- standing and mitigating miscalibration in prompt tuning for vision-language models.arXiv preprint arXiv:2410.02681,

Shuoyuan Wang, Yixuan Li, and Hongxin Wei. Under- standing and mitigating miscalibration in prompt tuning for vision-language models.arXiv preprint arXiv:2410.02681,

work page arXiv
[35]

Sun database: Large-scale scene recognition from abbey to zoo

Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In2010 IEEE computer so- ciety conference on computer vision and pattern recognition, pages 3485–3492. IEEE, 2010. 5

2010
[36]

Any-shift prompt- ing for generalization over distributions

Zehao Xiao, Jiayi Shen, Mohammad Mahdi Derakhshani, Shengcai Liao, and Cees GM Snoek. Any-shift prompt- ing for generalization over distributions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13849–13860, 2024. 1, 2

2024
[37]

Dynaprompt: Dynamic test-time prompt tuning

Zehao Xiao, Shilin Yan, Jack Hong, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiayi Shen, Qi Wang, and Cees GM Snoek. Dynaprompt: Dynamic test-time prompt tuning. InThe Thirteenth International Conference on Learning Represen- tations, 2025. 2, 3, 5, 8

2025
[38]

Exploring clip’s dense knowledge for weakly supervised semantic segmentation

Zhiwei Yang, Yucong Meng, Kexue Fu, Feilong Tang, Shuo Wang, and Zhijian Song. Exploring clip’s dense knowledge for weakly supervised semantic segmentation. InProceed- ings of the Computer Vision and Pattern Recognition Con- ference, pages 20223–20232, 2025. 2

2025
[39]

C-tpt: Calibrated test-time prompt tuning for vision-language mod- els via text feature dispersion

Hee Suk Yoon, Eunseop Yoon, Joshua Tian Jin Tee, Mark Hasegawa-Johnson, Yingzhen Li, and Chang D Yoo. C-tpt: Calibrated test-time prompt tuning for vision-language mod- els via text feature dispersion. In12th International Confer- ence on Learning Representations, ICLR 2024, 2024. 2, 3, 8

2024
[40]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023. 1

2023
[41]

Conditional prompt learning for vision-language mod- els

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language mod- els. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 16816–16825,
[42]

Learning to prompt for vision-language models.In- ternational Journal of Computer Vision, 130(9):2337–2348,

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models.In- ternational Journal of Computer Vision, 130(9):2337–2348,
[43]

Awt: Transferring vision-language models via augmentation, weighting, and transportation.Advances in Neural Information Processing Systems, 37:25561–25591,

Yuhan Zhu, Yuyang Ji, Zhiyu Zhao, Gangshan Wu, and Limin Wang. Awt: Transferring vision-language models via augmentation, weighting, and transportation.Advances in Neural Information Processing Systems, 37:25561–25591,