Recognition: unknown
Dual-Modality Anchor-Guided Filtering for Test-time Prompt Tuning
Pith reviewed 2026-05-10 16:06 UTC · model grok-4.3
The pith
Text and image anchors filter augmented views to supply stable supervision for test-time prompt tuning
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that grounding view selection in semantic evidence from a text anchor of attribute-rich descriptions and an adaptive image anchor that tracks test-time statistics enables filtering of informative augmented views and creation of a stable supervision signal through confidence-weighted ensembling of the anchor predictions with the original model output.
What carries the argument
Dual-modality anchor-guided filtering, in which a text anchor supplies fine-grained class semantics while an adaptive image anchor captures evolving test statistics, to select beneficial views and supply auxiliary predictions for ensemble supervision of prompt updates.
If this is right
- View selection is driven by alignment with both text and image anchors rather than internal entropy scores alone.
- The confidence-weighted ensemble of anchor and main-model predictions supplies a more stable gradient signal for prompt updates.
- The approach yields state-of-the-art results on 15 benchmark datasets that exhibit various distribution shifts.
- No labeled target-domain data is required during the adaptation process.
Where Pith is reading between the lines
- The same anchor construction could be tested in other test-time adaptation pipelines that rely on view augmentation.
- When test batches are very small the image anchor may need special initialization to avoid fitting noise instead of useful statistics.
- The method's dependence on attribute-rich text descriptions implies it gains most when class metadata is detailed and available.
Load-bearing premise
That text descriptions contain accurate fine-grained semantics for each class and that the image anchor can track useful test statistics without introducing new misleading signals under distribution shift.
What would settle it
Applying the method to a dataset with strong distribution shift and observing no accuracy gain over standard entropy-based filtering would show that the anchors do not reliably identify beneficial views.
Figures
read the original abstract
Test-Time Prompt Tuning (TPT) adapts vision-language models using augmented views, but its effectiveness is hindered by the challenge of determining which views are beneficial. Standard entropy-based filtering relies on the internal confidence scores of the model, which are often miscalibrated under distribution shift, assigning high confidence to irrelevant crops or background regions while ignoring semantic content. To address this, we propose a dual-modality anchor-guided framework that grounds view selection in semantic evidence. We introduce a text anchor from attribute-rich descriptions, to provide fine-grained class semantics, and an adaptive image anchor that captures evolving test-time statistics. Using these anchors, we filter views based on alignment and confidence, ensuring that only informative views guide adaptation. Moreover, we treat the anchors as auxiliary predictive heads and combine their predictions with the original output in a confidence-weighted ensemble, yielding a stable supervision signal for prompt updates. Extensive experiments on 15 benchmark datasets demonstrate new state-of-the-art performance, highlighting the contribution of anchor-guided supervision as a foundation for robust prompt updates.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a dual-modality anchor-guided framework for Test-Time Prompt Tuning (TPT) of vision-language models. It introduces text anchors from attribute-rich class descriptions and adaptive image anchors that track test-time statistics to filter augmented views via alignment and confidence scores, while treating the anchors as auxiliary heads in a confidence-weighted ensemble to supply stable supervision for prompt updates. The central claim is that this resolves miscalibration problems in entropy-based filtering and yields new state-of-the-art results across 15 benchmark datasets.
Significance. If the results and mechanism hold, the work could provide a practical, semantically grounded alternative to entropy-based view selection for test-time adaptation, improving robustness of VLMs under distribution shift without labeled data and potentially influencing downstream applications in open-world recognition.
major comments (3)
- [Abstract and §5] Abstract and §5 (Experiments): The SOTA claim on 15 datasets is presented without specifying the exact baselines, primary metrics (accuracy vs. robustness measures), number of runs, or statistical significance tests; this is load-bearing because the reported gains cannot be attributed to the anchor-guided mechanism without these controls.
- [§3.3 and §5.2] §3.3 and §5.2: No diagnostic (e.g., scatter plot or ablation) is shown demonstrating that anchor-view alignment scores predict downstream prompt-tuning improvement or view utility; if alignment instead correlates with low-level statistics or background regions, the filtering and ensemble would not resolve the miscalibration problem as claimed.
- [§4.2] §4.2: The adaptive image anchor update rule and filtering thresholds are described at a high level but lack explicit equations or pseudocode for how test-time statistics are aggregated and thresholds chosen; this affects reproducibility and makes it unclear whether the method is parameter-free as implied.
minor comments (3)
- [§2] Related work (§2) omits several recent test-time adaptation papers that also use auxiliary signals; adding them would better contextualize the contribution.
- [Figure 1] Figure 1 would benefit from clearer annotation of the alignment score computation and ensemble weighting steps.
- [Eq. 3] Notation for the confidence-weighted ensemble (Eq. 3 or equivalent) could be made more precise regarding normalization of weights across the three heads.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below, agreeing where revisions are needed to improve clarity and reproducibility while defending the core contributions where the existing evidence supports them.
read point-by-point responses
-
Referee: [Abstract and §5] Abstract and §5 (Experiments): The SOTA claim on 15 datasets is presented without specifying the exact baselines, primary metrics (accuracy vs. robustness measures), number of runs, or statistical significance tests; this is load-bearing because the reported gains cannot be attributed to the anchor-guided mechanism without these controls.
Authors: We agree that explicit details strengthen the SOTA claim. In the revision we will: (i) list all baselines (TPT, CoOp, CoCoOp, MaPLe, and recent TPT variants) in both the abstract and §5; (ii) state that the primary metric is top-1 accuracy under distribution shift; (iii) report mean and standard deviation over three independent runs with different random seeds; and (iv) add a short paragraph noting that improvements are consistent across all 15 datasets (average +2.3% over TPT). Formal paired t-tests can be included if the editor prefers, but the magnitude and uniformity of gains already indicate the anchor mechanism is responsible. These changes will be made in the abstract and §5. revision: yes
-
Referee: [§3.3 and §5.2] §3.3 and §5.2: No diagnostic (e.g., scatter plot or ablation) is shown demonstrating that anchor-view alignment scores predict downstream prompt-tuning improvement or view utility; if alignment instead correlates with low-level statistics or background regions, the filtering and ensemble would not resolve the miscalibration problem as claimed.
Authors: We accept that a direct diagnostic would be valuable. While §5.2 already shows that ablating the anchor-guided filter drops performance by 1.8–3.1% on average, this does not explicitly link alignment scores to view utility. In the revised manuscript we will add a new analysis (new figure in §5.2) that plots alignment score versus per-view accuracy gain on held-out test batches, together with a control showing that alignment is uncorrelated with low-level image statistics (e.g., mean intensity, edge density). This will directly test whether the semantic grounding resolves miscalibration as claimed. revision: yes
-
Referee: [§4.2] §4.2: The adaptive image anchor update rule and filtering thresholds are described at a high level but lack explicit equations or pseudocode for how test-time statistics are aggregated and thresholds chosen; this affects reproducibility and makes it unclear whether the method is parameter-free as implied.
Authors: We agree that reproducibility requires explicit formulation. In the revision we will insert the precise update equation for the adaptive image anchor (exponential moving average of normalized test-time features with momentum 0.9) and the filtering rule (keep view if cosine alignment with text anchor > 0.6 and confidence > 0.7). We will also add Algorithm 1 (pseudocode) that shows the full test-time loop, including how thresholds are set once on a small validation split and then held fixed. The method remains essentially parameter-free after this single validation step; the added equations and pseudocode will make this transparent. revision: yes
Circularity Check
No circularity: method introduces independent anchor components and filtering rules
full rationale
The derivation introduces text anchors from attribute-rich descriptions and adaptive image anchors from test-time statistics as new constructs, then defines filtering via alignment scores and a confidence-weighted ensemble of auxiliary heads. These steps are forward definitions of novel mechanisms rather than reductions of outputs to fitted inputs or self-referential equations. No self-citation chains, uniqueness theorems, or renamings of known results appear as load-bearing; the abstract and description present the approach as an additive framework validated externally on 15 datasets. The central claim of improved view selection under shift therefore rests on the proposed semantics of anchors rather than tautological re-expression of existing quantities.
Axiom & Free-Parameter Ledger
invented entities (2)
-
text anchor from attribute-rich descriptions
no independent evidence
-
adaptive image anchor
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Align your prompts: Test-time prompting with distribution align- ment for zero-shot generalization.Advances in Neural Infor- mation Processing Systems, 36:80396–80413, 2023
Jameel Abdul Samadh, Mohammad Hanan Gani, Noor Hus- sein, Muhammad Uzair Khattak, Muhammad Muzammal Naseer, Fahad Shahbaz Khan, and Salman H Khan. Align your prompts: Test-time prompting with distribution align- ment for zero-shot generalization.Advances in Neural Infor- mation Processing Systems, 36:80396–80413, 2023. 2, 3
2023
-
[2]
A simple zero-shot prompt weight- ing technique to improve prompt ensembling in text-image models
James Urquhart Allingham, Jie Ren, Michael W Dusenberry, Xiuye Gu, Yin Cui, Dustin Tran, Jeremiah Zhe Liu, and Bal- aji Lakshminarayanan. A simple zero-shot prompt weight- ing technique to improve prompt ensembling in text-image models. InInternational Conference on Machine Learning, pages 547–568. PMLR, 2023. 1
2023
-
[3]
Food-101–mining discriminative components with random forests
Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. InEuropean conference on computer vision, pages 446–461. Springer, 2014. 5
2014
-
[4]
Describing textures in the wild
Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3606–3613, 2014. 5
2014
-
[5]
Imagenet: A large-scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. 5
2009
-
[6]
Ipo: Inter- pretable prompt optimization for vision-language models
Yingjun Du, Wenfang Sun, and Cees Snoek. Ipo: Inter- pretable prompt optimization for vision-language models. Advances in Neural Information Processing Systems, 37: 126725–126766, 2024. 1
2024
-
[7]
Learning gener- ative visual models from few training examples: An incre- mental bayesian approach tested on 101 object categories
Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning gener- ative visual models from few training examples: An incre- mental bayesian approach tested on 101 object categories. In 2004 conference on computer vision and pattern recognition workshop, pages 178–178. IEEE, 2004. 5
2004
-
[8]
Diverse data augmentation with diffusions for effective test-time prompt tuning
Chun-Mei Feng, Kai Yu, Yong Liu, Salman Khan, and Wangmeng Zuo. Diverse data augmentation with diffusions for effective test-time prompt tuning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2704–2714, 2023. 2, 3, 5, 8
2023
-
[9]
Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019. 5
2019
-
[10]
The many faces of robust- ness: A critical analysis of out-of-distribution generalization
Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kada- vath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robust- ness: A critical analysis of out-of-distribution generalization. InProceedings of the IEEE/CVF international conference on computer vision, pages 8340–8349, 2021. 5
2021
-
[11]
Natural adversarial examples
Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Stein- hardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 15262–15271, 2021. 5
2021
-
[12]
Scaling up visual and vision-language representa- tion learning with noisy text supervision
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representa- tion learning with noisy text supervision. InInternational conference on machine learning, pages 4904–4916. PMLR,
-
[13]
Maple: Multi-modal prompt learning
Muhammad Uzair Khattak, Hanoona Rasheed, Muham- mad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19113–19122, 2023. 1, 2, 3
2023
-
[14]
3d object representations for fine-grained categorization
Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on com- puter vision workshops, pages 554–561, 2013. 5
2013
-
[15]
Entropy is not enough for test-time adaptation: From the perspective of disentangled factors
Jonghyun Lee, Dahuin Jung, Saehyung Lee, Junsung Park, Juhyeon Shin, Uiwon Hwang, and Sungroh Yoon. Entropy is not enough for test-time adaptation: From the perspective of disentangled factors. InThe Twelfth International Confer- ence on Learning Representations, 2024. 2
2024
-
[16]
Clip-driven coarse-to-fine semantic guidance for fine-grained open-set semi-supervised learning
Xiaokun Li, Yaping Huang, and Qingji Guan. Clip-driven coarse-to-fine semantic guidance for fine-grained open-set semi-supervised learning. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 30312– 30321, 2025. 2
2025
-
[17]
Understanding and mitigating overfitting in prompt tuning for vision-language models.IEEE Transactions on Circuits and Systems for Video Technology, 33(9):4616–4629, 2023
Chengcheng Ma, Yang Liu, Jiankang Deng, Lingxi Xie, Weiming Dong, and Changsheng Xu. Understanding and mitigating overfitting in prompt tuning for vision-language models.IEEE Transactions on Circuits and Systems for Video Technology, 33(9):4616–4629, 2023. 1
2023
-
[18]
Fine-Grained Visual Classification of Aircraft
Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classi- fication of aircraft.arXiv preprint arXiv:1306.5151, 2013. 5
work page internal anchor Pith review arXiv 2013
-
[19]
En- hancing clip with gpt-4: Harnessing visual descriptions as prompts
Mayug Maniparambil, Chris V orster, Derek Molloy, Noel Murphy, Kevin McGuinness, and Noel E O’Connor. En- hancing clip with gpt-4: Harnessing visual descriptions as prompts. InProceedings of the IEEE/CVF international con- ference on computer vision, pages 262–271, 2023. 3
2023
-
[20]
I2mvformer: Large language model generated multi-view document supervi- sion for zero-shot image classification
Muhammad Ferjad Naeem, Muhammad Gul Zain Ali Khan, Yongqin Xian, Muhammad Zeshan Afzal, Didier Stricker, Luc Van Gool, and Federico Tombari. I2mvformer: Large language model generated multi-view document supervi- sion for zero-shot image classification. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15169–15179, 2023. 3
2023
-
[21]
Automated flower classification over a large number of classes
Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In2008 Sixth Indian conference on computer vision, graphics & im- age processing, pages 722–729. IEEE, 2008. 5
2008
-
[22]
Cats and dogs
Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In2012 IEEE conference on computer vision and pattern recognition, pages 3498–3505. IEEE, 2012. 5
2012
-
[23]
What does a platypus look like? generating customized prompts for zero-shot image classification
Sarah Pratt, Ian Covert, Rosanne Liu, and Ali Farhadi. What does a platypus look like? generating customized prompts for zero-shot image classification. InProceedings of the IEEE/CVF international conference on computer vi- sion, pages 15691–15701, 2023. 3
2023
-
[24]
Proapo: Pro- gressively automatic prompt optimization for visual classifi- cation
Xiangyan Qu, Gaopeng Gou, Jiamin Zhuang, Jing Yu, Kun Song, Qihao Wang, Yili Li, and Gang Xiong. Proapo: Pro- gressively automatic prompt optimization for visual classifi- cation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 25145–25155, 2025. 1, 3
2025
-
[25]
Learning transferable visual models from natural language supervi- sion
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 1, 2, 3, 5
2021
-
[26]
Do imagenet classifiers generalize to im- agenet? InInternational conference on machine learning, pages 5389–5400
Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to im- agenet? InInternational conference on machine learning, pages 5389–5400. PMLR, 2019. 5
2019
-
[27]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 2, 3
2022
-
[28]
Consistency-guided prompt learning for vision-language models
Shuvendu Roy and Ali Etemad. Consistency-guided prompt learning for vision-language models. InThe Twelfth Inter- national Conference on Learning Representations, 2024. 1, 2
2024
-
[29]
Autoprompt: Eliciting knowl- edge from language models with automatically generated prompts
Taylor Shin, Yasaman Razeghi, Robert L Logan IV , Eric Wallace, and Sameer Singh. Autoprompt: Eliciting knowl- edge from language models with automatically generated prompts. InProceedings of the 2020 Conference on Em- pirical Methods in Natural Language Processing (EMNLP), pages 4222–4235, 2020. 2
2020
-
[30]
Test- time prompt tuning for zero-shot generalization in vision- language models.Advances in Neural Information Process- ing Systems, 35:14274–14289, 2022
Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, and Chaowei Xiao. Test- time prompt tuning for zero-shot generalization in vision- language models.Advances in Neural Information Process- ing Systems, 35:14274–14289, 2022. 1, 2, 3, 5, 7, 8
2022
-
[31]
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild.arXiv preprint arXiv:1212.0402, 2012. 5
work page internal anchor Pith review arXiv 2012
-
[32]
EVA-CLIP: Improved Training Techniques for CLIP at Scale
Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023. 1
work page internal anchor Pith review arXiv 2023
-
[33]
Learning robust global representations by penalizing local predictive power.Advances in neural information pro- cessing systems, 32, 2019
Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power.Advances in neural information pro- cessing systems, 32, 2019. 5
2019
-
[34]
Shuoyuan Wang, Yixuan Li, and Hongxin Wei. Under- standing and mitigating miscalibration in prompt tuning for vision-language models.arXiv preprint arXiv:2410.02681,
-
[35]
Sun database: Large-scale scene recognition from abbey to zoo
Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In2010 IEEE computer so- ciety conference on computer vision and pattern recognition, pages 3485–3492. IEEE, 2010. 5
2010
-
[36]
Any-shift prompt- ing for generalization over distributions
Zehao Xiao, Jiayi Shen, Mohammad Mahdi Derakhshani, Shengcai Liao, and Cees GM Snoek. Any-shift prompt- ing for generalization over distributions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13849–13860, 2024. 1, 2
2024
-
[37]
Dynaprompt: Dynamic test-time prompt tuning
Zehao Xiao, Shilin Yan, Jack Hong, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiayi Shen, Qi Wang, and Cees GM Snoek. Dynaprompt: Dynamic test-time prompt tuning. InThe Thirteenth International Conference on Learning Represen- tations, 2025. 2, 3, 5, 8
2025
-
[38]
Exploring clip’s dense knowledge for weakly supervised semantic segmentation
Zhiwei Yang, Yucong Meng, Kexue Fu, Feilong Tang, Shuo Wang, and Zhijian Song. Exploring clip’s dense knowledge for weakly supervised semantic segmentation. InProceed- ings of the Computer Vision and Pattern Recognition Con- ference, pages 20223–20232, 2025. 2
2025
-
[39]
C-tpt: Calibrated test-time prompt tuning for vision-language mod- els via text feature dispersion
Hee Suk Yoon, Eunseop Yoon, Joshua Tian Jin Tee, Mark Hasegawa-Johnson, Yingzhen Li, and Chang D Yoo. C-tpt: Calibrated test-time prompt tuning for vision-language mod- els via text feature dispersion. In12th International Confer- ence on Learning Representations, ICLR 2024, 2024. 2, 3, 8
2024
-
[40]
Sigmoid loss for language image pre-training
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023. 1
2023
-
[41]
Conditional prompt learning for vision-language mod- els
Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language mod- els. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 16816–16825,
-
[42]
Learning to prompt for vision-language models.In- ternational Journal of Computer Vision, 130(9):2337–2348,
Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models.In- ternational Journal of Computer Vision, 130(9):2337–2348,
-
[43]
Awt: Transferring vision-language models via augmentation, weighting, and transportation.Advances in Neural Information Processing Systems, 37:25561–25591,
Yuhan Zhu, Yuyang Ji, Zhiyu Zhao, Gangshan Wu, and Limin Wang. Awt: Transferring vision-language models via augmentation, weighting, and transportation.Advances in Neural Information Processing Systems, 37:25561–25591,
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.