Improving Calibration in Test-Time Prompt Tuning for Vision-Language Models via Data-Free Flatness-Aware Prompt Pretraining

Hyeonseo Jang; Jaebyeong Jeon; Joong-Won Hwang; Kibok Lee

arxiv: 2604.27715 · v1 · submitted 2026-04-30 · 💻 cs.CV

Pith reviewed 2026-05-07 05:49 UTC · model grok-4.3

classification 💻 cs.CV

keywords test-time prompt tuning · vision-language models · model calibration · loss landscape flatness · prompt initialization · data-free pretraining

The pith

Replacing the prompt initialization with data-free flatness-aware pretraining improves both calibration and performance in test-time prompt tuning for vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Test-time prompt tuning adapts vision-language models to unlabeled test data by optimizing textual prompts, but this process often lands in sharp regions of the loss landscape and produces poorly calibrated predictions. The paper shows that regularization approaches improve calibration by implicitly favoring flatter minima, and that the sharpness around the adapted prompt is a key factor in how well calibrated the model is. To address this at the source, the authors introduce Flatness-aware Prompt Pretraining, a data-free step that locates better starting prompts before any test-time optimization begins. Simply swapping the initialization into existing tuning pipelines raises both calibration quality and accuracy without changing any other component, adding test-time cost, or requiring labels.

Core claim

The sharpness of the loss landscape around adapted prompts is a key factor governing calibration quality in test-time prompt tuning. Flatness-aware Prompt Pretraining locates initial prompts inside flatter regions of that landscape without any labeled data, and substituting this initialization into standard TPT pipelines is sufficient to raise both calibration and performance.

What carries the argument

Flatness-aware Prompt Pretraining (FPP), a data-free optimization procedure that places prompts in flatter parts of the loss landscape before test-time adaptation starts.
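This page does not state FPP's actual objective, so the following is not the authors' procedure. As a minimal sketch of what a flatness-aware update can look like, it applies a generic sharpness-aware minimization (SAM) step to a toy quadratic loss over a two-dimensional "prompt" vector. The loss, the radius `rho`, and the learning rate `lr` are all illustrative assumptions.

```python
import math

def loss(p):
    # Toy loss: steep along p[0], shallow along p[1] (illustrative only).
    return 10.0 * p[0] ** 2 + 0.1 * p[1] ** 2

def grad(p):
    # Analytic gradient of the toy loss.
    return [20.0 * p[0], 0.2 * p[1]]

def sam_step(p, lr=0.01, rho=0.05):
    # 1) Move to the approximate worst-case point inside an L2 ball of radius rho.
    g = grad(p)
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0
    p_adv = [pi + rho * gi / norm for pi, gi in zip(p, g)]
    # 2) Descend from the original point using the gradient at the perturbed point.
    g_adv = grad(p_adv)
    return [pi - lr * gi for pi, gi in zip(p, g_adv)]

p = [1.0, 1.0]
for _ in range(200):
    p = sam_step(p)
```

The two-step structure is the point of the sketch: the gradient used for the update is taken at a nearby high-loss point, which penalizes solutions whose neighborhood contains steep walls.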

If this is right

Existing test-time prompt tuning pipelines gain better calibration and accuracy by changing only the starting prompt.
Calibration improves without the accuracy drop that usually accompanies added regularization terms.
The pretraining step adds no extra computation once test-time tuning begins.
The entire process remains label-free from pretraining through adaptation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Flatness in prompt space may be a more direct lever for reliable adaptation than post-hoc output constraints.
The same initialization idea could transfer to other parameter-efficient tuning settings that suffer from miscalibration.
Combining FPP with existing regularization methods might yield further gains on harder distribution shifts.
Measuring the flatness of the loss surface after adaptation could serve as a practical diagnostic for calibration quality.

Load-bearing premise

Prompts discovered by data-free flatness-aware pretraining will place the later test-time optimization inside flatter minima that generalize to new test distributions and produce measurably better calibration.

What would settle it

An experiment in which FPP-initialized prompts show no reduction in sharpness metrics around the final adapted prompt or fail to improve calibration scores relative to standard random or class-name initializations on the same test sets.
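One way such a check could be run is a perturbation-based sharpness probe: sample random directions of fixed norm around a prompt and record the largest loss increase. The sketch below is an assumption about how a probe of this kind might look, not the paper's metric. `loss_fn`, `rho`, and `n_samples` are hypothetical names, and the two quadratic bowls stand in for a real test-time loss.

```python
import math
import random

def random_direction(dim, rng):
    # Uniform direction on the unit sphere via normalized Gaussian samples.
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def sharpness(loss_fn, prompt, rho=0.1, n_samples=256, seed=0):
    # Largest loss increase over n_samples random perturbations of norm rho.
    rng = random.Random(seed)
    base = loss_fn(prompt)
    worst = 0.0
    for _ in range(n_samples):
        d = random_direction(len(prompt), rng)
        perturbed = [p + rho * di for p, di in zip(prompt, d)]
        worst = max(worst, loss_fn(perturbed) - base)
    return worst

# Two toy minima at the origin with different curvature (illustrative only).
sharp_bowl = lambda p: 50.0 * sum(x * x for x in p)
flat_bowl = lambda p: 0.5 * sum(x * x for x in p)

origin = [0.0, 0.0, 0.0]
s_sharp = sharpness(sharp_bowl, origin)
s_flat = sharpness(flat_bowl, origin)
```

Comparing this number for prompts adapted from a standard initialization against prompts adapted from an FPP initialization, on the same test inputs, is the kind of measurement the paragraph above asks for.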

Figures

Figures reproduced from arXiv: 2604.27715 by Hyeonseo Jang, Jaebyeong Jeon, Joong-Won Hwang, Kibok Lee.

**Figure 1.** Applying regularization loss into TPT (C-TPT and O…

**Figure 2.** Regularized TPT can be interpreted as an optimization…

**Figure 3.** Relationship between sharpness of the loss landscape and…

Original abstract

Test-time prompt tuning (TPT) has emerged as a promising technique for enhancing the adaptability of vision-language models by optimizing textual prompts using unlabeled test data. However, prior studies have observed that TPT often produces poorly calibrated models, raising concerns about the reliability of their predictions. Recent works address this issue by incorporating additional regularization terms that constrain model outputs, which improve calibration but often degrade performance. In this work, we reveal that these regularization strategies implicitly encourage optimization toward flatter minima, and that the sharpness of the loss landscape around adapted prompts is a key factor governing calibration quality. Motivated by this observation, we introduce Flatness-aware Prompt Pretraining (FPP), a simple yet effective pretraining framework for TPT that initializes prompts within flatter regions of the loss landscape prior to adaptation. We show that simply replacing the initialization in existing TPT pipelines--without modifying any other components--is sufficient to improve both calibration and performance. Notably, FPP requires no labeled data and incurs no additional computational costs during test-time tuning, making it highly practical for real-world deployment. The code is available at: https://github.com/YonseiML/fpp.
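For readers unfamiliar with the objective being tuned: TPT as introduced by Shu et al. (2022) minimizes the entropy of the prediction averaged over confident augmented views of a single test image. The following is a minimal sketch of that quantity on hand-written probability vectors. The function names and the `keep_ratio` value are illustrative assumptions, and no vision-language model or real data is involved.

```python
import math

def entropy(probs):
    # Shannon entropy in nats; zero-probability terms contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def marginal_entropy(view_probs, keep_ratio=0.5):
    # Keep the most confident (lowest-entropy) views, average their
    # predictions, and return the entropy of that averaged distribution.
    n_keep = max(1, int(len(view_probs) * keep_ratio))
    kept = sorted(view_probs, key=entropy)[:n_keep]
    n_classes = len(kept[0])
    avg = [sum(v[c] for v in kept) / n_keep for c in range(n_classes)]
    return entropy(avg)

# Per-view class probabilities for one test image (made-up values).
views = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.34, 0.33, 0.33],
    [0.40, 0.30, 0.30],
]
loss_selected = marginal_entropy(views, keep_ratio=0.5)
loss_all = marginal_entropy(views, keep_ratio=1.0)
```

Driving this loss down makes the averaged prediction more peaked without any label, which is also why the procedure can become overconfident and hence poorly calibrated.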

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

A data-free pretraining step that finds flatter prompt initializations improves both calibration and accuracy in standard TPT pipelines just by swapping the starting point, with no extra test-time cost.

The main takeaway is that a simple initialization change can lift calibration in test-time prompt tuning without touching the adaptation loop or adding runtime overhead. The authors first note that existing regularizers for TPT calibration tend to push toward flatter minima, then train prompts in a data-free stage to land in those flatter regions before handing them to ordinary TPT. The result is better-calibrated and more accurate predictions on the usual benchmarks, and the code is public.

Referee Report

2 major / 1 minor

Summary. The paper claims that regularization terms used in prior test-time prompt tuning (TPT) methods for vision-language models implicitly promote flatter minima in the loss landscape, and that the sharpness around adapted prompts is a primary driver of poor calibration. Motivated by this, the authors introduce Flatness-aware Prompt Pretraining (FPP), a data-free pretraining stage that finds initial prompts lying in flatter regions. They assert that simply swapping the prompt initialization in any existing TPT pipeline (without changing the test-time objective, optimizer, or any other component) yields simultaneous gains in calibration and accuracy on downstream tasks, with no added test-time cost or labeled data required.

Significance. If the central mechanism holds, the result would be practically significant: it decouples calibration improvement from the usual accuracy-calibration trade-off introduced by explicit regularizers, while adding zero overhead at deployment. The data-free, initialization-only nature makes the method immediately compatible with existing TPT codebases and could influence initialization strategies across test-time adaptation literature. The public code release further strengthens reproducibility.

major comments (2)

[Abstract / regularization analysis] Abstract and the section presenting the regularization-flatness observation: the manuscript states that prior regularizers 'implicitly encourage optimization toward flatter minima' and that 'sharpness of the loss landscape around adapted prompts is a key factor governing calibration quality,' yet provides no explicit flatness metric (Hessian trace, maximum loss under prompt perturbations, or neighborhood sharpness) nor quantitative correlation plots linking these quantities to the reported calibration improvements.
[Experiments / ablation studies] Experimental results on TPT adaptation: while FPP is shown to produce flatter prompts during its own (data-free) pretraining stage, the paper does not report any post-adaptation flatness measurement—e.g., sharpness evaluated on the test-time loss using the unlabeled test batch—on the final adapted prompts. Without this link, the observed calibration and accuracy gains remain consistent with a generic 'better starting point' effect rather than the claimed landscape-geometry mechanism.
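Of the flatness metrics named in the first comment, the Hessian trace can be estimated without forming the Hessian. The sketch below uses Hutchinson's estimator with finite-difference Hessian-vector products on a toy quadratic. It is an illustration under stated assumptions, not something the paper reports: `CURVATURES`, `eps`, and `n_probes` are hypothetical, and because the toy Hessian is diagonal the estimate happens to be exact, whereas in general it is a noisy average.

```python
import random

# Illustrative curvatures; the toy loss is f(x) = sum(c_i * x_i^2),
# so the true Hessian is diag(2 * c_i) with trace 10.5.
CURVATURES = [4.0, 1.0, 0.25]

def grad(x):
    # Analytic gradient of the toy loss.
    return [2.0 * c * xi for c, xi in zip(CURVATURES, x)]

def hessian_vector_product(grad_fn, x, v, eps=1e-3):
    # Central finite difference of the gradient along direction v.
    plus = grad_fn([xi + eps * vi for xi, vi in zip(x, v)])
    minus = grad_fn([xi - eps * vi for xi, vi in zip(x, v)])
    return [(a - b) / (2.0 * eps) for a, b in zip(plus, minus)]

def hutchinson_trace(grad_fn, x, n_probes=32, seed=0):
    # Average of v^T H v over random +/-1 (Rademacher) probe vectors.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_probes):
        v = [rng.choice((-1.0, 1.0)) for _ in x]
        hv = hessian_vector_product(grad_fn, x, v)
        total += sum(vi * hi for vi, hi in zip(v, hv))
    return total / n_probes

trace_estimate = hutchinson_trace(grad, [0.3, -0.2, 0.5])
```

In an automatic-differentiation framework the finite difference would normally be replaced by an exact Hessian-vector product, but the estimator itself is unchanged.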

minor comments (1)

[Method] The description of how the data-free pretraining objective is constructed (loss terms, sampling of pseudo-labels or augmentations) could be expanded with an explicit equation or pseudocode for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The two major comments correctly identify areas where additional quantitative evidence would strengthen the link between flatness and calibration. We will revise the manuscript to incorporate explicit flatness metrics, correlation plots, and post-adaptation measurements. Our point-by-point responses are below.

Referee: [Abstract / regularization analysis] Abstract and the section presenting the regularization-flatness observation: the manuscript states that prior regularizers 'implicitly encourage optimization toward flatter minima' and that 'sharpness of the loss landscape around adapted prompts is a key factor governing calibration quality,' yet provides no explicit flatness metric (Hessian trace, maximum loss under prompt perturbations, or neighborhood sharpness) nor quantitative correlation plots linking these quantities to the reported calibration improvements.

Authors: We appreciate this point. The original analysis used loss-landscape visualizations and indirect performance evidence. To make the mechanism explicit, we will add a neighborhood sharpness metric (maximum loss under small random prompt perturbations) and include quantitative scatter plots correlating these sharpness values with ECE across methods and datasets. These will be placed in the revised Section 3 and a new figure in the experiments section. revision: yes
Referee: [Experiments / ablation studies] Experimental results on TPT adaptation: while FPP is shown to produce flatter prompts during its own (data-free) pretraining stage, the paper does not report any post-adaptation flatness measurement—e.g., sharpness evaluated on the test-time loss using the unlabeled test batch—on the final adapted prompts. Without this link, the observed calibration and accuracy gains remain consistent with a generic 'better starting point' effect rather than the claimed landscape-geometry mechanism.

Authors: We agree that post-adaptation flatness is needed to rule out a generic initialization effect. In the revision we will measure and report the same perturbation-based sharpness on the test-time loss (using the unlabeled test batch) for the final adapted prompts. Results will compare standard vs. FPP initialization across all benchmarks and be presented in an extended ablation table with discussion in Section 4.3. revision: yes
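The responses above name ECE (expected calibration error) as the calibration score to be correlated with sharpness. As a reference point, here is a minimal sketch of the common equal-width-bin ECE. The bin count and the example confidences are assumptions chosen for illustration, not values from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    # Equal-width bins [k/n_bins, (k+1)/n_bins); confidence 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Weight each bin's |accuracy - confidence| gap by its share of samples.
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece

# Made-up predictions: an overconfident model and a well-matched one.
overconfident = expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 0, 1, 0])
matched = expected_calibration_error([0.75, 0.75, 0.75, 0.75], [1, 1, 1, 0])
```

With one ECE value and one sharpness value per method and dataset, the promised scatter plot is a plot of these pairs, and a rank correlation over them would summarize the trend.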

Circularity Check

0 steps flagged

No significant circularity; initialization change is independent of target metrics

full rationale

The paper's chain begins with an empirical analysis of existing regularization terms in TPT (which implicitly favor flatter minima) and uses that observation only to motivate the design of a data-free pretraining stage (FPP) that produces better initial prompts. The central claim is then that simply swapping the prompt initialization in any existing TPT pipeline yields measurable gains in calibration and accuracy; these gains are evaluated directly on standard metrics rather than being redefined or fitted as a function of flatness. No equation equates a prediction to its own input by construction, no parameter is fitted on a subset and then relabeled a prediction, and no load-bearing premise rests on a self-citation whose content is itself unverified. The flatness analysis functions as design rationale, not as a tautological re-expression of the final performance numbers. The method therefore remains a practical, externally measurable initialization heuristic.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard machine-learning assumptions about loss-landscape geometry and optimization dynamics. No new free parameters, invented entities, or ad-hoc axioms are introduced in the abstract.

axioms (2)

domain assumption Regularization strategies in TPT implicitly encourage optimization toward flatter minima
Presented as a revealed observation that motivates the method.
domain assumption The sharpness of the loss landscape around adapted prompts is a key factor governing calibration quality
Stated directly as the central link between landscape geometry and calibration.



[1] [1]

Calibration-aware prompt learning for medical vision- language models

Abhishek Basu, Fahad Shamshad, Ashshak Sharifdeen, Karthik Nandakumar, and Muhammad Haris Khan. Calibration-aware prompt learning for medical vision- language models. InProceedings of the British Machine Vision Conference (BMVC), 2025. 1, 2

work page 2025

[2] [2]

Latte: Language trajectory transformer

Arthur Bucker, Luis Figueredo, Sami Haddadin, Ashish Kapoor, Shuang Ma, Sai Vemprala, and Rogerio Bonatti. Latte: Language trajectory transformer. InAdvances in Neu- ral Information Processing Systems (NeurIPS), 2022. 1

work page 2022

[3] [3]

Sharp minima can generalize for deep nets

Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. InPro- ceedings of the 34th International Conference on Machine Learning (ICML), 2017. 1, 2

work page 2017

[4] [4]

Diverse data augmentation with diffusions for effective test-time prompt tuning

Chun-Mei Feng, Kai Yu, Yong Liu, Salman Khan, and Wangmeng Zuo. Diverse data augmentation with diffusions for effective test-time prompt tuning. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023. 1, 2, 3

work page 2023

[5] [5]

Sharpness-aware minimization for efficiently improving generalization

Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. InInternational Conference on Learning Representations (ICLR), 2021. 2, 3, 7, 8

work page 2021

[6] [6]

Bias- reduced uncertainty estimation for deep neural classifiers

Yonatan Geifman, Guy Uziel, and Ran El-Yaniv. Bias- reduced uncertainty estimation for deep neural classifiers. InInternational Conference on Learning Representations (ICLR), 2019. 5

work page 2019

[7] Taesik Gong, Yewon Kim, Taeckyung Lee, Sorn Chottananurak, and Sung-Ju Lee. Sotta: Robust test-time adaptation on noisy data streams. In Advances in Neural Information Processing Systems (NeurIPS), 2023. 2, 8

[8] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1321–1330. PMLR, 2017. 2

[9] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML), 2021. 1

[10] Archit Karandikar, Nicholas Cain, Dustin Tran, Balaji Lakshminarayanan, Jonathon Shlens, Michael Curtis Mozer, and Rebecca Roelofs. Soft calibration objectives for neural networks. In Advances in Neural Information Processing Systems, 2021. 2

[11] Adilbek Karmanov, Dayan Guan, Shijian Lu, Abdulmotaleb El Saddik, and Eric Xing. Efficient test-time adaptation of vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 7

[12] Linda Kaufman. The generalized householder transformation and sparse matrices. Linear Algebra and its Applications, 90:221–234, 1987. 5, 8

[13] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations (ICLR), 2017. 1, 2

[14] Apoorv Khandelwal, Luca Weihs, Roozbeh Mottaghi, and Aniruddha Kembhavi. Simple but effective: Clip embeddings for embodied ai. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 1

[15] Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19113–19122, 2023. 1, 2, 5

[16] Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Self-regulating prompts: Foundational model adaptation without forgetting. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023. 1

[17] Aviral Kumar, Sunita Sarawagi, and Ujjwal Jain. Trainable calibration measures for neural networks from kernel mean embeddings. In Proceedings of the 35th International Conference on Machine Learning, pages 2805–2814. PMLR,

[18] Jungmin Kwon, Jeongseop Kim, Hyunseo Park, and In Kwon Choi. Asam: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. In Proceedings of the 38th International Conference on Machine Learning (ICML), 2021. 2, 3, 7, 8

[19] Jing Lei, Max G’Sell, Alessandro Rinaldo, Ryan J. Tibshirani, and Larry Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111, 2018. 2

[20] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems (NeurIPS), 2018. 1, 2

[21] Tao Li, Pan Zhou, Zhengbao He, Xinwen Cheng, and Xiaolin Huang. Friendly sharpness-aware minimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 2

[22] Tengyuan Liang, Tomaso Poggio, Alexander Rakhlin, and James Stokes. Fisher-rao metric, geometry, and complexity of neural networks. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics (AISTATS), 2019. 5

[23] Bingyuan Liu, Ismail Ben Ayed, Adrian Galdran, and Jose Dolz. The devil is in the margin: Margin-based label smoothing for network calibration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 80–88, 2022. 2

[24] Jie Liu, Yixiao Zhang, Jie-Neng Chen, Junfei Xiao, Yongyi Lu, Bennett A. Landman, Yixuan Yuan, Alan Yuille, Yucheng Tang, and Zongwei Zhou. Clip-driven universal model for organ segmentation and tumor detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023. 1

[25] Yong Liu, Siqi Mai, Minhao Cheng, Xiangning Chen, Cho-Jui Hsieh, and Yang You. Random sharpness-aware minimization. In Advances in Neural Information Processing Systems (NeurIPS), 2022. 2, 3

[26] Israel Mason-Williams, Fredrik Ekholm, and Ferenc Huszar. Explicit regularisation, sharpness and calibration. In Advances in Neural Information Processing Systems (NeurIPS) Workshop on Scientific Methods for Understanding Deep Learning, 2024. 2

[27] Jishnu Mukhoti, Viveka Kulharia, Amartya Sanyal, Stuart Golodetz, Philip H.S. Torr, and Puneet K. Dokania. Calibrating deep neural networks using focal loss. In Advances in Neural Information Processing Systems (NeurIPS), 2020. 5

[28] Balamurali Murugesan, Julio Silva-Rodriguez, Ismail Ben Ayed, and Jose Dolz. Robust calibration of large vision-language adapters. In Proceedings of the European Conference on Computer Vision (ECCV), 2024. 1

[29] Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2015. 5

[30] Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Zhiquan Wen, Yaofo Chen, Peilin Zhao, and Mingkui Tan. Towards stable test-time adaptation in dynamic wild world. In International Conference on Learning Representations (ICLR), 2023. 2, 8

[31] Jeremy Nixon, Michael W. Dusenberry, Linchuan Zhang, Ghassen Jerfel, and Dustin Tran. Measuring calibration in deep learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2019. 5

[32] John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74. MIT Press, 1999. 2

[33] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. arXiv preprint,

[34] Ashshak Sharifdeen, Muhammad Akhtar Munir, Sanoojan Baliah, Salman Khan, and Muhammad Haris Khan. O-tpt: Orthogonality constraints for calibrating test-time prompt tuning in vision-language models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 19942–19951, 2025. 1, 2, 3, 5, 6, 7, 4

[35] Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, and Chaowei Xiao. Test-time prompt tuning for zero-shot generalization in vision-language models. In Advances in Neural Information Processing Systems (NeurIPS), 2022. 1, 2, 3, 5, 7

[36] Chengli Tan, Yubo Zhou, Haishan Ye, Guang Dai, Junmin Liu, Zengjie Song, Jiangshe Zhang, Zixiang Zhao, Yunda Hao, and Yong Xu. Towards understanding the calibration benefits of sharpness-aware minimization. arXiv preprint arXiv:2505.23866, 2025. 2

[37] Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer-Verlag, Berlin, Heidelberg, 2005. 2

[38] Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. In International Conference on Learning Representations (ICLR), 2021. 1, 4, 8

[39] Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, and Jimeng Sun. Medclip: Contrastive learning from unpaired medical images and text. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022. 1

[40] Zehao Xiao, Shilin Yan, Jack Hong, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiayi Shen, Cheems Wang, and Cees G. M. Snoek. Dynaprompt: Dynamic test-time prompt tuning. In International Conference on Learning Representations (ICLR), 2025. 1, 5

[41] Hee Suk Yoon, Joshua Tian Jin Tee, Eunseop Yoon, Sunjae Yoon, Gwangsu Kim, Yingzhen Li, and Chang D. Yoo. ESD: Expected Squared Difference as a Tuning-free Trainable Calibration Measure. In The Eleventh International Conference on Learning Representations, 2023. 2

[42] Hee Suk Yoon, Eunseop Yoon, Joshua Tian Jin Tee, Mark Hasegawa-Johnson, Yingzhen Li, and Chang D. Yoo. C-tpt: Calibrated test-time prompt tuning for vision-language models via text feature dispersion. In International Conference on Learning Representations (ICLR), 2024. 1, 2, 3, 5, 6, 7, 4

[43] Ce Zhang, Simon Stepputtis, Katia Sycara, and Yaqi Xie. Dual prototype evolving for test-time generalization of vision-language models. In Advances in Neural Information Processing Systems (NeurIPS), 2024. 7

[44] Zangwei Zheng, Mingyuan Ma, Kai Wang, Ziheng Qin, Xiangyu Yue, and Yang You. Preventing zero-shot transfer degradation in continual learning of vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023. 7

[45] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16816–16825, 2022. 1, 2

[46] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, pages 1–12, 2022. 1, 2, 5, 6, 4

[47] Pan Zhou, Jiashi Feng, Chao Ma, Caiming Xiong, Steven Hoi, and Weinan E. Towards theoretically understanding why sgd generalizes better than adam in deep learning. In Advances in Neural Information Processing Systems (NeurIPS), 2020. 1, 2

[48] Yuhan Zhu, Guozhen Zhang, Chen Xu, Haocheng Shen, Xiaoxin Chen, Gangshan Wu, and Limin Wang. Efficient test-time prompt tuning for vision-language models. arXiv preprint arXiv:2408.05775, 2024. 1, 2, 7

[49] Juntang Zhuang, Boqing Gong, Liangzhe Yuan, Yin Cui, Hartwig Adam, Nicha Dvornek, Sekhar Tatikonda, James Duncan, and Ting Liu. Surrogate gap minimization improves sharpness-aware training. In International Conference on Learning Representations (ICLR), 2022. 2, 3


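(A.5)

Moreover, since each $t_i$ has unit norm, $\lVert \mu(T) \rVert_2^2$ is determined by the pairwise inner products:

\[
\lVert \mu(T) \rVert_2^2 = \frac{1}{K^2} \Big\lVert \sum_{i=1}^{K} t_i \Big\rVert_2^2 = \frac{1}{K^2} \Big( K + 2 \sum_{1 \le i < j \le K} t_i^\top t_j \Big). \tag{A.6}
\]

Substituting this into the expression for $S(T)$, we obtain

\[
S(T) = K - 1 - \frac{2}{K} \sum_{1 \le i < j \le K} t_i^\top t_j. \tag{A.7}
\]

Therefore,

\[
\mathcal{L}_{\mathrm{disp}}(T) = -S(T), \qquad \mathcal{L}_{\mathrm{orth}}(T) = \frac{K}{2} \big( K - 1 - S(T) \big). \tag{A.8}
\]

Thus, both losse...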
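The algebra in (A.6)–(A.8) can be checked on random unit vectors. The sketch below is an illustration in plain Python, not the authors' implementation: it assumes $\mu(T)$ is the mean of the $K$ unit-norm rows $t_i$ and takes $S(T) = K - K\lVert\mu(T)\rVert_2^2$ as the starting expression; the dimensions, seed, and function names are hypothetical.

```python
import math
import random


def unit_vector(dim, rng):
    """Draw a random unit-norm vector by normalizing Gaussian samples."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pairwise_inner_sum(T):
    """Sum of t_i . t_j over all pairs with i < j."""
    K = len(T)
    return sum(dot(T[i], T[j]) for i in range(K) for j in range(i + 1, K))


def dispersion_via_mean(T):
    """S(T) = K - K * ||mu(T)||^2, with mu(T) assumed to be the row mean."""
    K, dim = len(T), len(T[0])
    mu = [sum(t[d] for t in T) / K for d in range(dim)]
    return K - K * dot(mu, mu)


def dispersion_via_pairs(T):
    """S(T) = K - 1 - (2 / K) * sum_{i<j} t_i . t_j, as in (A.7)."""
    K = len(T)
    return K - 1 - (2.0 / K) * pairwise_inner_sum(T)


def orth_loss_from_dispersion(T):
    """L_orth(T) = (K / 2) * (K - 1 - S(T)), as in (A.8)."""
    K = len(T)
    return (K / 2.0) * (K - 1 - dispersion_via_mean(T))


rng = random.Random(0)                        # fixed seed for reproducibility
K, dim = 5, 8                                 # assumed sizes for illustration
T = [unit_vector(dim, rng) for _ in range(K)]

s_mean = dispersion_via_mean(T)
s_pairs = dispersion_via_pairs(T)
l_disp = -s_mean                              # L_disp(T) = -S(T), as in (A.8)
l_orth = orth_loss_from_dispersion(T)
```

Under these assumptions `s_mean` and `s_pairs` agree up to floating-point error, and `l_orth` coincides with `pairwise_inner_sum(T)`, which is what (A.7) and (A.8) together imply: both losses are affine functions of the same quantity $S(T)$.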

