arxiv: 2506.02671 · v3 · submitted 2025-06-03 · 💻 cs.CV

Test-Time Distillation for Continual Model Adaptation

Pith reviewed 2026-05-19 11:01 UTC · model grok-4.3

classification 💻 cs.CV
keywords continual test-time adaptation · test-time distillation · vision-language model · model fusion · maximum softmax probability · optimal transport · distribution shift · image classification

The pith

Reframing continual test-time adaptation as distillation from a frozen vision-language model with MSP-based fusion and optimal transport rectification prevents error amplification and enables stable unsupervised adaptation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Deep neural networks lose accuracy when data distributions shift after deployment. Existing continual test-time adaptation methods use self-supervision that creates feedback loops amplifying early mistakes into model drift. This paper reframes adaptation as test-time distillation guided by an external frozen vision-language model. It identifies two pitfalls in direct distillation—the generalist trap from the VLM's lack of task specialization and the entropy bias from mismatched model calibrations—and solves them by dynamically blending predictions with maximum softmax probability weighting followed by optimal transport rectification. The resulting CoDiRe framework produces a more reliable supervisory signal that supports continuous stable adaptation at lower computational cost than prior approaches.

Core claim

The paper claims that test-time distillation guided by a frozen VLM overcomes self-referential error amplification in CTTA by first building a blended teacher through dynamic fusion of VLM and target model predictions weighted by maximum softmax probability to circumvent entropy bias, then applying optimal transport-based rectification to align the target model's outputs with this teacher for stable continual adaptation across distribution shifts.
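
The fusion step in this claim can be sketched in a few lines. This is a minimal illustration, assuming the weight is each model's maximum softmax probability normalized by the sum of the two; the summary does not quote the paper's exact weighting rule, so `msp_fuse` and its normalization are hypothetical.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def msp_fuse(p_vlm, p_tgt, eps=1e-12):
    """Blend two (N, K) probability arrays with per-sample MSP weights.

    Assumed rule (not taken from the paper): w_vlm = MSP_vlm / (MSP_vlm + MSP_tgt).
    """
    c_vlm = p_vlm.max(axis=1, keepdims=True)  # MSP of the frozen VLM
    c_tgt = p_tgt.max(axis=1, keepdims=True)  # MSP of the target model
    w_vlm = c_vlm / (c_vlm + c_tgt + eps)
    return w_vlm * p_vlm + (1.0 - w_vlm) * p_tgt

# Toy logits: the VLM is confident on sample 0, the target model on sample 1.
p_vlm = softmax(np.array([[2.0, 0.5, 0.1], [0.2, 0.1, 0.0]]))
p_tgt = softmax(np.array([[0.1, 0.3, 0.2], [0.0, 3.0, 0.5]]))
teacher = msp_fuse(p_vlm, p_tgt)
print(teacher.argmax(axis=1))
```

Because the weights are convex, the blended teacher stays a valid probability distribution, and each sample leans toward whichever model is more confident on it.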

What carries the argument

CoDiRe framework that constructs a robust blended teacher via MSP-weighted dynamic fusion of a generalist VLM and the task-specific target model, then uses Optimal Transport rectification to enforce alignment during adaptation.
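
One common way to use optimal transport on a batch of predictions is Sinkhorn normalization toward a target class marginal. The sketch below assumes a uniform class prior, a cost of negative log-probability, and an entropic regularizer `epsilon`; these choices and the function name `sinkhorn_rectify` are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def sinkhorn_rectify(probs, n_iters=50, epsilon=0.5):
    """Rescale an (N, K) probability matrix toward uniform class marginals.

    Assumptions (not from the paper): cost = -log p, uniform row and column
    marginals, entropic regularization strength `epsilon`.
    """
    n, k = probs.shape
    # Gibbs kernel exp(-cost / epsilon) with cost = -log p equals p ** (1 / epsilon).
    q = np.power(probs + 1e-12, 1.0 / epsilon)
    row_marginal = np.full(n, 1.0 / n)
    col_marginal = np.full(k, 1.0 / k)
    for _ in range(n_iters):
        q *= (col_marginal / q.sum(axis=0))[None, :]  # match class marginals
        q *= (row_marginal / q.sum(axis=1))[:, None]  # match sample marginals
    # Convert the transport plan back to per-sample distributions.
    return q / q.sum(axis=1, keepdims=True)

# Toy batch whose raw predictions all favor class 0.
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]])
rectified = sinkhorn_rectify(probs)
print(rectified.round(3))
```

In this toy batch the rectified rows still sum to one, but the total mass assigned to each class is balanced across the batch, which is the kind of correction a rectification step could apply before distilling into the target model.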

If this is right

  • The target model achieves stable adaptation without drifting into amplified errors from self-supervision loops.
  • Adaptation runs at lower time cost than self-supervised baselines while delivering higher accuracy: on ImageNet-C, the abstract reports a 10.55% gain over CoTTA at 48% of its time cost.
  • The approach works with heterogeneous architectures because MSP avoids reliance on comparable entropy scales.
  • Continuous rectification keeps predictions aligned with the blended teacher across sequential shifts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same MSP-plus-rectification pattern could stabilize multi-model fusion in other unsupervised settings such as domain generalization or federated learning.
  • Replacing the VLM with a different frozen generalist model might transfer the benefits to non-vision modalities if a comparable confidence metric exists.
  • The method suggests that explicit rectification steps can compensate for imperfect teachers in continual learning pipelines.
  • Testing on longer sequences of shifts would reveal whether the stability gains persist beyond the evaluated benchmarks.

Load-bearing premise

Maximum softmax probability provides a reliably superior confidence signal for weighting predictions from heterogeneous models with different calibrations under distribution shifts.
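
For reference, the two confidence signals at issue have standard definitions for a K-class probability vector p. The third line is only one plausible way to turn MSP into a fusion weight and is written here as an assumption; the summary does not quote the paper's exact rule.

```latex
\begin{align}
\mathrm{MSP}(p) &= \max_{k \in \{1,\dots,K\}} p_k,
  \qquad \tfrac{1}{K} \le \mathrm{MSP}(p) \le 1, \\
H(p) &= -\sum_{k=1}^{K} p_k \log p_k,
  \qquad 0 \le H(p) \le \log K, \\
w_{\mathrm{VLM}}(x) &= \frac{\mathrm{MSP}\big(p_{\mathrm{VLM}}(x)\big)}
  {\mathrm{MSP}\big(p_{\mathrm{VLM}}(x)\big) + \mathrm{MSP}\big(p_{\mathrm{tgt}}(x)\big)}
  \qquad \text{(assumed form)}.
\end{align}
```

Both signals change when a model's softmax temperature changes, so whether MSP is in fact more robust to calibration mismatch is an empirical question, which is why this premise is load-bearing.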

What would settle it

An experiment on ImageNet-C or similar benchmarks where an entropy-based fusion variant of the same VLM-plus-target setup achieves higher accuracy or lower drift than the MSP-weighted version would falsify the central advantage of the proposed fusion step.
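
A harness for that comparison could look like the sketch below. It assumes simple forms for each weighting rule (MSP, `exp(-entropy)`, and uniform); the function names and the synthetic predictions are hypothetical stand-ins for real VLM and target-model outputs, and nothing it prints is a result from the paper.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def entropy(p, eps=1e-12):
    return -(p * np.log(p + eps)).sum(axis=1, keepdims=True)

def fuse(p_a, p_b, mode):
    """Blend two (N, K) probability arrays under one of three weighting rules."""
    if mode == "msp":
        s_a, s_b = p_a.max(axis=1, keepdims=True), p_b.max(axis=1, keepdims=True)
    elif mode == "entropy":
        # Lower entropy gets higher weight; exp(-H) is an assumed choice.
        s_a, s_b = np.exp(-entropy(p_a)), np.exp(-entropy(p_b))
    elif mode == "uniform":
        s_a = s_b = np.ones((p_a.shape[0], 1))
    else:
        raise ValueError(f"unknown fusion mode: {mode}")
    w = s_a / (s_a + s_b)
    return w * p_a + (1.0 - w) * p_b

def accuracy(p, labels):
    return float((p.argmax(axis=1) == labels).mean())

def compare_fusions(p_a, p_b, labels):
    """Accuracy of the blended prediction under each weighting rule."""
    return {m: accuracy(fuse(p_a, p_b, m), labels) for m in ("msp", "entropy", "uniform")}

# Synthetic stand-ins for the two models' outputs (illustration only).
rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=200)

def noisy_probs(signal):
    logits = rng.normal(size=(200, 5))
    logits[np.arange(200), labels] += signal  # inject some true-class signal
    return softmax(logits)

p_a, p_b = noisy_probs(1.0), noisy_probs(2.0)
print(compare_fusions(p_a, p_b, labels))
```

Keeping everything except `mode` fixed is the point: any accuracy gap between the three entries can then be attributed to the weighting rule alone.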

Figures

Figures reproduced from arXiv: 2506.02671 by Fanding Huang, Jiazhen Huang, Jingyan Jiang, Qinting Jiang, Xiao Chen, Zhiming Liu, Zhi Wang.

Figure 1: (a) Prompt-based methods require longer testing time, while SAIL introduces AdaptNet, …
Figure 2: Overview of our proposed SAIL. (a) SAIL introduces AdaptNet, a lightweight and learnable visual adapter that collaborates with a frozen VLM for robust inference. (b) SAIL integrates a gradient-aware reset mechanism driven by the gradient drift indicator (GDI), which detects domain transitions and strategically resets AdaptNet parameters. (c) During inference, the VLM and AdaptNet collaborate to generate t…
Figure 3: Discussions of SAIL. (a) The effect of reset strategies and percentage. (b) The effect of interpolation weights. (c) The distribution curve of output entropies of VLM and AdaptNet. a priori observation above that deep layers are inherently more susceptible to accumulating harmful domain-specific drift from capturing distinct activation statistics and semantic features. In contrast, most alternative strateg…
Figure 4: On CIFAR-10-C with Gaussian noise corruption, we split the test samples into four subsets …
Figure 5: Illustration of ImageNet-C under five levels of severity. The dataset showcases 15 types of algorithmically generated corruptions across four categories: noise, blur, weather, and digital. Each corruption type is illustrated at five increasing levels of severity, demonstrating the progressive impact of these corruptions. Office-Home. Office-Home [43] is a domain adaptation dataset consisting of images from fou…
Figure 6: More Entropy Distributions of VLM and AdaptNet on CIFAR-10-C.
Figure 7: More Entropy Distributions of VLM and AdaptNet on ImageNet-C. E Broader Impacts The proposed SAIL framework offers a lightweight and effective solution for test-time adaptation of vision-language models. Its design enables deployment in resource-constrained environments and enhances model robustness under real-world distribution shifts. These benefits have the potential to support applications in healthcar…
Original abstract

Deep neural networks often suffer performance degradation upon deployment due to distribution shifts. Continual Test-Time Adaptation (CTTA) aims to address this issue in an unsupervised manner. However, existing methods that rely on self-supervision are prone to an inherent self-referential feedback loop that amplifies initial prediction errors, leading to model drift. We revisit this limitation and propose Test-Time Distillation (TTD), which reframes adaptation as a distillation process guided by a frozen Vision-Language Model (VLM) as an external signal. While promising, we find that direct distillation is fraught with two pitfalls: (1) the Generalist Trap, where the VLM's broad but non-specialized knowledge leads to suboptimal performance on specific tasks and shifts; and (2) the Entropy Bias, where naive model fusion techniques based on entropy fail due to the disparate calibration of heterogeneous models. These pitfalls highlight the need to build a robust supervisory signal and leverage it to guide the target model toward stable adaptation. Hence, we present CoDiRe, a Continual Distillation and Rectification framework for TTD. CoDiRe first constructs a robust blended teacher by dynamically fusing the predictions of the VLM and the target model. Critically, it circumvents the Entropy Bias by leveraging Maximum Softmax Probability (MSP) as a more reliable confidence metric for weighting each model's expertise. Then it applies an Optimal Transport-based rectification to further align predictions with the blended teacher, enabling continuous and stable adaptation. Extensive experiments show that CoDiRe outperforms state-of-the-art baselines, exceeding CoTTA by 10.55% with only 48% of its time cost on ImageNet-C. Project page is publicly available at https://github.com/walawalagoose/TTD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes CoDiRe, a Continual Distillation and Rectification framework for test-time distillation in continual test-time adaptation (CTTA). It addresses self-referential drift in prior self-supervised CTTA methods by using a frozen vision-language model (VLM) as an external teacher signal. The method constructs a blended teacher via dynamic MSP-weighted fusion of the VLM and target model predictions to avoid the Generalist Trap and Entropy Bias, then applies Optimal Transport rectification to align the target model's outputs with this teacher for stable adaptation. Experiments on ImageNet-C report that CoDiRe exceeds CoTTA by 10.55% while using only 48% of its time cost.

Significance. If the empirical claims hold after verification, the work would be significant for the CTTA literature by demonstrating a practical way to leverage generalist VLMs for stable, efficient adaptation without error amplification. The public GitHub repository supports reproducibility, and the combination of MSP-based fusion with OT rectification offers a concrete mechanism to mitigate two identified pitfalls in distillation-based TTD. The efficiency gain alongside accuracy improvement could influence deployment of adaptive models under distribution shift.

major comments (2)
  1. [§3.2] §3.2 (Blended Teacher Construction): The central claim that MSP provides a reliably superior confidence signal for weighting the VLM and target model (to circumvent Entropy Bias) is load-bearing for the quality of the supervisory signal and thus for the reported 10.55% gain; however, the manuscript provides no controlled ablation isolating MSP-weighted fusion against entropy-based weighting or uniform averaging, leaving open the possibility that gains arise from other unablated components such as the OT formulation or learning-rate schedule.
  2. [§4] §4 (Experiments), Table reporting ImageNet-C results: The headline performance comparison (CoDiRe vs. CoTTA) lacks details on the number of runs, standard deviation, or statistical significance tests, and the exact implementation choices for baselines (including CoTTA) are not fully specified, which weakens verifiability of the central empirical claim given the heterogeneous architectures involved.
minor comments (2)
  1. [Abstract and §3] The abstract and method sections use the term 'Entropy Bias' without a formal definition or equation; adding a short mathematical characterization would improve clarity.
  2. [Figure 1] Figure captions for the overall framework diagram could explicitly label the MSP fusion and OT rectification blocks to aid readers in tracing the algorithmic flow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important aspects for improving clarity and verifiability. We address each major comment point by point below, indicating the revisions we will incorporate in the next version of the manuscript.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Blended Teacher Construction): The central claim that MSP provides a reliably superior confidence signal for weighting the VLM and target model (to circumvent Entropy Bias) is load-bearing for the quality of the supervisory signal and thus for the reported 10.55% gain; however, the manuscript provides no controlled ablation isolating MSP-weighted fusion against entropy-based weighting or uniform averaging, leaving open the possibility that gains arise from other unablated components such as the OT formulation or learning-rate schedule.

    Authors: We agree that a dedicated ablation isolating the MSP-weighted fusion would provide stronger empirical support for this design choice. Our analysis in §3.2 motivates MSP on the basis of calibration differences between the VLM and target model, but we will add a controlled ablation study (comparing MSP weighting against entropy-based weighting and uniform averaging) to the revised manuscript. This will quantify the isolated contribution of the weighting scheme while keeping the OT rectification and other components fixed. revision: yes

  2. Referee: [§4] §4 (Experiments), Table reporting ImageNet-C results: The headline performance comparison (CoDiRe vs. CoTTA) lacks details on the number of runs, standard deviation, or statistical significance tests, and the exact implementation choices for baselines (including CoTTA) are not fully specified, which weakens verifiability of the central empirical claim given the heterogeneous architectures involved.

    Authors: We concur that additional statistical details and implementation specifics are necessary for full reproducibility and to substantiate the 10.55% gain. In the revised manuscript we will report results over multiple independent runs (with the exact number stated), include standard deviations, and add statistical significance tests for the primary comparisons. We will also expand the experimental section with precise hyperparameter settings and adaptation details for all baselines, including CoTTA, to ensure fair comparison across heterogeneous models. revision: yes
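
The statistics promised in this response are straightforward to compute once per-run accuracies exist. The sketch below uses only NumPy; the function names are hypothetical and the numbers are placeholders chosen for easy arithmetic, not results from the paper or from any baseline.

```python
import numpy as np

def summarize_runs(scores):
    """Mean and sample standard deviation (ddof=1) over independent runs."""
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.std(ddof=1)

def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for runs matched by seed.

    Undefined when every paired difference is identical (zero variance).
    """
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    n = d.size
    return d.mean() / (d.std(ddof=1) / np.sqrt(n)), n - 1

# Placeholder accuracies for illustration only; NOT results from the paper.
method_a = [61.0, 62.0, 63.0]
method_b = [60.0, 60.0, 60.0]

mean_a, std_a = summarize_runs(method_a)
t_stat, dof = paired_t_statistic(method_a, method_b)
print(f"method_a: {mean_a:.2f} +/- {std_a:.2f}; paired t = {t_stat:.3f} (dof = {dof})")
```

Pairing runs by seed, where the setup allows it, removes seed-to-seed variance from the comparison; the resulting statistic can be converted to a p-value with any standard t-distribution table or library.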

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper defines CoDiRe through explicit algorithmic components: MSP-weighted fusion to build a blended teacher (to address Entropy Bias) followed by Optimal Transport rectification. These steps are introduced as novel responses to identified pitfalls (Generalist Trap and Entropy Bias) and are evaluated via external comparisons on benchmarks such as ImageNet-C against baselines like CoTTA. No equations or self-citations are shown to reduce the reported performance gains or the supervisory signal to fitted parameters or prior self-referential inputs by construction. The derivation remains independent of the target results.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the premise that an off-the-shelf VLM supplies a useful external supervisory signal despite domain mismatch, and that MSP is a stable proxy for model expertise during fusion. No new physical entities or mathematical axioms beyond standard deep learning assumptions are introduced.

free parameters (1)
  • MSP-based fusion weights
    Dynamic weights derived from maximum softmax probabilities of VLM and target model; treated as computed rather than hand-tuned but still constitute an implicit choice of confidence metric.
axioms (1)
  • domain assumption: A frozen generalist VLM provides a more stable external signal than self-supervision for continual adaptation.
    Invoked when reframing adaptation as distillation guided by VLM to break the self-referential loop.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 4 internal anchors

  1. [1]

    Align your prompts: Test-time prompting with distribution alignment for zero-shot generalization

    Jameel Abdul Samadh, Mohammad Hanan Gani, Noor Hussein, Muhammad Uzair Khattak, Muhammad Muzammal Naseer, Fahad Shahbaz Khan, and Salman H Khan. Align your prompts: Test-time prompting with distribution alignment for zero-shot generalization. Advances in Neural Information Processing Systems, 36:80396–80413, 2023

  2. [2]

    Bottom-up and top-down attention for image captioning and visual question answering

    Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6077–6086, 2018

  3. [3]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021

  4. [4]

    SANTA: Source anchoring network and target alignment for continual test time adaptation

    Goirik Chakrabarty, Manogna Sreenivas, and Soma Biswas. SANTA: Source anchoring network and target alignment for continual test time adaptation. Transactions on Machine Learning Research, 2023

  5. [5]

    Recall and learn: Fine-tuning deep pretrained language models with less forgetting

    Sanyuan Chen, Yutai Hou, Yiming Cui, Wanxiang Che, Ting Liu, and Xiangzhan Yu. Recall and learn: Fine-tuning deep pretrained language models with less forgetting. arXiv preprint arXiv:2004.12651, 2020

  6. [6]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009

  7. [7]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021

  8. [8]

    Logit-based ensemble distribution distillation for robust autoregressive sequence uncertainties

    Yassir Fathullah, Guoxuan Xia, and Mark JF Gales. Logit-based ensemble distribution distillation for robust autoregressive sequence uncertainties. In Uncertainty in Artificial Intelligence, pages 582–591. PMLR, 2023

  9. [9]

    Diverse data augmentation with diffusions for effective test-time prompt tuning

    Chun-Mei Feng, Kai Yu, Yong Liu, Salman Khan, and Wangmeng Zuo. Diverse data augmentation with diffusions for effective test-time prompt tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2704–2714, 2023

  10. [10]

    Clip-adapter: Better vision-language models with feature adapters

    Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. International Journal of Computer Vision, 132(2):581–595, 2024

  11. [11]

    Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization

    Kamran Ghasedi Dizaji, Amirhossein Herandi, Cheng Deng, Weidong Cai, and Heng Huang. Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In Proceedings of the IEEE international conference on computer vision, pages 5736–5745, 2017

  12. [12]

    Refir: Grounding large restoration models with retrieval augmentation

    Hang Guo, Tao Dai, Zhihao Ouyang, Taolin Zhang, Yaohua Zha, Bin Chen, and Shu-tao Xia. Refir: Grounding large restoration models with retrieval augmentation. Advances in Neural Information Processing Systems, 37:46593–46621, 2024

  13. [13]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  14. [14]

    Benchmarking neural network robustness to common corruptions and perturbations

    Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. Proceedings of the International Conference on Learning Representations, 2019

  15. [15]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  16. [16]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021

  17. [17]

    Pcotta: Continual test-time adaptation for multi-task point cloud understanding

    Jincen Jiang, Qianyu Zhou, Yuhang Li, Xinkui Zhao, Meili Wang, Lizhuang Ma, Jian Chang, Jian Zhang, Xuequan Lu, et al. Pcotta: Continual test-time adaptation for multi-task point cloud understanding. Advances in Neural Information Processing Systems, 37:96229–96253, 2024

  18. [18]

    Efficient test-time adaptation of vision-language models

    Adilbek Karmanov, Dayan Guan, Shijian Lu, Abdulmotaleb El Saddik, and Eric Xing. Efficient test-time adaptation of vision-language models. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  19. [19]

    When model meets new normals: test-time adaptation for unsupervised time-series anomaly detection

    Dongmin Kim, Sunghyun Park, and Jaegul Choo. When model meets new normals: test-time adaptation for unsupervised time-series anomaly detection. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 13113–13121, 2024

  20. [20]

    Bilinear attention networks

    Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. Bilinear attention networks. Advances in neural information processing systems, 31, 2018

  21. [21]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009

  22. [22]

    Multi-concept customization of text-to-image diffusion

    Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1931–1941, 2023

  23. [23]

    Entropy is not enough for test-time adaptation: From the perspective of disentangled factors

    Jonghyun Lee, Dahuin Jung, Saehyung Lee, Junsung Park, Juhyeon Shin, Uiwon Hwang, and Sungroh Yoon. Entropy is not enough for test-time adaptation: From the perspective of disentangled factors. arXiv preprint arXiv:2403.07366, 2024

  24. [24]

    Deeper, broader and artier domain generalization

    Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M. Hospedales. Deeper, broader and artier domain generalization, 2017

  25. [25]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023

  26. [26]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pages 12888–12900. PMLR, 2022

  27. [27]

    Align before fuse: Vision and language representation learning with momentum distillation

    Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems, 34:9694–9705, 2021

  28. [28]

    Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation

    Jian Liang, Dapeng Hu, and Jiashi Feng. Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In International Conference on Machine Learning (ICML), pages 6028–6039, 2020

  29. [29]

    Frozen clip models are efficient video learners

    Ziyi Lin, Shijie Geng, Renrui Zhang, Peng Gao, Gerard De Melo, Xiaogang Wang, Jifeng Dai, Yu Qiao, and Hongsheng Li. Frozen clip models are efficient video learners. In European Conference on Computer Vision, pages 388–404. Springer, 2022

  30. [30]

    Vida: Homeostatic visual domain adapter for continual test time adaptation

    Jiaming Liu, Senqiao Yang, Peidong Jia, Ming Lu, Yandong Guo, Wei Xue, and Shanghang Zhang. Vida: Homeostatic visual domain adapter for continual test time adaptation. arXiv preprint arXiv:2306.04344, 2023

  31. [31]

    Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks

    Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, 32, 2019

  32. [32]

    Prompt distribution learning

    Yuning Lu, Jianzhuang Liu, Yonggang Zhang, Yajing Liu, and Xinmei Tian. Prompt distribution learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5206–5215, 2022

  33. [33]

    Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference

    R Thomas McCoy, Ellie Pavlick, and Tal Linzen. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. arXiv preprint arXiv:1902.01007, 2019

  34. [34]

    On the stability of fine-tuning bert: Misconceptions, explanations, and strong baselines

    Marius Mosbach, Maksym Andriushchenko, and Dietrich Klakow. On the stability of fine-tuning bert: Misconceptions, explanations, and strong baselines. arXiv preprint arXiv:2006.04884, 2020

  35. [35]

    Efficient test-time model adaptation without forgetting

    Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Yaofo Chen, Shijian Zheng, Peilin Zhao, and Mingkui Tan. Efficient test-time model adaptation without forgetting. In International conference on machine learning, pages 16888–16905. PMLR, 2022

  36. [36]

    Towards stable test-time adaptation in dynamic wild world

    Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Zhiquan Wen, Yaofo Chen, Peilin Zhao, and Mingkui Tan. Towards stable test-time adaptation in dynamic wild world. arXiv preprint arXiv:2302.12400, 2023

  37. [37]

    Test-time adaptation for depth completion

    Hyoungseob Park, Anjali Gupta, and Alex Wong. Test-time adaptation for depth completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20519–20529, 2024

  38. [38]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  39. [39]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  40. [40]

    A layer selection approach to test time adaptation

    Sabyasachi Sahoo, Mostafa ElAraby, Jonas Ngnawe, Yann Batiste Pequignot, Frédéric Precioso, and Christian Gagné. A layer selection approach to test time adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 20237–20245, 2025

  41. [41]

    Removing covariate shift improves robustness against common corruptions

    Steffen Schneider, Evgenia Rusak, Luisa Eck, Oliver Bringmann, Wieland Brendel, and Matthias Bethge. Removing covariate shift improves robustness against common corruptions. CoRR, abs/2006.16971, 2020

  42. [42]

    Test-time prompt tuning for zero-shot generalization in vision-language models

    Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, and Chaowei Xiao. Test-time prompt tuning for zero-shot generalization in vision-language models. Advances in Neural Information Processing Systems, 35:14274–14289, 2022

  43. [43]

    Deep hashing network for unsupervised domain adaptation

    Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5018–5027, 2017

  44. [44]

    Tent: Fully Test-time Adaptation by Entropy Minimization

    Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. arXiv preprint arXiv:2006.10726, 2020

  45. [45]

    Continual test-time domain adaptation

    Qin Wang, Olga Fink, Luc Van Gool, and Dengxin Dai. Continual test-time domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7201–7211, 2022

  46. [46]

    Clip-guided prototype modulating for few-shot action recognition

    Xiang Wang, Shiwei Zhang, Jun Cen, Changxin Gao, Yingya Zhang, Deli Zhao, and Nong Sang. Clip-guided prototype modulating for few-shot action recognition. International Journal of Computer Vision, 132(6):1899–1912, 2024

  47. [47]

    Vita-clip: Video and text adaptive clip via multimodal prompting

    Syed Talal Wasim, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, and Mubarak Shah. Vita-clip: Video and text adaptive clip via multimodal prompting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23034–23044, 2023

  48. [48]

    Modality-collaborative test-time adaptation for action recognition

    Baochen Xiong, Xiaoshan Yang, Yaguang Song, Yaowei Wang, and Changsheng Xu. Modality-collaborative test-time adaptation for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26732–26741, 2024

  49. [49]

    Vision-language pre-training with triple contrastive learning

    Jinyu Yang, Jiali Duan, Son Tran, Yi Xu, Sampath Chanda, Liqun Chen, Belinda Zeng, Trishul Chilimbi, and Junzhou Huang. Vision-language pre-training with triple contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15671–15680, 2022

  50. [50]

    Exploiting the intrinsic neighborhood structure for source-free domain adaptation

    Shiqi Yang, Joost Van de Weijer, Luis Herranz, Shangling Jui, et al. Exploiting the intrinsic neighborhood structure for source-free domain adaptation. Advances in neural information processing systems, 34:29393–29405, 2021

  51. [51]

    How transferable are features in deep neural networks?

    Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? Advances in neural information processing systems, 27, 2014

  52. [52]

    Deep modular co-attention networks for visual question answering

    Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian. Deep modular co-attention networks for visual question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6281–6290, 2019

  53. [53]

    Investigating the catastrophic forgetting in multimodal large language models

    Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, and Yi Ma. Investigating the catastrophic forgetting in multimodal large language models. arXiv preprint arXiv:2309.10313, 2023

  54. [54]

    Memo: Test time robustness via adaptation and augmentation

    Marvin Zhang, Sergey Levine, and Chelsea Finn. Memo: Test time robustness via adaptation and augmentation. Advances in neural information processing systems, 35:38629–38642, 2022

  55. [55]

    Learning 3d representations from 2d pre-trained models via image-to-point masked autoencoders

    Renrui Zhang, Liuhui Wang, Yu Qiao, Peng Gao, and Hongsheng Li. Learning 3d representations from 2d pre-trained models via image-to-point masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21769–21780, 2023

  56. [56]

    Can language understand depth?

    Renrui Zhang, Ziyao Zeng, Ziyu Guo, and Yafeng Li. Can language understand depth? In Proceedings of the 30th ACM International Conference on Multimedia, pages 6868–6874, 2022

  57. [57]

    Boostadapter: Improving test-time adaptation via regional bootstrapping

    Taolin Zhang, Jinpeng Wang, Hang Guo, Tao Dai, Bin Chen, and Shu-Tao Xia. Boostadapter: Improving test-time adaptation via regional bootstrapping. arXiv preprint arXiv:2410.15430, 2024

  58. [58]

    On pitfalls of test-time adaptation

    Hao Zhao, Yuejiang Liu, Alexandre Alahi, and Tao Lin. On pitfalls of test-time adaptation. arXiv preprint arXiv:2306.03536, 2023

  59. [59]

    Conditional prompt learning for vision-language models

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16816–16825, 2022

  60. [60]

    Learning to prompt for vision-language models

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022