Test-Time Distillation for Continual Model Adaptation

Fanding Huang; Jiazhen Huang; Jingyan Jiang; Qinting Jiang; Xiao Chen; Zhiming Liu; Zhi Wang

arxiv: 2506.02671 · v3 · submitted 2025-06-03 · 💻 cs.CV

Pith reviewed 2026-05-19 11:01 UTC · model grok-4.3

classification 💻 cs.CV

keywords continual test-time adaptation · test-time distillation · vision-language model · model fusion · maximum softmax probability · optimal transport · distribution shift · image classification

The pith

Reframing continual test-time adaptation as distillation from a frozen vision-language model with MSP-based fusion and optimal transport rectification prevents error amplification and enables stable unsupervised adaptation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Deep neural networks lose accuracy when data distributions shift after deployment. Existing continual test-time adaptation methods use self-supervision that creates feedback loops amplifying early mistakes into model drift. This paper reframes adaptation as test-time distillation guided by an external frozen vision-language model. It identifies two pitfalls in direct distillation—the generalist trap from the VLM's lack of task specialization and the entropy bias from mismatched model calibrations—and solves them by dynamically blending predictions with maximum softmax probability weighting followed by optimal transport rectification. The resulting CoDiRe framework produces a more reliable supervisory signal that supports continuous stable adaptation at lower computational cost than prior approaches.

Core claim

The paper claims that test-time distillation guided by a frozen VLM overcomes self-referential error amplification in CTTA by first building a blended teacher through dynamic fusion of VLM and target model predictions weighted by maximum softmax probability to circumvent entropy bias, then applying optimal transport-based rectification to align the target model's outputs with this teacher for stable continual adaptation across distribution shifts.
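The fusion step in this claim can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the weight follows the formula quoted further down this page, λ = exp(max p_VLM) / (exp(max p_VLM) + exp(max p_Ada)), while the function names and the choice to apply λ as a convex combination of the two probability vectors are assumptions made here for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def msp_fusion_weight(p_vlm, p_target):
    """Weight on the VLM: a two-way softmax over the two models' MSPs."""
    a = math.exp(max(p_vlm))
    b = math.exp(max(p_target))
    return a / (a + b)

def blended_teacher(p_vlm, p_target):
    """Assumed blending rule: convex combination of the two distributions."""
    lam = msp_fusion_weight(p_vlm, p_target)
    return [lam * v + (1.0 - lam) * t for v, t in zip(p_vlm, p_target)]

# Toy inputs (hypothetical logits, not data from the paper).
p_vlm = softmax([2.0, 0.5, 0.1])      # fairly confident VLM
p_target = softmax([0.4, 0.3, 0.2])   # unsure target model
teacher = blended_teacher(p_vlm, p_target)
```

Because the weight depends only on each model's top probability, which always lies in [0, 1], the rule does not need the two models' entropies to be on comparable scales.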

What carries the argument

CoDiRe framework that constructs a robust blended teacher via MSP-weighted dynamic fusion of a generalist VLM and the task-specific target model, then uses Optimal Transport rectification to enforce alignment during adaptation.
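The rectification step is described here only at a high level, so the sketch below shows one common entropic optimal-transport construction rather than the paper's exact formulation: Sinkhorn-Knopp scaling of a batch of teacher predictions so that every class receives equal total mass. The uniform class marginal, the function name, and the iteration count are all assumptions.

```python
def sinkhorn_rectify(probs, n_iters=50):
    """Alternately rescale columns and rows of a batch of predictions.

    probs: list of n rows, each a length-k probability vector with all
    entries strictly positive. Returns a matrix whose rows each sum to 1
    and whose column sums approach n / k (assumed uniform class marginal).
    """
    n, k = len(probs), len(probs[0])
    q = [row[:] for row in probs]
    for _ in range(n_iters):
        # Column step: give every class the same total mass n / k.
        col = [sum(q[i][c] for i in range(n)) for c in range(k)]
        q = [[q[i][c] * (n / k) / col[c] for c in range(k)] for i in range(n)]
        # Row step: make every sample a probability distribution again.
        q = [[v / sum(row) for v in row] for row in q]
    return q

# Toy batch biased towards class 0 (hypothetical values).
batch = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
rectified = sinkhorn_rectify(batch)
```

The scaling preserves the ranking of samples within a class while removing the batch-level bias towards one class, which is the kind of correction a rectification step could provide to a skewed teacher.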

If this is right

The target model achieves stable adaptation without drifting into amplified errors from self-supervision loops.
Adaptation runs with substantially lower time cost than self-supervised baselines such as CoTTA while delivering higher accuracy.
The approach works with heterogeneous architectures because MSP avoids reliance on comparable entropy scales.
Continuous rectification keeps predictions aligned with the blended teacher across sequential shifts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same MSP-plus-rectification pattern could stabilize multi-model fusion in other unsupervised settings such as domain generalization or federated learning.
Replacing the VLM with a different frozen generalist model might transfer the benefits to non-vision modalities if a comparable confidence metric exists.
The method suggests that explicit rectification steps can compensate for imperfect teachers in continual learning pipelines.
Testing on longer sequences of shifts would reveal whether the stability gains persist beyond the evaluated benchmarks.

Load-bearing premise

Maximum softmax probability provides a reliably superior confidence signal for weighting predictions from heterogeneous models with different calibrations under distribution shifts.

What would settle it

An experiment on ImageNet-C or similar benchmarks where an entropy-based fusion variant of the same VLM-plus-target setup achieves higher accuracy or lower drift than the MSP-weighted version would falsify the central advantage of the proposed fusion step.
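To make concrete what such an experiment would swap, the sketch below puts the MSP rule next to a hypothetical entropy-based rule. The entropy rule (a two-way softmax over negative entropies) is an assumption chosen for illustration; the paper's own entropy baseline may differ. The toy vectors only show that the two rules can disagree and say nothing about which yields better accuracy.

```python
import math

def entropy(p):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def msp_weight(p_a, p_b):
    """Weight on model A from a two-way softmax over maximum softmax probabilities."""
    a, b = math.exp(max(p_a)), math.exp(max(p_b))
    return a / (a + b)

def entropy_weight(p_a, p_b):
    """Hypothetical entropy-based rule: lower entropy earns a larger weight."""
    a, b = math.exp(-entropy(p_a)), math.exp(-entropy(p_b))
    return a / (a + b)

# Same top-1 probability, different tail shape (toy values).
peaked_tail = [0.6, 0.4, 0.0, 0.0, 0.0]
flat_tail = [0.6, 0.1, 0.1, 0.1, 0.1]
w_msp = msp_weight(peaked_tail, flat_tail)      # equal MSPs, so weight is 0.5
w_ent = entropy_weight(peaked_tail, flat_tail)  # favours the lower-entropy vector
```

Both models are equally confident in their top class, yet the entropy rule favours the first one purely because of how the remaining mass is spread.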

Figures

Figures reproduced from arXiv: 2506.02671 by Fanding Huang, Jiazhen Huang, Jingyan Jiang, Qinting Jiang, Xiao Chen, Zhiming Liu, Zhi Wang.

**Figure 2.** Overview of our proposed SAIL. (a) SAIL introduces AdaptNet, a lightweight and learnable visual adapter that collaborates with a frozen VLM for robust inference. (b) SAIL integrates a gradient-aware reset mechanism driven by the gradient drift indicator (GDI), which detects domain transitions and strategically resets AdaptNet parameters. (c) During inference, the VLM and AdaptNet collaborate to generate t…

**Figure 3.** Discussions of SAIL. (a) The effect of reset strategies and percentage. (b) The effect of interpolation weights. (c) The distribution curve of output entropies of VLM and AdaptNet.

**Figure 4.** On CIFAR-10-C with Gaussian noise corruption, we split the test samples into four subsets

**Figure 5.** Illustration of ImageNet-C under 5 levels of severity. The dataset showcases 15 types of algorithmically generated corruptions across four categories: noise, blur, weather, and digital. Each corruption type is illustrated at five increasing levels of severity, demonstrating the progressive impact of these corruptions.

**Figure 6.** More Entropy Distributions of VLM and AdaptNet on CIFAR-10-C.

**Figure 7.** More Entropy Distributions of VLM and AdaptNet on ImageNet-C.

Original abstract

Deep neural networks often suffer performance degradation upon deployment due to distribution shifts. Continual Test-Time Adaptation (CTTA) aims to address this issue in an unsupervised manner. However, existing methods that rely on self-supervision are prone to an inherent self-referential feedback loop that amplifies initial prediction errors, leading to model drift. We revisit this limitation and propose Test-Time Distillation (TTD), which reframes adaptation as a distillation process guided by a frozen Vision-Language Model (VLM) as an external signal. While promising, we find that direct distillation is fraught with two pitfalls: (1) the Generalist Trap, where the VLM's broad but non-specialized knowledge leads to suboptimal performance on specific tasks and shifts; and (2) the Entropy Bias, where naive model fusion techniques based on entropy fail due to the disparate calibration of heterogeneous models. These pitfalls highlight the need to build a robust supervisory signal and leverage it to guide the target model toward stable adaptation. Hence, we present CoDiRe, a Continual Distillation and Rectification framework for TTD. CoDiRe first constructs a robust blended teacher by dynamically fusing the predictions of the VLM and the target model. Critically, it circumvents the Entropy Bias by leveraging Maximum Softmax Probability (MSP) as a more reliable confidence metric for weighting each model's expertise. Then it applies an Optimal Transport-based rectification to further align predictions with the blended teacher, enabling continuous and stable adaptation. Extensive experiments show that CoDiRe outperforms state-of-the-art baselines, exceeding CoTTA by 10.55% with only 48% of its time cost on ImageNet-C. Project page is publicly available at https://github.com/walawalagoose/TTD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CoDiRe shows usable gains on ImageNet-C by blending VLM and target predictions with MSP then rectifying via OT, but the MSP weighting rule is not isolated as the driver.

The paper's core move is to treat continual test-time adaptation as distillation from a frozen VLM rather than pure self-supervision, then fix two issues: the VLM's lack of task specificity and the entropy bias that appears when fusing mismatched models. The authors build a blended teacher with MSP weights and follow it with an optimal transport step to keep the target model aligned over time. On ImageNet-C this beats CoTTA by 10.55% at 48% of its time cost. That combination is new enough to be worth noting for people already working on CTTA pipelines. The numbers are reported against named baselines and the method is described in enough algorithmic detail to reproduce the main loop. The practical payoff for deployment under shifting conditions is clear from the setup.

The soft spot is exactly the one the stress-test flags: the headline improvement rests on MSP producing better fusion weights than entropy or a plain average, yet the text gives no ablation that holds architecture and calibration differences fixed while swapping only the weighting rule. Without that isolation, the OT rectification, the learning-rate schedule, or the particular VLM choice could be doing most of the work. The assumption that MSP is reliably superior across heterogeneous models is stated but not stress-tested in the reported experiments.

This is a paper for the CTTA subgroup that needs a concrete, lower-cost way to stabilize adaptation. It engages the existing literature directly and the empirical claims are checkable on public data, so it deserves a serious referee even if the MSP contribution needs tighter evidence.

Referee Report

2 major / 2 minor

Summary. The paper proposes CoDiRe, a Continual Distillation and Rectification framework for test-time distillation in continual test-time adaptation (CTTA). It addresses self-referential drift in prior self-supervised CTTA methods by using a frozen vision-language model (VLM) as an external teacher signal. The method constructs a blended teacher via dynamic MSP-weighted fusion of the VLM and target model predictions to avoid the Generalist Trap and Entropy Bias, then applies Optimal Transport rectification to align the target model's outputs with this teacher for stable adaptation. Experiments on ImageNet-C report that CoDiRe exceeds CoTTA by 10.55% while using only 48% of its time cost.

Significance. If the empirical claims hold after verification, the work would be significant for the CTTA literature by demonstrating a practical way to leverage generalist VLMs for stable, efficient adaptation without error amplification. The public GitHub repository supports reproducibility, and the combination of MSP-based fusion with OT rectification offers a concrete mechanism to mitigate two identified pitfalls in distillation-based TTD. The efficiency gain alongside accuracy improvement could influence deployment of adaptive models under distribution shift.

major comments (2)

[§3.2] §3.2 (Blended Teacher Construction): The central claim that MSP provides a reliably superior confidence signal for weighting the VLM and target model (to circumvent Entropy Bias) is load-bearing for the quality of the supervisory signal and thus for the reported 10.55% gain; however, the manuscript provides no controlled ablation isolating MSP-weighted fusion against entropy-based weighting or uniform averaging, leaving open the possibility that gains arise from other unablated components such as the OT formulation or learning-rate schedule.
[§4] §4 (Experiments), Table reporting ImageNet-C results: The headline performance comparison (CoDiRe vs. CoTTA) lacks details on the number of runs, standard deviation, or statistical significance tests, and the exact implementation choices for baselines (including CoTTA) are not fully specified, which weakens verifiability of the central empirical claim given the heterogeneous architectures involved.

minor comments (2)

[Abstract and §3] The abstract and method sections use the term 'Entropy Bias' without a formal definition or equation; adding a short mathematical characterization would improve clarity.
[Figure 1] Figure captions for the overall framework diagram could explicitly label the MSP fusion and OT rectification blocks to aid readers in tracing the algorithmic flow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important aspects for improving clarity and verifiability. We address each major comment point by point below, indicating the revisions we will incorporate in the next version of the manuscript.

Referee: [§3.2] §3.2 (Blended Teacher Construction): The central claim that MSP provides a reliably superior confidence signal for weighting the VLM and target model (to circumvent Entropy Bias) is load-bearing for the quality of the supervisory signal and thus for the reported 10.55% gain; however, the manuscript provides no controlled ablation isolating MSP-weighted fusion against entropy-based weighting or uniform averaging, leaving open the possibility that gains arise from other unablated components such as the OT formulation or learning-rate schedule.

Authors: We agree that a dedicated ablation isolating the MSP-weighted fusion would provide stronger empirical support for this design choice. Our analysis in §3.2 motivates MSP on the basis of calibration differences between the VLM and target model, but we will add a controlled ablation study (comparing MSP weighting against entropy-based weighting and uniform averaging) to the revised manuscript. This will quantify the isolated contribution of the weighting scheme while keeping the OT rectification and other components fixed.
Revision: yes

Referee: [§4] §4 (Experiments), Table reporting ImageNet-C results: The headline performance comparison (CoDiRe vs. CoTTA) lacks details on the number of runs, standard deviation, or statistical significance tests, and the exact implementation choices for baselines (including CoTTA) are not fully specified, which weakens verifiability of the central empirical claim given the heterogeneous architectures involved.

Authors: We concur that additional statistical details and implementation specifics are necessary for full reproducibility and to substantiate the 10.55% gain. In the revised manuscript we will report results over multiple independent runs (with the exact number stated), include standard deviations, and add statistical significance tests for the primary comparisons. We will also expand the experimental section with precise hyperparameter settings and adaptation details for all baselines, including CoTTA, to ensure fair comparison across heterogeneous models.
Revision: yes
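
The reporting promised in this response could take roughly the following shape. This is a hedged sketch using only the Python standard library; the helper names are hypothetical, the paired design (same seeds or stream order for both methods) is an assumption, and the numbers are placeholders, not results from the paper.

```python
import math
import statistics

def summarize(accs):
    """Mean and sample standard deviation over independent runs."""
    return statistics.mean(accs), statistics.stdev(accs)

def paired_t_statistic(accs_a, accs_b):
    """t statistic for paired runs (assumes the differences are not all equal)."""
    diffs = [a - b for a, b in zip(accs_a, accs_b)]
    sd = statistics.stdev(diffs)
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))

# Placeholder values for illustration only; NOT results from the paper.
method_a = [61.0, 62.0, 63.0]
method_b = [60.0, 60.0, 60.0]
mean_a, std_a = summarize(method_a)
t_stat = paired_t_statistic(method_a, method_b)
```

The statistic would then be compared against a t distribution with n − 1 degrees of freedom, where n is the number of paired runs, to obtain a p-value.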

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper defines CoDiRe through explicit algorithmic components: MSP-weighted fusion to build a blended teacher (to address Entropy Bias) followed by Optimal Transport rectification. These steps are introduced as novel responses to identified pitfalls (Generalist Trap and Entropy Bias) and are evaluated via external comparisons on benchmarks such as ImageNet-C against baselines like CoTTA. No equations or self-citations are shown to reduce the reported performance gains or the supervisory signal to fitted parameters or prior self-referential inputs by construction. The derivation remains independent of the target results.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the premise that an off-the-shelf VLM supplies a useful external supervisory signal despite domain mismatch, and that MSP is a stable proxy for model expertise during fusion. No new physical entities or mathematical axioms beyond standard deep learning assumptions are introduced.

free parameters (1)

MSP-based fusion weights
Dynamic weights derived from maximum softmax probabilities of VLM and target model; treated as computed rather than hand-tuned but still constitute an implicit choice of confidence metric.

axioms (1)

Domain assumption: A frozen generalist VLM provides a more stable external signal than self-supervision for continual adaptation.
Invoked when reframing adaptation as distillation guided by VLM to break the self-referential loop.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: $\lambda = \dfrac{\exp(\max p^{\mathrm{VLM}}_i)}{\exp(\max p^{\mathrm{VLM}}_i) + \exp(\max p^{\mathrm{Ada}}_i)}$; $\mathcal{L}_{\mathrm{Align}} = -\frac{1}{n} \sum p_{ic} \log p^{\mathrm{Ada}}_{ic}$ + category-balance term; $\mathrm{GDI}_t = \cos(\delta_t, \delta_{\mathrm{anchor}})$

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Paper passage: Entire SAIL pipeline (batch-wise AdaptNet update, no VLM modification, no augmentation)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 4 internal anchors

[1] Jameel Abdul Samadh, Mohammad Hanan Gani, Noor Hussein, Muhammad Uzair Khattak, Muhammad Muzammal Naseer, Fahad Shahbaz Khan, and Salman H Khan. Align your prompts: Test-time prompting with distribution alignment for zero-shot generalization. Advances in Neural Information Processing Systems, 36:80396–80413, 2023.
[2] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086, 2018.
[3] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
[4] Goirik Chakrabarty, Manogna Sreenivas, and Soma Biswas. SANTA: Source anchoring network and target alignment for continual test time adaptation. Transactions on Machine Learning Research, 2023.
[5] Sanyuan Chen, Yutai Hou, Yiming Cui, Wanxiang Che, Ting Liu, and Xiangzhan Yu. Recall and learn: Fine-tuning deep pretrained language models with less forgetting. arXiv preprint arXiv:2004.12651, 2020.
[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
[8] Yassir Fathullah, Guoxuan Xia, and Mark JF Gales. Logit-based ensemble distribution distillation for robust autoregressive sequence uncertainties. In Uncertainty in Artificial Intelligence, pages 582–591. PMLR, 2023.
[9] Chun-Mei Feng, Kai Yu, Yong Liu, Salman Khan, and Wangmeng Zuo. Diverse data augmentation with diffusions for effective test-time prompt tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2704–2714, 2023.
[10] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. International Journal of Computer Vision, 132(2):581–595, 2024.
[11] Kamran Ghasedi Dizaji, Amirhossein Herandi, Cheng Deng, Weidong Cai, and Heng Huang. Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In Proceedings of the IEEE International Conference on Computer Vision, pages 5736–5745, 2017.
[12] Hang Guo, Tao Dai, Zhihao Ouyang, Taolin Zhang, Yaohua Zha, Bin Chen, and Shu-tao Xia. Refir: Grounding large restoration models with retrieval augmentation. Advances in Neural Information Processing Systems, 37:46593–46621, 2024.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[14] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. Proceedings of the International Conference on Learning Representations, 2019.
[15] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[16] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.
[17] Jincen Jiang, Qianyu Zhou, Yuhang Li, Xinkui Zhao, Meili Wang, Lizhuang Ma, Jian Chang, Jian Zhang, Xuequan Lu, et al. Pcotta: Continual test-time adaptation for multi-task point cloud understanding. Advances in Neural Information Processing Systems, 37:96229–96253, 2024.
[18] Adilbek Karmanov, Dayan Guan, Shijian Lu, Abdulmotaleb El Saddik, and Eric Xing. Efficient test-time adaptation of vision-language models. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
[19] Dongmin Kim, Sunghyun Park, and Jaegul Choo. When model meets new normals: Test-time adaptation for unsupervised time-series anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 13113–13121, 2024.
[20] Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. Bilinear attention networks. Advances in Neural Information Processing Systems, 31, 2018.
[21] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[22] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1931–1941, 2023.
[23] Jonghyun Lee, Dahuin Jung, Saehyung Lee, Junsung Park, Juhyeon Shin, Uiwon Hwang, and Sungroh Yoon. Entropy is not enough for test-time adaptation: From the perspective of disentangled factors. arXiv preprint arXiv:2403.07366, 2024.
[24] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M. Hospedales. Deeper, broader and artier domain generalization, 2017.
[25] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.
[26] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
[27] Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 34:9694–9705, 2021.
[28] Jian Liang, Dapeng Hu, and Jiashi Feng. Do we really need to access the source data? Source hypothesis transfer for unsupervised domain adaptation. In International Conference on Machine Learning (ICML), pages 6028–6039, 2020.
[29] Ziyi Lin, Shijie Geng, Renrui Zhang, Peng Gao, Gerard De Melo, Xiaogang Wang, Jifeng Dai, Yu Qiao, and Hongsheng Li. Frozen clip models are efficient video learners. In European Conference on Computer Vision, pages 388–404. Springer, 2022.
[30] Jiaming Liu, Senqiao Yang, Peidong Jia, Ming Lu, Yandong Guo, Wei Xue, and Shanghang Zhang. Vida: Homeostatic visual domain adapter for continual test time adaptation. arXiv preprint arXiv:2306.04344, 2023.
[31] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems, 32, 2019.
[32] Yuning Lu, Jianzhuang Liu, Yonggang Zhang, Yajing Liu, and Xinmei Tian. Prompt distribution learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5206–5215, 2022.
[33] R Thomas McCoy, Ellie Pavlick, and Tal Linzen. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. arXiv preprint arXiv:1902.01007, 2019.
[34] Marius Mosbach, Maksym Andriushchenko, and Dietrich Klakow. On the stability of fine-tuning bert: Misconceptions, explanations, and strong baselines. arXiv preprint arXiv:2006.04884, 2020.
[35] Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Yaofo Chen, Shijian Zheng, Peilin Zhao, and Mingkui Tan. Efficient test-time model adaptation without forgetting. In International Conference on Machine Learning, pages 16888–16905. PMLR, 2022.
[36] Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Zhiquan Wen, Yaofo Chen, Peilin Zhao, and Mingkui Tan. Towards stable test-time adaptation in dynamic wild world. arXiv preprint arXiv:2302.12400, 2023.
[37]

Test-time adaptation for depth completion

Hyoungseob Park, Anjali Gupta, and Alex Wong. Test-time adaptation for depth completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20519–20529, 2024

work page 2024
[38]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PmLR, 2021

work page 2021
[39]

High- resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

work page 2022
[40]

A layer selection approach to test time adaptation

Sabyasachi Sahoo, Mostafa ElAraby, Jonas Ngnawe, Yann Batiste Pequignot, Frédéric Precioso, and Christian Gagné. A layer selection approach to test time adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 20237–20245, 2025

work page 2025
[41]

Removing covariate shift improves robustness against common corruptions

Steffen Schneider, Evgenia Rusak, Luisa Eck, Oliver Bringmann, Wieland Brendel, and Matthias Bethge. Removing covariate shift improves robustness against common corruptions. CoRR, abs/2006.16971, 2020

[42] Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, and Chaowei Xiao. Test-time prompt tuning for zero-shot generalization in vision-language models. Advances in Neural Information Processing Systems, 35:14274–14289, 2022.

[43] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5018–5027, 2017.

[44] Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. arXiv preprint arXiv:2006.10726, 2020.

[45] Qin Wang, Olga Fink, Luc Van Gool, and Dengxin Dai. Continual test-time domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7201–7211, 2022.

[46] Xiang Wang, Shiwei Zhang, Jun Cen, Changxin Gao, Yingya Zhang, Deli Zhao, and Nong Sang. Clip-guided prototype modulating for few-shot action recognition. International Journal of Computer Vision, 132(6):1899–1912, 2024.

[47] Syed Talal Wasim, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, and Mubarak Shah. Vita-clip: Video and text adaptive clip via multimodal prompting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23034–23044, 2023.

[48] Baochen Xiong, Xiaoshan Yang, Yaguang Song, Yaowei Wang, and Changsheng Xu. Modality-collaborative test-time adaptation for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26732–26741, 2024.

[49] Jinyu Yang, Jiali Duan, Son Tran, Yi Xu, Sampath Chanda, Liqun Chen, Belinda Zeng, Trishul Chilimbi, and Junzhou Huang. Vision-language pre-training with triple contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15671–15680, 2022.

[50] Shiqi Yang, Joost Van de Weijer, Luis Herranz, Shangling Jui, et al. Exploiting the intrinsic neighborhood structure for source-free domain adaptation. Advances in neural information processing systems, 34:29393–29405, 2021.

[51] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? Advances in neural information processing systems, 27, 2014.

[52] Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian. Deep modular co-attention networks for visual question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6281–6290, 2019.

[53] Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, and Yi Ma. Investigating the catastrophic forgetting in multimodal large language models. arXiv preprint arXiv:2309.10313, 2023.

[54] Marvin Zhang, Sergey Levine, and Chelsea Finn. Memo: Test time robustness via adaptation and augmentation. Advances in neural information processing systems, 35:38629–38642, 2022.

[55] Renrui Zhang, Liuhui Wang, Yu Qiao, Peng Gao, and Hongsheng Li. Learning 3d representations from 2d pre-trained models via image-to-point masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21769–21780, 2023.

[56] Renrui Zhang, Ziyao Zeng, Ziyu Guo, and Yafeng Li. Can language understand depth? In Proceedings of the 30th ACM International Conference on Multimedia, pages 6868–6874, 2022.

[57] Taolin Zhang, Jinpeng Wang, Hang Guo, Tao Dai, Bin Chen, and Shu-Tao Xia. Boostadapter: Improving test-time adaptation via regional bootstrapping. arXiv preprint arXiv:2410.15430, 2024.

[58] Hao Zhao, Yuejiang Liu, Alexandre Alahi, and Tao Lin. On pitfalls of test-time adaptation. arXiv preprint arXiv:2306.03536, 2023.

[59] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16816–16825, 2022.

[60] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022.

Contents of Appendix

A Algorithm and Additional Details on SAIL
A.1 Pseudo-Codes of SAIL
A.2 Additional Detail...


[1] Jameel Abdul Samadh, Mohammad Hanan Gani, Noor Hussein, Muhammad Uzair Khattak, Muhammad Muzammal Naseer, Fahad Shahbaz Khan, and Salman H Khan. Align your prompts: Test-time prompting with distribution alignment for zero-shot generalization. Advances in Neural Information Processing Systems, 36:80396–80413, 2023.

[2] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6077–6086, 2018.

[3] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.

[4] Goirik Chakrabarty, Manogna Sreenivas, and Soma Biswas. SANTA: Source anchoring network and target alignment for continual test time adaptation. Transactions on Machine Learning Research, 2023.

[5] Sanyuan Chen, Yutai Hou, Yiming Cui, Wanxiang Che, Ting Liu, and Xiangzhan Yu. Recall and learn: Fine-tuning deep pretrained language models with less forgetting. arXiv preprint arXiv:2004.12651, 2020.

[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.

[7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.

[8] Yassir Fathullah, Guoxuan Xia, and Mark JF Gales. Logit-based ensemble distribution distillation for robust autoregressive sequence uncertainties. In Uncertainty in Artificial Intelligence, pages 582–591. PMLR, 2023.

[9] Chun-Mei Feng, Kai Yu, Yong Liu, Salman Khan, and Wangmeng Zuo. Diverse data augmentation with diffusions for effective test-time prompt tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2704–2714, 2023.

[10] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. International Journal of Computer Vision, 132(2):581–595, 2024.

[11] Kamran Ghasedi Dizaji, Amirhossein Herandi, Cheng Deng, Weidong Cai, and Heng Huang. Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In Proceedings of the IEEE international conference on computer vision, pages 5736–5745, 2017.

[12] Hang Guo, Tao Dai, Zhihao Ouyang, Taolin Zhang, Yaohua Zha, Bin Chen, and Shu-tao Xia. Refir: Grounding large restoration models with retrieval augmentation. Advances in Neural Information Processing Systems, 37:46593–46621, 2024.

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[14] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. Proceedings of the International Conference on Learning Representations, 2019.

[15] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[16] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021.

[17] Jincen Jiang, Qianyu Zhou, Yuhang Li, Xinkui Zhao, Meili Wang, Lizhuang Ma, Jian Chang, Jian Zhang, Xuequan Lu, et al. Pcotta: Continual test-time adaptation for multi-task point cloud understanding. Advances in Neural Information Processing Systems, 37:96229–96253, 2024.

[18] Adilbek Karmanov, Dayan Guan, Shijian Lu, Abdulmotaleb El Saddik, and Eric Xing. Efficient test-time adaptation of vision-language models. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

[19] Dongmin Kim, Sunghyun Park, and Jaegul Choo. When model meets new normals: test-time adaptation for unsupervised time-series anomaly detection. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 13113–13121, 2024.

[20] Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. Bilinear attention networks. Advances in neural information processing systems, 31, 2018.

[21] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

[22] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1931–1941, 2023.

[23] Jonghyun Lee, Dahuin Jung, Saehyung Lee, Junsung Park, Juhyeon Shin, Uiwon Hwang, and Sungroh Yoon. Entropy is not enough for test-time adaptation: From the perspective of disentangled factors. arXiv preprint arXiv:2403.07366, 2024.

[24] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M. Hospedales. Deeper, broader and artier domain generalization, 2017.

[25] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023.

[26] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pages 12888–12900. PMLR, 2022.

[27] Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems, 34:9694–9705, 2021.

[28] Jian Liang, Dapeng Hu, and Jiashi Feng. Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In International Conference on Machine Learning (ICML), pages 6028–6039, 2020.

[29] Ziyi Lin, Shijie Geng, Renrui Zhang, Peng Gao, Gerard De Melo, Xiaogang Wang, Jifeng Dai, Yu Qiao, and Hongsheng Li. Frozen clip models are efficient video learners. In European Conference on Computer Vision, pages 388–404. Springer, 2022.

[30] Jiaming Liu, Senqiao Yang, Peidong Jia, Ming Lu, Yandong Guo, Wei Xue, and Shanghang Zhang. Vida: Homeostatic visual domain adapter for continual test time adaptation. arXiv preprint arXiv:2306.04344, 2023.

[31] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, 32, 2019.

[32] Yuning Lu, Jianzhuang Liu, Yonggang Zhang, Yajing Liu, and Xinmei Tian. Prompt distribution learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5206–5215, 2022.

[33] R Thomas McCoy, Ellie Pavlick, and Tal Linzen. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. arXiv preprint arXiv:1902.01007, 2019.

[34] Marius Mosbach, Maksym Andriushchenko, and Dietrich Klakow. On the stability of fine-tuning bert: Misconceptions, explanations, and strong baselines. arXiv preprint arXiv:2006.04884, 2020.
