Test-Time Distillation for Continual Model Adaptation
Pith reviewed 2026-05-19 11:01 UTC · model grok-4.3
The pith
Reframing continual test-time adaptation as distillation from a frozen vision-language model with MSP-based fusion and optimal transport rectification prevents error amplification and enables stable unsupervised adaptation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that test-time distillation guided by a frozen VLM overcomes self-referential error amplification in CTTA, and that it does so in two steps. First, it builds a blended teacher by dynamically fusing the predictions of the VLM and the target model, weighted by maximum softmax probability (MSP) to circumvent entropy bias. Second, it applies optimal transport-based rectification to align the target model's outputs with this teacher, which is meant to keep continual adaptation stable across distribution shifts.
What carries the argument
CoDiRe framework that constructs a robust blended teacher via MSP-weighted dynamic fusion of a generalist VLM and the task-specific target model, then uses Optimal Transport rectification to enforce alignment during adaptation.
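As a minimal sketch of the MSP-weighted fusion, assume each model outputs a probability vector over the same classes. A passage quoted in the Lean-theorems section of this page gives a weight λ as a softmax over two maximum softmax probabilities; the code below assumes CoDiRe's fusion weight takes that form. The function names and the convex combination of the two distributions are illustrative assumptions, not the authors' code.

```python
import math

def msp_weight(p_vlm, p_target):
    """Weight on the VLM: softmax over the two maximum softmax probabilities."""
    a = math.exp(max(p_vlm))
    b = math.exp(max(p_target))
    return a / (a + b)

def blended_teacher(p_vlm, p_target):
    """Convex combination of the two predictions (assumed blending rule)."""
    lam = msp_weight(p_vlm, p_target)
    return [lam * v + (1.0 - lam) * t for v, t in zip(p_vlm, p_target)]

# Hypothetical 3-class predictions for one test sample.
p_vlm = [0.70, 0.20, 0.10]      # confident generalist
p_target = [0.40, 0.35, 0.25]   # uncertain task model
teacher = blended_teacher(p_vlm, p_target)
```

Under this form the two MSP values differ by less than 1, so the weight always stays between roughly 0.27 and 0.73, and neither model is ever ignored entirely.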
If this is right
- The target model achieves stable adaptation without drifting into amplified errors from self-supervision loops.
- Adaptation costs less time than self-supervised baselines while delivering higher accuracy: on ImageNet-C the abstract reports 48% of CoTTA's time cost and a 10.55% improvement over it.
- The approach works with heterogeneous architectures because MSP avoids reliance on comparable entropy scales.
- Continuous rectification keeps predictions aligned with the blended teacher across sequential shifts.
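This page does not spell out the paper's OT formulation, so the following is only a generic illustration of OT-style rectification. It is a Sinkhorn-Knopp normalization that treats a batch of teacher probabilities as a transport plan and rescales it toward uniform class marginals. The uniform-marginal target, the iteration count, and the function name are all assumptions.

```python
def sinkhorn_rectify(probs, n_iters=50):
    """Alternately rescale columns and rows of a strictly positive matrix.

    Column step: each class receives total mass n / c (assumed uniform prior).
    Row step: each sample's row is renormalized to sum to 1.
    """
    n, c = len(probs), len(probs[0])
    q = [row[:] for row in probs]
    for _ in range(n_iters):
        col = [sum(row[k] for row in q) for k in range(c)]
        q = [[row[k] * (n / c) / col[k] for k in range(c)] for row in q]
        rectified = []
        for row in q:
            s = sum(row)
            rectified.append([v / s for v in row])
        q = rectified
    return q

# Hypothetical batch of 4 samples and 2 classes, all leaning toward class 0.
teacher_batch = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
rectified = sinkhorn_rectify(teacher_batch)
```

In this toy batch every sample initially favors class 0. After rectification the two least confident samples favor class 1, because the balanced marginal forces half of the total mass onto each class.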
Where Pith is reading between the lines
- The same MSP-plus-rectification pattern could stabilize multi-model fusion in other unsupervised settings such as domain generalization or federated learning.
- Replacing the VLM with a different frozen generalist model might transfer the benefits to non-vision modalities if a comparable confidence metric exists.
- The method suggests that explicit rectification steps can compensate for imperfect teachers in continual learning pipelines.
- Testing on longer sequences of shifts would reveal whether the stability gains persist beyond the evaluated benchmarks.
Load-bearing premise
Maximum softmax probability provides a reliably superior confidence signal for weighting predictions from heterogeneous models with different calibrations under distribution shifts.
What would settle it
An experiment on ImageNet-C or similar benchmarks where an entropy-based fusion variant of the same VLM-plus-target setup achieves higher accuracy or lower drift than the MSP-weighted version would falsify the central advantage of the proposed fusion step.
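One way to set up such a comparison is to hold everything fixed and swap only the weighting function. The sketch below assumes a convex-combination fusion and an entropy baseline that takes a softmax over negative entropies; the paper's own entropy-based variant may be defined differently, and all names here are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy in nats, skipping zero entries."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def weight_msp(p_vlm, p_target):
    a, b = math.exp(max(p_vlm)), math.exp(max(p_target))
    return a / (a + b)

def weight_entropy(p_vlm, p_target):
    # Lower entropy gives a higher weight (assumed form of the baseline).
    a, b = math.exp(-entropy(p_vlm)), math.exp(-entropy(p_target))
    return a / (a + b)

def weight_uniform(p_vlm, p_target):
    return 0.5

def fuse(p_vlm, p_target, weight_fn):
    lam = weight_fn(p_vlm, p_target)
    return [lam * v + (1.0 - lam) * t for v, t in zip(p_vlm, p_target)]

def accuracy(batch, labels, weight_fn):
    """Top-1 accuracy of the fused prediction over (p_vlm, p_target) pairs."""
    hits = 0
    for (p_vlm, p_target), y in zip(batch, labels):
        fused = fuse(p_vlm, p_target, weight_fn)
        hits += int(fused.index(max(fused)) == y)
    return hits / len(labels)
```

Calling `accuracy` with each of the three weighting functions on the same stream of predictions isolates the effect of the weighting scheme from the rest of the pipeline.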
Original abstract
Deep neural networks often suffer performance degradation upon deployment due to distribution shifts. Continual Test-Time Adaptation (CTTA) aims to address this issue in an unsupervised manner. However, existing methods that rely on self-supervision are prone to an inherent self-referential feedback loop that amplifies initial prediction errors, leading to model drift. We revisit this limitation and propose Test-Time Distillation (TTD), which reframes adaptation as a distillation process guided by a frozen Vision-Language Model (VLM) as an external signal. While promising, we find that direct distillation is fraught with two pitfalls: (1) the Generalist Trap, where the VLM's broad but non-specialized knowledge leads to suboptimal performance on specific tasks and shifts; and (2) the Entropy Bias, where naive model fusion techniques based on entropy fail due to the disparate calibration of heterogeneous models. These pitfalls highlight the need to build a robust supervisory signal and leverage it to guide the target model toward stable adaptation. Hence, we present CoDiRe, a Continual Distillation and Rectification framework for TTD. CoDiRe first constructs a robust blended teacher by dynamically fusing the predictions of the VLM and the target model. Critically, it circumvents the Entropy Bias by leveraging Maximum Softmax Probability (MSP) as a more reliable confidence metric for weighting each model's expertise. Then it applies an Optimal Transport-based rectification to further align predictions with the blended teacher, enabling continuous and stable adaptation. Extensive experiments show that CoDiRe outperforms state-of-the-art baselines, exceeding CoTTA by 10.55% with only 48% of its time cost on ImageNet-C. Project page is publicly available at https://github.com/walawalagoose/TTD.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CoDiRe, a Continual Distillation and Rectification framework for test-time distillation in continual test-time adaptation (CTTA). It addresses self-referential drift in prior self-supervised CTTA methods by using a frozen vision-language model (VLM) as an external teacher signal. The method constructs a blended teacher via dynamic MSP-weighted fusion of the VLM and target model predictions to avoid the Generalist Trap and Entropy Bias, then applies Optimal Transport rectification to align the target model's outputs with this teacher for stable adaptation. Experiments on ImageNet-C report that CoDiRe exceeds CoTTA by 10.55% while using only 48% of its time cost.
Significance. If the empirical claims hold after verification, the work would be significant for the CTTA literature by demonstrating a practical way to leverage generalist VLMs for stable, efficient adaptation without error amplification. The public GitHub repository supports reproducibility, and the combination of MSP-based fusion with OT rectification offers a concrete mechanism to mitigate two identified pitfalls in distillation-based TTD. The efficiency gain alongside accuracy improvement could influence deployment of adaptive models under distribution shift.
major comments (2)
- [§3.2] §3.2 (Blended Teacher Construction): The central claim that MSP provides a reliably superior confidence signal for weighting the VLM and target model (to circumvent Entropy Bias) is load-bearing for the quality of the supervisory signal and thus for the reported 10.55% gain; however, the manuscript provides no controlled ablation isolating MSP-weighted fusion against entropy-based weighting or uniform averaging, leaving open the possibility that gains arise from other unablated components such as the OT formulation or learning-rate schedule.
- [§4] §4 (Experiments), Table reporting ImageNet-C results: The headline performance comparison (CoDiRe vs. CoTTA) lacks details on the number of runs, standard deviation, or statistical significance tests, and the exact implementation choices for baselines (including CoTTA) are not fully specified, which weakens verifiability of the central empirical claim given the heterogeneous architectures involved.
minor comments (2)
- [Abstract and §3] The abstract and method sections use the term 'Entropy Bias' without a formal definition or equation; adding a short mathematical characterization would improve clarity.
- [Figure 1] Figure captions for the overall framework diagram could explicitly label the MSP fusion and OT rectification blocks to aid readers in tracing the algorithmic flow.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments highlight important aspects for improving clarity and verifiability. We address each major comment point by point below, indicating the revisions we will incorporate in the next version of the manuscript.
-
Referee: [§3.2] §3.2 (Blended Teacher Construction): The central claim that MSP provides a reliably superior confidence signal for weighting the VLM and target model (to circumvent Entropy Bias) is load-bearing for the quality of the supervisory signal and thus for the reported 10.55% gain; however, the manuscript provides no controlled ablation isolating MSP-weighted fusion against entropy-based weighting or uniform averaging, leaving open the possibility that gains arise from other unablated components such as the OT formulation or learning-rate schedule.
Authors: We agree that a dedicated ablation isolating the MSP-weighted fusion would provide stronger empirical support for this design choice. Our analysis in §3.2 motivates MSP on the basis of calibration differences between the VLM and target model, but we will add a controlled ablation study (comparing MSP weighting against entropy-based weighting and uniform averaging) to the revised manuscript. This will quantify the isolated contribution of the weighting scheme while keeping the OT rectification and other components fixed.
Revision: yes
-
Referee: [§4] §4 (Experiments), Table reporting ImageNet-C results: The headline performance comparison (CoDiRe vs. CoTTA) lacks details on the number of runs, standard deviation, or statistical significance tests, and the exact implementation choices for baselines (including CoTTA) are not fully specified, which weakens verifiability of the central empirical claim given the heterogeneous architectures involved.
Authors: We concur that additional statistical details and implementation specifics are necessary for full reproducibility and to substantiate the 10.55% gain. In the revised manuscript we will report results over multiple independent runs (with the exact number stated), include standard deviations, and add statistical significance tests for the primary comparisons. We will also expand the experimental section with precise hyperparameter settings and adaptation details for all baselines, including CoTTA, to ensure fair comparison across heterogeneous models.
Revision: yes
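The multi-run reporting promised in the second response could look roughly like the sketch below, which summarizes per-seed accuracies as mean and sample standard deviation and computes a paired t statistic across seeds. The per-seed numbers are made-up placeholders, not results from the paper, and the choice of a paired t-test is an assumption about which significance test the authors would use.

```python
import math
import statistics

def summarize(accs):
    """Mean and sample standard deviation over independent runs."""
    return statistics.mean(accs), statistics.stdev(accs)

def paired_t_statistic(a, b):
    """t statistic for the paired per-seed differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / std_err

# Placeholder per-seed accuracies (illustrative only, NOT from the paper).
method_a = [0.61, 0.63, 0.62, 0.64, 0.60]
method_b = [0.55, 0.56, 0.54, 0.57, 0.53]

mean_a, std_a = summarize(method_a)
t_stat = paired_t_statistic(method_a, method_b)
```

The statistic is then compared against a t distribution with one fewer degree of freedom than the number of seeds. Pairing by seed only makes sense if both methods see the same corruption order and data stream for a given seed.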
Circularity Check
No significant circularity in derivation chain
The paper defines CoDiRe through explicit algorithmic components: MSP-weighted fusion to build a blended teacher (to address Entropy Bias) followed by Optimal Transport rectification. These steps are introduced as novel responses to identified pitfalls (Generalist Trap and Entropy Bias) and are evaluated via external comparisons on benchmarks such as ImageNet-C against baselines like CoTTA. No equations or self-citations are shown to reduce the reported performance gains or the supervisory signal to fitted parameters or prior self-referential inputs by construction. The derivation remains independent of the target results.
Axiom & Free-Parameter Ledger
free parameters (1)
- MSP-based fusion weights
axioms (1)
- Domain assumption: A frozen generalist VLM provides a more stable external signal than self-supervision for continual adaptation.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: $\lambda = \dfrac{\exp(\max(p^{\mathrm{VLM}}_i))}{\exp(\max(p^{\mathrm{VLM}}_i)) + \exp(\max(p^{\mathrm{Ada}}_i))}$; $\mathcal{L}_{\mathrm{Align}} = -\frac{1}{n} \sum p_{ic} \log p^{\mathrm{Ada}}_{ic}$ + category-balance term; $\mathrm{GDI}_t = \cos(\delta_t, \delta_{\mathrm{anchor}})$
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Entire SAIL pipeline (batch-wise AdaptNet update, no VLM modification, no augmentation)
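The alignment term quoted in the first entry above is a cross-entropy between soft teacher probabilities and the adapted model's probabilities. A minimal sketch follows, assuming the sum runs over both samples i and classes c (the quoted passage does not show the summation indices). The category-balance term is left out because the passage does not define it, and the `eps` guard is an added assumption for numerical safety.

```python
import math

def alignment_loss(teacher, student, eps=1e-12):
    """Soft cross-entropy: -(1/n) * sum_i sum_c teacher[i][c] * log(student[i][c])."""
    n = len(teacher)
    total = 0.0
    for p_row, q_row in zip(teacher, student):
        total += sum(p * math.log(q + eps) for p, q in zip(p_row, q_row))
    return -total / n

# Hypothetical single-sample batch with two classes.
teacher = [[0.9, 0.1]]
loss_matched = alignment_loss(teacher, [[0.9, 0.1]])
loss_uniform = alignment_loss(teacher, [[0.5, 0.5]])
```

The loss is smallest when the student reproduces the teacher distribution, so `loss_matched` is lower than `loss_uniform` here.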
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.