
arxiv: 2606.25478 · v2 · pith:ARM47XW6new · submitted 2026-06-24 · 💻 cs.CV

TACO: Towards Task-Consistent Open-Vocabulary Adaptation in Video Recognition

Pith reviewed 2026-06-30 09:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords: open-vocabulary video recognition · task-consistent adaptation · relative structure distillation · CLIP adaptation · cross-dataset evaluation · base-to-novel settings · representation geometry regularization · specialization projection

The pith

TACO reduces the mismatch between fine-tuning and evaluation objectives in open-vocabulary video recognition by regularizing relative geometry in representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that inconsistency between optimizing on known training data and evaluating on unseen distributions causes representations to deviate, which existing adaptation methods overlook and which harms generalization. TACO counters this by preserving OOD-relevant alignment through Relative Structure Distillation and by decoupling task-specific optimization from the test-time representation space via a lightweight projection. This approach aims to retain pretrained generalization while incorporating video-specific knowledge. A reader would care because it targets a core tension in adapting models like CLIP to new video tasks without the usual performance drop on novel categories or datasets.

Core claim

TACO mitigates the potential negative effects induced by the inconsistency between the fine-tuning and evaluation objectives, where model optimization is restricted to the known training distribution but evaluated on unseen ones. It does so in two ways. First, it proposes Relative Structure Distillation, which regularizes the relative geometry of the representation space and suppresses harmful alignment shift during training. Second, it decouples the representation space from the optimization space with a lightweight specialization projection, allowing task-specific adaptation without directly overspecializing the representations used at test time. The paper reports state-of-the-art performance on diverse benchmarks under cross-dataset and base-to-novel settings.

What carries the argument

Relative Structure Distillation, which regularizes the relative geometry of the representation space, combined with a lightweight specialization projection that decouples the representation space from the optimization space.
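
This page does not reproduce the paper's equation for Relative Structure Distillation, so the following is only one plausible reading, sketched under stated assumptions. A frozen teacher (the original CLIP) and the fine-tuned student each score a feature against a shared set of anchor vectors, and the loss is the KL divergence between the two softmax distributions over anchors. The function names, the temperature value, and the choice of KL are hypothetical and not taken from the paper.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def relative_structure(feature, anchors, temperature=0.1):
    """Distribution of cosine similarities from one feature to every anchor."""
    f = normalize(feature)
    sims = [sum(a * b for a, b in zip(f, normalize(anchor))) for anchor in anchors]
    return softmax(sims, temperature)

def rsd_loss(teacher_feature, student_feature, anchors, temperature=0.1):
    """KL(teacher || student) between the two anchor-similarity distributions.

    Assumed form for illustration; the paper's actual loss may differ.
    """
    p = relative_structure(teacher_feature, anchors, temperature)
    q = relative_structure(student_feature, anchors, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy anchors and features (hypothetical values).
anchors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
teacher = [1.0, 0.5, 0.0]
preserved = rsd_loss(teacher, [2.0, 1.0, 0.0], anchors)  # rescaled copy of the teacher
shifted = rsd_loss(teacher, [0.0, 0.5, 1.0], anchors)    # similarities to anchors reordered
```

In this sketch the penalty depends only on how a feature relates to the anchors, not on its absolute position or scale, which is the sense in which the constraint is "relative": `preserved` is numerically zero while `shifted` is large.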

If this is right

  • Adaptation preserves OOD-relevant alignment beyond the training distribution.
  • Harmful alignment shift is suppressed during training without overspecializing test representations.
  • Task-specific adaptation proceeds while the representations used at evaluation remain closer to the pretrained geometry.
  • State-of-the-art results follow on cross-dataset and base-to-novel video recognition benchmarks.
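
The decoupling described in the third point above can be illustrated with a small sketch. It assumes, hypothetically, that the specialization projection is a residual linear head that is applied only when computing the training loss and is bypassed at evaluation. The class name, the residual form, and the zero initialization are assumptions for illustration; the page does not specify the projection's architecture.

```python
def matvec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

class SpecializationProjection:
    """Hypothetical residual linear head used only on the training path."""

    def __init__(self, dim):
        # Zero initialization: the head starts out as the identity map.
        self.weight = [[0.0] * dim for _ in range(dim)]

    def __call__(self, representation):
        delta = matvec(self.weight, representation)
        return [r + d for r, d in zip(representation, delta)]

def features_for(representation, head, training):
    """Training optimizes through the head; evaluation uses the raw representation."""
    return head(representation) if training else list(representation)

head = SpecializationProjection(3)
z = [0.5, 0.25, 1.0]  # toy representation (hypothetical values)
at_init = features_for(z, head, training=True)

# Pretend training has moved one weight of the head.
head.weight[0][1] = 2.0
train_view = features_for(z, head, training=True)
test_view = features_for(z, head, training=False)
```

Under these assumptions, task-specific changes accumulate in the head, so `train_view` moves while `test_view` stays equal to the underlying representation `z`.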

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same relative-geometry constraint might reduce objective mismatch in image or audio open-vocabulary adaptation settings.
  • Combining the specialization projection with other forms of knowledge distillation could further separate optimization from evaluation spaces.
  • Measuring the change in pairwise distances among features on unseen videos before and after applying the distillation term would directly test the claimed preservation effect.
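
The measurement proposed in the last point above could be sketched as follows. This is a minimal illustration, assuming cosine distance as the metric and a mean absolute difference as the summary statistic; both choices, and all names, are hypothetical rather than taken from the paper.

```python
import math

def cosine_distance(u, v):
    """One minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def pairwise_distances(features):
    """Cosine distance for every unordered pair of samples."""
    n = len(features)
    return [
        cosine_distance(features[i], features[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]

def mean_geometry_change(before, after):
    """Mean absolute change in pairwise distance for the same samples."""
    d0 = pairwise_distances(before)
    d1 = pairwise_distances(after)
    return sum(abs(a - b) for a, b in zip(d0, d1)) / len(d0)

# Toy features for three unseen videos (hypothetical values).
before = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rotated = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]    # same geometry, rotated by 90 degrees
collapsed = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.05]]   # samples pulled toward one direction

rotation_change = mean_geometry_change(before, rotated)
collapse_change = mean_geometry_change(before, collapsed)
```

A rigid rotation leaves the statistic at zero even though every individual feature moved, whereas a collapse toward one direction raises it. That distinction is what separates a relative-geometry measure from a per-sample deviation measure.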

Load-bearing premise

The observed deviation of representations beyond the fine-tuning data distribution is primarily inherited from the inconsistency between fine-tuning and evaluation objectives, and preserving relative geometry alignment will suppress harmful shift without introducing new trade-offs.

What would settle it

A controlled experiment in which models trained with Relative Structure Distillation still exhibit large representation deviations on held-out video distributions, with no corresponding gain in base-to-novel or cross-dataset accuracy, would falsify the central mechanism.

Figures

Figures reproduced from arXiv: 2606.25478 by Chengju Liu, Liuyi Wang, Mengxian Hu, Minghao Zhu, Qijun Chen, Xiao Lin, Xiaoyan Qi, Xun Zhou.

Figure 1: (a) Standard fine-tuning aims to align visual and text embeddings within the known training …
Figure 2: Replacing the encoders of the standard fine-tuning model and our model with the original CLIP encoders. A closer look at preserving generalization. To better understand the essence of preserved generalization, we investigate the impact of each fine-tuned encoder on generalization by replacing it with the corresponding CLIP’s original encoders during evaluation. We show the harmonic mean of zero-shot per…
Figure 3: Similarity distributions of visual embeddings between the CLIP model and various fine-tuned models in OOD space (UCF, HMDB, and K600). Characterizing representation deviation in OOD Space. To support the above hypothesis, we investigate how representations deviate in OOD space during fine-tuning. Specifically, …
Figure 4: Alignment shift D_KL and generalization performance of various adapted models in OOD space (UCF, HMDB, and K600). Quantifying alignment shift in adaptation. Beyond the deviation of individual representations in OOD space, a more critical issue is whether such deviation further disrupts the original cross-modal alignment. We therefore quantify the resulting alignment shift and examine its relationship wi…
Figure 5: An overview of the TACO framework (left) and the source of geometric anchors (right).
Figure 6: Max cosine similarities between random geometric anchors and ID text representations. Analysis of Random Geometric Anchors Overlapping with ID Space. As discussed in the main text, while our proposed random geometric anchors could theoretically overlap with the in-distribution training distribution, this probability is negligible during fine-tuning. This is due to the fact that CLIP’s representation space…
Figure 7: Visualization of model attention maps. Discussion on image-domain generalization. Although TACO is designed for open-vocabulary video adaptation, its benefit is not limited to the video domain. As shown in …
Figure 8: t-SNE [30] visualization of the CLIP model on UCF-101. Baseline: UCF (82.8%)
Figure 9: t-SNE [30] visualization of the standard fine-tuning model on UCF-101. TACO: UCF (84.9%)
Figure 10: t-SNE [30] visualization of our TACO model on UCF-101.
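
The Figure 6 caption states that overlap between random geometric anchors and in-distribution text representations is negligible, but the stated reason is cut off on this page. Independently of that reason, one general fact that makes small overlap plausible is that random directions in a high-dimensional space are nearly orthogonal. The sketch below illustrates only this dimensionality effect. The embedding width of 512, the anchor count, and the use of random vectors as stand-ins for text embeddings are all assumptions; real embeddings are not uniformly distributed.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Draw a random direction by normalizing Gaussian noise."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def max_cosine_similarity(anchors, references):
    """Largest cosine similarity between any anchor and any reference (unit-norm inputs)."""
    return max(
        sum(a * r for a, r in zip(anchor, ref))
        for anchor in anchors
        for ref in references
    )

rng = random.Random(0)
dim = 512  # assumed embedding width, for illustration only

anchors = [random_unit_vector(dim, rng) for _ in range(200)]
# Stand-ins for in-distribution text embeddings; real ones would come from a text encoder.
references = [random_unit_vector(dim, rng) for _ in range(5)]
overlap = max_cosine_similarity(anchors, references)

# The same experiment in two dimensions, where random directions collide easily.
low_anchors = [random_unit_vector(2, rng) for _ in range(200)]
low_references = [random_unit_vector(2, rng) for _ in range(5)]
low_dim_overlap = max_cosine_similarity(low_anchors, low_references)
```

Comparing `overlap` with `low_dim_overlap` shows how much the embedding dimension alone reduces the chance that a random anchor lands close to a given direction.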
read the original abstract

Adapting CLIP for open-vocabulary video recognition necessitates a delicate balance between newly acquired video knowledge and the pretrained generalization. While existing studies pursue this generalization-specialization trade-off with additional regularizations or constraints, we argue that they overlook the deviation of representations beyond the fine-tuning data distribution, resulting in suboptimal adaptation effects. We believe such deviation is inherited from the inconsistency between the fine-tuning and evaluation objectives, where model optimization is restricted to the known training distribution but evaluated on unseen ones. In this paper, we introduce TACO, a simple yet effective framework to mitigate the potential negative effects induced by this inconsistency. Our key insight is that adaptation should preserve OOD-relevant alignment beyond the training distribution. To this end, we propose Relative Structure Distillation, which regularizes the relative geometry of the representation space and suppresses harmful alignment shift during training. We further decouple the representation space from the optimization space with a lightweight specialization projection, allowing task-specific adaptation without directly overspecializing the representations used at test time. TACO establishes state-of-the-art performance on diverse benchmarks under cross-dataset and base-to-novel settings. Code will be released at https://github.com/ZMHH-H/TACO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes TACO for adapting CLIP to open-vocabulary video recognition. It identifies an inconsistency between fine-tuning (restricted to training distribution) and evaluation (on unseen distributions) as the source of harmful representation deviation, and introduces Relative Structure Distillation to regularize relative geometry in the representation space plus a lightweight specialization projection to decouple representation from optimization space. The framework is claimed to achieve state-of-the-art results on diverse benchmarks under cross-dataset and base-to-novel settings.

Significance. If the empirical claims are substantiated, the work offers a lightweight, conceptually clean approach to preserving OOD-relevant alignment without heavy additional constraints, which could meaningfully advance practical open-vocabulary video adaptation.

major comments (1)
  1. The provided abstract (and the high-level description in the reader's summary) supplies no equations, implementation details, baselines, ablation studies, or error analysis. The central SOTA claim therefore cannot be evaluated from the given material; the full manuscript must include these in the methods (§3) and experiments (§4) sections for the contribution to be assessable.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the review. The comment appears to focus on the abstract, but the full manuscript contains the requested details in the methods and experiments sections. We address the point below.

read point-by-point responses
  1. Referee: The provided abstract (and the high-level description in the reader's summary) supplies no equations, implementation details, baselines, ablation studies, or error analysis. The central SOTA claim therefore cannot be evaluated from the given material; the full manuscript must include these in the methods (§3) and experiments (§4) sections for the contribution to be assessable.

    Authors: The full manuscript includes equations for Relative Structure Distillation and the specialization projection in §3 (Methods). Section 4 (Experiments) reports implementation details, multiple baselines, ablation studies on each component, cross-dataset and base-to-novel results, and error analysis. The abstract is intentionally high-level per standard practice; the SOTA claims are substantiated by the quantitative results and ablations in the full text.

    Revision: no

Circularity Check

0 steps flagged

No circularity: method proposal is declarative and empirically grounded

full rationale

The paper introduces TACO as a framework with Relative Structure Distillation and a specialization projection to address objective inconsistency in CLIP adaptation for video recognition. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on the stated insight about OOD deviation and are validated through benchmark experiments rather than reducing to inputs by construction. This is a standard empirical method paper with self-contained logical structure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract contains no equations, fitted constants, or new postulated entities. No free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.1-grok · 5771 in / 1254 out tokens · 40206 ms · 2026-06-30T09:35:32.436824+00:00 · methodology


Reference graph

Works this paper leans on

56 extracted references · 10 canonical work pages · 6 internal anchors

  1. [1]

    Leveraging vision-language models for improving domain generalization in image classification

    Sravanti Addepalli, Ashish Ramayee Asokan, Lakshay Sharma, and R Venkatesh Babu. Leveraging vision-language models for improving domain generalization in image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 23922–23932, 2024

  2. [2]

    Frozen in time: A joint video and image encoder for end-to-end retrieval

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021

  3. [3]

    A Short Note about Kinetics-600

    Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about kinetics-600. arXiv preprint arXiv:1808.01340, 2018

  4. [4]

    Elaborative rehearsal for zero-shot action recognition

    Shizhe Chen and Dong Huang. Elaborative rehearsal for zero-shot action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021

  5. [5]

    Ost: Refining text knowledge with optimal spatio-temporal descriptor for general video recognition

    Tongjia Chen, Hongshan Yu, Zhengeng Yang, Zechuan Li, Wei Sun, and Chen Chen. Ost: Refining text knowledge with optimal spatio-temporal descriptor for general video recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  6. [6]

    Cat-seg: Cost aggregation for open-vocabulary semantic segmentation

    Seokju Cho, Heeseong Shin, Sunghwan Hong, Anurag Arnab, Paul Hongsuck Seo, and Seungryong Kim. Cat-seg: Cost aggregation for open-vocabulary semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4113–4123, 2024

  7. [7]

    Enabling multimodal generation on clip via vision-language knowledge distillation

    Wenliang Dai, Lu Hou, Lifeng Shang, Xin Jiang, Qun Liu, and Pascale Fung. Enabling multimodal generation on clip via vision-language knowledge distillation. arXiv preprint arXiv:2203.06386, 2022

  8. [8]

    Estimating the intrinsic dimension of datasets by a minimal neighborhood information

    Elena Facco, Maria d’Errico, Alex Rodriguez, and Alessandro Laio. Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific reports, 2017

  9. [9]

    Groundvts: Visual token sampling in multimodal large language models for video temporal grounding

    Rong Fan, Kaiyan Xiao, Minghao Zhu, Liuyi Wang, Kai Dai, and Zhao Yang. Groundvts: Visual token sampling in multimodal large language models for video temporal grounding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2026

  10. [10]

    The "something something" video database for learning and evaluating visual common sense

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2017

  11. [11]

    Open-vocabulary object detection via vision and language knowledge distillation

    Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation. In International Conference on Learning Representations (ICLR), 2022

  12. [12]

    Mmrl: Multi-modal representation learning for vision-language models

    Yuncheng Guo and Xiaodong Gu. Mmrl: Multi-modal representation learning for vision-language models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  13. [13]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022

  14. [14]

    Efficient text-driven motion generation via latent consistency training

    Mengxian Hu, Minghao Zhu, Xun Zhou, Qingqing Yan, Shu Li, Chengju Liu, and Qijun Chen. Efficient text-driven motion generation via latent consistency training. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2026

  15. [15]

    Froster: Frozen clip is a strong teacher for open-vocabulary action recognition

    Xiaohu Huang, Hao Zhou, Kun Yao, and Kai Han. Froster: Frozen clip is a strong teacher for open-vocabulary action recognition. In International Conference on Learning Representations (ICLR), 2024

  16. [16]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

  17. [17]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning (ICML), pages 4904–4916, 2021

  18. [18]

    Prompting visual-language models for efficient video understanding

    Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, and Weidi Xie. Prompting visual-language models for efficient video understanding. In Proceedings of the European Conference on Computer Vision (ECCV), pages 105–124, 2022

  19. [19]

    Prompting visual-language models for efficient video understanding

    Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, and Weidi Xie. Prompting visual-language models for efficient video understanding. In Proceedings of the European Conference on Computer Vision (ECCV), 2022

  20. [20]

    The Kinetics Human Action Video Dataset

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017

  21. [21]

    Maple: Multi-modal prompt learning

    Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  22. [22]

    Self-regulating prompts: Foundational model adaptation without forgetting

    Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Self-regulating prompts: Foundational model adaptation without forgetting. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

  23. [23]

    Hmdb: A large video database for human motion recognition

    H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. Hmdb: A large video database for human motion recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2556–2563, 2011

  24. [24]

    Fine-tuning can distort pretrained features and underperform out-of-distribution

    Ananya Kumar, Aditi Raghunathan, Robbie Matthew Jones, Tengyu Ma, and Percy Liang. Fine-tuning can distort pretrained features and underperform out-of-distribution. In International Conference on Learning Representations (ICLR), 2022

  25. [25]

    Promptkd: Unsupervised prompt distillation for vision-language models

    Zheng Li, Xiang Li, Xinyi Fu, Xin Zhang, Weiqiang Wang, Shuo Chen, and Jian Yang. Promptkd: Unsupervised prompt distillation for vision-language models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  26. [26]

    Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning

    Victor Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Y Zou. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  27. [27]

    Match, expand and improve: Unsupervised finetuning for zero-shot action recognition with language knowledge

    Wei Lin, Leonid Karlinsky, Nina Shvetsova, Horst Possegger, Mateusz Kozinski, Rameswar Panda, Rogerio Feris, Hilde Kuehne, and Horst Bischof. Match, expand and improve: Unsupervised finetuning for zero-shot action recognition with language knowledge. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

  28. [28]

    Clipose: Category-level object pose estimation with pre-trained vision-language knowledge

    Xiao Lin, Minghao Zhu, Ronghao Dang, Guangliang Zhou, Shaolong Shu, Feng Lin, Chengju Liu, and Qijun Chen. Clipose: Category-level object pose estimation with pre-trained vision-language knowledge. IEEE Transactions on Circuits and Systems for Video Technology, 2024

  29. [29]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  30. [30]

    Visualizing data using t-sne

    Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 2008

  31. [31]

    Some methods of classification and analysis of multivariate observations

    James B McQueen. Some methods of classification and analysis of multivariate observations. In Proc. of 5th Berkeley Symposium on Math. Stat. and Prob., pages 281–297, 1967

  32. [32]

    Improving zero-shot generalization of learned prompts via unsupervised knowledge distillation

    Marco Mistretta, Alberto Baldrati, Marco Bertini, and Andrew D Bagdanov. Improving zero-shot generalization of learned prompts via unsupervised knowledge distillation. In Proceedings of the European Conference on Computer Vision (ECCV), 2024

  33. [33]

    Expanding language-image pretrained models for general video recognition

    Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, and Haibin Ling. Expanding language-image pretrained models for general video recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pages 1–18, 2022

  34. [34]

    St-adapter: Parameter-efficient image-to-video transfer learning

    Junting Pan, Ziyi Lin, Xiatian Zhu, Jing Shao, and Hongsheng Li. St-adapter: Parameter-efficient image-to-video transfer learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 26462–26477, 2022

  35. [35]

    Clipping: Distilling clip-based models with a student base for video-language retrieval

    Renjing Pei, Jianzhuang Liu, Weimian Li, Bin Shao, Songcen Xu, Peng Dai, Juwei Lu, and Youliang Yan. Clipping: Distilling clip-based models with a student base for video-language retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  36. [36]

    Disentangling spatial and temporal learning for efficient image-to-video transfer learning

    Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Yingya Zhang, Changxin Gao, Deli Zhao, and Nong Sang. Disentangling spatial and temporal learning for efficient image-to-video transfer learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

  37. [37]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), pages 8748–8763, 2021

  38. [38]

    Fine-tuned clip models are efficient video learners

    Hanoona Rasheed, Muhammad Uzair Khattak, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Fine-tuned clip models are efficient video learners. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6545–6554, 2023

  39. [39]

    UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

    Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012

  40. [40]

    Actionclip: A new paradigm for video action recognition

    Mengmeng Wang, Jiazheng Xing, and Yong Liu. Actionclip: A new paradigm for video action recognition. arXiv preprint arXiv:2109.08472, 2021

  41. [41]

    Vita-clip: Video and text adaptive clip via multimodal prompting

    Syed Talal Wasim, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, and Mubarak Shah. Vita-clip: Video and text adaptive clip via multimodal prompting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 23034–23044, 2023

  42. [42]

    Open-vclip: Transforming clip to an open-vocabulary video model via interpolated weight optimization

    Zejia Weng, Xitong Yang, Ang Li, Zuxuan Wu, and Yu-Gang Jiang. Open-vclip: Transforming clip to an open-vocabulary video model via interpolated weight optimization. In International Conference on Machine Learning (ICML), pages 36978–36989, 2023

  43. [43]

    Robust fine-tuning of zero-shot models

    Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, et al. Robust fine-tuning of zero-shot models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  44. [44]

    Revisiting classifier: Transferring vision-language models for video recognition

    Wenhao Wu, Zhun Sun, and Wanli Ouyang. Revisiting classifier: Transferring vision-language models for video recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2847–2855, 2023

  45. [45]

    Bidirectional cross-modal knowledge exploration for video recognition with pre-trained vision-language models

    Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, and Wanli Ouyang. Bidirectional cross-modal knowledge exploration for video recognition with pre-trained vision-language models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6620–6630, 2023

  46. [46]

    Clip-kd: An empirical study of clip model distillation

    Chuanguang Yang, Zhulin An, Libo Huang, Junyu Bi, Xinqiang Yu, Han Yang, Boyu Diao, and Yongjun Xu. Clip-kd: An empirical study of clip model distillation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 15952–15962, 2024

  47. [47]

    Mma: Multi-modal adapter for vision-language models

    Lingxiao Yang, Ru-Yuan Zhang, Yanchen Wang, and Xiaohua Xie. Mma: Multi-modal adapter for vision-language models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  48. [48]

    Aim: Adapting image models for efficient video action recognition

    Taojiannan Yang, Yi Zhu, Yusheng Xie, Aston Zhang, Chen Chen, and Mu Li. Aim: Adapting image models for efficient video action recognition. In International Conference on Learning Representations (ICLR), 2022

  49. [49]

    Learning to generalize without bias for open-vocabulary action recognition

    Yating Yu, Congqi Cao, Yifan Zhang, and Yanning Zhang. Learning to generalize without bias for open-vocabulary action recognition. arXiv preprint arXiv:2502.20158, 2025

  50. [50]

    Florence: A New Foundation Model for Computer Vision

    Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021

  51. [51]

    Learning to prompt for vision-language models

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision (IJCV), 130(9):2337–2348, 2022

  52. [52]

    Conditional prompt learning for vision-language models

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  53. [53]

    Complementary relation contrastive distillation

    Jinguo Zhu, Shixiang Tang, Dapeng Chen, Shijie Yu, Yakun Liu, Mingzhe Rong, Aijun Yang, and Xiaohua Wang. Complementary relation contrastive distillation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021

  54. [54]

    Fine-grained spatiotemporal motion alignment for contrastive video representation learning

    Minghao Zhu, Xiao Lin, Ronghao Dang, Chengju Liu, and Qijun Chen. Fine-grained spatiotemporal motion alignment for contrastive video representation learning. In Proceedings of the 31st ACM International Conference on Multimedia, pages 4725–4736, 2023

  55. [55]

    Mote: Reconciling generalization with specialization for visual-language to video knowledge transfer

    Minghao Zhu, Zhengpu Wang, Mengxian Hu, Ronghao Dang, Xiao Lin, Xun Zhou, Chengju Liu, and Qijun Chen. Mote: Reconciling generalization with specialization for visual-language to video knowledge transfer. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  56. [56]

    Orthogonal temporal interpolation for zero-shot video recognition

    Yan Zhu, Junbao Zhuo, Bin Ma, Jiajia Geng, Xiaoming Wei, Xiaolin Wei, and Shuhui Wang. Orthogonal temporal interpolation for zero-shot video recognition. In Proceedings of the 31st ACM International Conference on Multimedia, pages 7491–7501, 2023