pith. machine review for the scientific record.

arxiv: 2511.22125 · v2 · submitted 2025-11-27 · 💻 cs.CV

GA2-CLIP: Generic Attribute Anchor for Efficient Prompt Tuning in Video-Language Models

Pith reviewed 2026-05-17 05:13 UTC · model grok-4.3

classification 💻 cs.CV
keywords prompt tuning · video-language models · generalization · vision-language models · generic anchors · base-to-new prediction · CLIP

The pith

Generic attribute anchors from pre-trained prompts and negative videos allow video prompt tuning without losing generalization to new classes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard soft prompt tuning on video tasks causes the model to overfit to the training classes and lose accuracy on unseen ones. To fix this without weakening the soft prompts, the authors add hard prompt tokens taken from prompts pre-trained on other datasets, couple them to the soft tokens through a learnable mapping layer, and use irrelevant videos paired with negative prompts as anchors. Together these keep the semantic space from narrowing. A reader would care because the approach adapts VLMs to video tasks while preserving their generalization to new classes.

Core claim

By concatenating pre-trained prompts from other datasets as hard tokens with soft prompt tokens and coupling them via a learnable mapping layer, along with introducing generic attribute anchors consisting of irrelevant video sets and negative prompts, the method prevents the semantic space from narrowing and overfitting to supervised categories during fine-tuning on video tasks.
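The coupling step can be illustrated with a small sketch. This is not the authors' implementation: the token order (soft, then mapped hard, then class name), the choice of a single linear map, and the identity initialization are all assumptions made for illustration, and the names `linear_map` and `build_coupled_prompt` are hypothetical.

```python
import random

def linear_map(tokens, weight, bias):
    """Apply y = W x + b to every token (each token is a list of floats)."""
    return [
        [sum(w * x for w, x in zip(row, tok)) + b for row, b in zip(weight, bias)]
        for tok in tokens
    ]

def build_coupled_prompt(soft_tokens, hard_tokens, weight, bias, class_tokens):
    """Concatenate soft tokens, mapped hard tokens, and class-name tokens.

    Assumption: only soft_tokens, weight, and bias would be trained;
    hard_tokens come from prompts pre-trained on another dataset and stay frozen.
    """
    mapped_hard = linear_map(hard_tokens, weight, bias)
    return soft_tokens + mapped_hard + class_tokens

random.seed(0)
dim = 4  # toy embedding width, far smaller than a real text encoder
soft = [[random.gauss(0, 0.02) for _ in range(dim)] for _ in range(3)]
hard = [[random.gauss(0, 1.0) for _ in range(dim)] for _ in range(2)]
cls = [[random.gauss(0, 1.0) for _ in range(dim)] for _ in range(1)]

# Assumed identity initialization: mapped hard tokens start as the pre-trained ones.
W = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
b = [0.0] * dim

prompt = build_coupled_prompt(soft, hard, W, b, cls)
print(len(prompt), len(prompt[0]))  # 6 tokens, each of width 4
```

With the identity initialization assumed here, the hard tokens enter the prompt unchanged at the start of training, so any drift away from the pre-trained prompts would have to be learned through the mapping layer.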

What carries the argument

Generic attribute anchor using pre-trained hard prompts coupled to soft prompts and negative video prompts to maintain broad semantic relevance.

Load-bearing premise

That externally added pre-trained prompts and negative prompts from irrelevant videos will maintain generic relevance in the semantic space without introducing biases or impairing soft prompt learning.

What would settle it

The claim would be undermined if experiments on base-to-new video class prediction showed no improvement, or worse results, when the generic attribute anchors are added to standard soft prompt tuning.
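One way to make this test concrete is to compare the two settings on a single summary score. The sketch below uses the harmonic mean of base and new accuracy, a summary commonly reported in base-to-new prompt tuning benchmarks; whether the paper reports exactly this metric is an assumption, and the function names and accuracy values are placeholders, not results from the paper.

```python
def harmonic_mean(base_acc, new_acc):
    """Harmonic mean of base-class and new-class accuracy."""
    if base_acc + new_acc == 0:
        return 0.0
    return 2 * base_acc * new_acc / (base_acc + new_acc)

def anchors_help(baseline, anchored, min_gain=0.0):
    """Return True if the anchored run beats the baseline by more than min_gain.

    baseline, anchored: dicts with 'base' and 'new' accuracies in percent.
    """
    gain = harmonic_mean(anchored["base"], anchored["new"]) - harmonic_mean(
        baseline["base"], baseline["new"]
    )
    return gain > min_gain

# Placeholder numbers for illustration only.
soft_only = {"base": 80.0, "new": 60.0}
with_anchors = {"base": 80.0, "new": 70.0}
print(round(harmonic_mean(**{"base_acc": 80.0, "new_acc": 60.0}), 2))  # 68.57
print(anchors_help(soft_only, with_anchors))  # True
```

The harmonic mean penalizes a method that gains on base classes by sacrificing new ones, which is the failure mode the paper attributes to standard soft prompt tuning.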

Figures

Figures reproduced from arXiv: 2511.22125 by Bin Wang, Mingliang Gao, Runmin Cong, Ruotong Hu, Wei Zhang, Wenqian Wang, Wentong Li, Xudong Jiang.

Figure 1: Comparative analysis of video-text alignment process
Figure 2: Comparison of existing video prompt tuning architectures. (a) ViFi-CLIP inputs multiple learnable soft tokens combined with …
Figure 3: Architecture of the Generic Attribute Anchors CLIP (GA …
Figure 4: Comparison of different hard and soft prompt token cou …
Figure 5: The effect of different settings on base to novel. (a) The …
Figure 6: t-SNE visualizations for HMDB-51 and UCF-101 datasets. We perform t-SNE visualizations of the final visual embedded …
Original abstract

Visual and textual soft prompt tuning can effectively improve the adaptability of Vision-Language Models (VLMs) in downstream tasks. However, fine-tuning on video tasks impairs the model's generalization ability to unseen classes. Existing methods attempt to mitigate this forgetting effect by regularizing the gap between hand-crafted prompts and soft prompts, but this also weakens the learning ability of soft prompts. To address this challenge, we propose a plug-and-play coupling prompt learning framework to optimize the generalization performance of V-L models in video tasks, with the core motivation of mitigating semantic space narrowing during fine-tuning by introducing an externally supervised prompt. Specifically, for textual prompts, we introduce pre-trained prompts from other datasets as hard prompt tokens. These are concatenated with soft prompt tokens and coupled via a learnable mapping layer. This competitive prompting approach prevents the semantic space from overfitting to supervised categories. In addition, we introduce a set of well-designed irrelevant video sets and negative prompts as generic attribute anchors to maintain the generic relevance of the attributes in the pre-trained semantic space, thus preserving the generalization ability. Experiments on video tasks demonstrate that our method significantly outperforms state-of-the-art prompt tuning approaches across generalization benchmarks, particularly on base-to-new class prediction.
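The abstract does not say how the anchors enter the training objective. One plausible reading, offered only as an illustration, is that negative prompts are appended to the set of class prompts and irrelevant videos are trained to match them, alongside the usual contrastive classification loss. In the sketch below, the loss form, the weight `anchor_weight`, the temperature of 0.07, and all embeddings are assumptions rather than details from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cross_entropy(video_emb, text_embs, target, temperature=0.07):
    """Softmax cross-entropy over cosine-similarity logits."""
    logits = [cosine(video_emb, t) / temperature for t in text_embs]
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_z - logits[target]

# Toy text embeddings: two class prompts followed by one negative prompt.
text_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
negative_index = 2

task_video = [0.9, 0.1]         # labelled video from the target task, class 0
irrelevant_video = [-0.8, 0.1]  # video from the irrelevant (anchor) set

anchor_weight = 0.5  # hypothetical trade-off between the two terms
task_loss = cross_entropy(task_video, text_embs, target=0)
anchor_loss = cross_entropy(irrelevant_video, text_embs, target=negative_index)
total_loss = task_loss + anchor_weight * anchor_loss
print(total_loss >= 0.0)
```

Under this reading, the negative prompt occupies part of the similarity space that would otherwise be absorbed by the supervised classes, which is one way the anchors could counteract semantic narrowing.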

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes GA2-CLIP, a plug-and-play coupling prompt learning framework for adapting vision-language models to video tasks. It concatenates pre-trained prompts from other datasets with soft prompt tokens via a learnable mapping layer for textual prompts, and introduces irrelevant video sets plus negative prompts as generic attribute anchors. The core motivation is to mitigate semantic space narrowing and forgetting during fine-tuning while preserving generalization, with the central empirical claim being significant outperformance over state-of-the-art prompt tuning methods on video generalization benchmarks, especially base-to-new class prediction.

Significance. If the empirical results hold under rigorous controls, the approach offers a practical, efficient way to improve generalization in VLM prompt tuning for videos without the typical trade-off of weakened learning ability from regularization. It could advance few-shot video understanding by leveraging external anchors to maintain pre-trained semantic breadth, with potential applicability to other multimodal tasks.

major comments (2)
  1. [Abstract] Abstract: The central claim that the method 'significantly outperforms state-of-the-art prompt tuning approaches across generalization benchmarks' is stated without any quantitative results, specific metrics, baseline comparisons, ablation studies, or error bars. This makes the empirical contribution difficult to evaluate and risks resting on post-hoc choices, as the soundness assessment notes.
  2. [Abstract] Abstract (and §3 Method, per the description of anchors): The load-bearing assumption that 'irrelevant video sets and negative prompts' serve as stable generic attribute anchors that 'preserve the generalization ability' without introducing biases or overlapping in motion/object/scene statistics with target tasks is not accompanied by selection criteria, distribution analysis, or ablations. If these sets share latent attributes with evaluation videos, the competitive prompting could reinforce rather than counteract overfitting, directly undermining the base-to-new generalization claim.
minor comments (2)
  1. [Abstract] The title expands GA2-CLIP as 'Generic Attribute Anchor' but the abstract does not explicitly restate this or clarify how the learnable mapping layer parameters are initialized or regularized.
  2. [Method] Notation for the coupling of hard and soft prompts via the mapping layer could be formalized with an equation for clarity in the method description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help us improve the clarity and rigor of the presentation. We address each major comment below and will revise the manuscript to incorporate the suggested changes where appropriate.

point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that the method 'significantly outperforms state-of-the-art prompt tuning approaches across generalization benchmarks' is stated without any quantitative results, specific metrics, baseline comparisons, ablation studies, or error bars. This makes the empirical contribution difficult to evaluate and risks resting on post-hoc choices, as the soundness assessment notes.

    Authors: We agree that the abstract would be strengthened by including key quantitative results to support the central claim. In the revised version, we will add specific metrics (e.g., average accuracy gains on base-to-new class prediction across video benchmarks such as UCF101 and HMDB51), explicit comparisons to baselines including CoOp, CoCoOp, and MaPLe, and reference to standard deviations from multiple runs. This will make the empirical contribution more transparent while remaining within abstract length constraints. revision: yes

  2. Referee: [Abstract] Abstract (and §3 Method, per the description of anchors): The load-bearing assumption that 'irrelevant video sets and negative prompts' serve as stable generic attribute anchors that 'preserve the generalization ability' without introducing biases or overlapping in motion/object/scene statistics with target tasks is not accompanied by selection criteria, distribution analysis, or ablations. If these sets share latent attributes with evaluation videos, the competitive prompting could reinforce rather than counteract overfitting, directly undermining the base-to-new generalization claim.

    Authors: We acknowledge this important point about potential biases and overlaps. The manuscript describes the sets as 'well-designed' to avoid direct class overlap with target tasks, but we agree that explicit selection criteria, distribution analysis, and additional ablations would better substantiate the claim. In the revision, we will expand Section 3 and add an appendix detailing the selection process (e.g., sourcing from broad action datasets while excluding base/new classes of the target benchmarks), include qualitative distribution comparisons, and report ablations on anchor set variations. Our existing results show that removing the anchors degrades base-to-new performance, supporting their utility, but we will provide further evidence to address the risk of latent reinforcement of overfitting. revision: partial
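The exclusion step mentioned in the second response (dropping anchor candidates that coincide with base or new classes of the target benchmark) could be sketched as a simple name filter. The helper names and the class lists below are illustrative assumptions, not the authors' selection procedure.

```python
import re

def normalize(name):
    """Lowercase a class name and collapse punctuation or underscores to spaces."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()

def select_anchor_classes(candidates, target_classes):
    """Keep candidate anchor classes whose normalized name is not a target class."""
    blocked = {normalize(c) for c in target_classes}
    return [c for c in candidates if normalize(c) not in blocked]

# Illustrative lists only.
target = ["Brush_Hair", "Cartwheel", "Climb Stairs"]
candidates = ["brush hair", "Pouring Water", "cartwheel", "Folding Paper"]
kept = select_anchor_classes(candidates, target)
print(kept)  # ['Pouring Water', 'Folding Paper']
```

A filter like this only catches exact name overlap. It would not detect the shared motion, object, or scene statistics that the referee raises, which is why the distribution analysis and ablations promised in the response would still be needed.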

Circularity Check

0 steps flagged

No significant circularity; empirical method with external validation

full rationale

The paper describes an engineering contribution: a coupling prompt framework that concatenates pre-trained hard prompts with soft prompts via a learnable layer and adds irrelevant video sets plus negative prompts as anchors. No mathematical derivation, uniqueness theorem, or fitted-parameter prediction is claimed that reduces to its own inputs by construction. The central claims rest on experimental results across generalization benchmarks rather than self-referential definitions or load-bearing self-citations. The approach is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

Review performed on abstract only; full methods, assumptions, and experimental design are unavailable so ledger entries are necessarily incomplete and conservative.

free parameters (1)
  • learnable mapping layer parameters
    The coupling between hard and soft prompt tokens is performed by a learnable mapping layer whose weights are fitted during training.
axioms (2)
  • [domain assumption] Pre-trained prompts from other datasets remain semantically relevant when transferred to the target video task.
    Invoked when the paper states that these prompts are introduced as hard tokens to prevent overfitting.
  • [ad hoc to paper] Irrelevant video sets and negative prompts act as stable anchors that preserve generic attribute relevance without harming task-specific learning.
    Core motivation stated in the abstract for maintaining generalization.

pith-pipeline@v0.9.0 · 5533 in / 1427 out tokens · 32008 ms · 2026-05-17T05:13:27.506259+00:00 · methodology


