GA2-CLIP: Generic Attribute Anchor for Efficient Prompt Tuningin Video-Language Models
Pith reviewed 2026-05-17 05:13 UTC · model grok-4.3
The pith
Generic attribute anchors from pre-trained prompts and negative videos allow video prompt tuning without losing generalization to new classes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By concatenating pre-trained prompts from other datasets as hard tokens with soft prompt tokens and coupling them via a learnable mapping layer, along with introducing generic attribute anchors consisting of irrelevant video sets and negative prompts, the method prevents the semantic space from narrowing and overfitting to supervised categories during fine-tuning on video tasks.
What carries the argument
Generic attribute anchor using pre-trained hard prompts coupled to soft prompts and negative video prompts to maintain broad semantic relevance.
Load-bearing premise
That externally added pre-trained prompts and negative prompts from irrelevant videos will maintain generic relevance in the semantic space without introducing biases or impairing soft prompt learning.
What would settle it
If experiments on base-to-new video class prediction show no improvement or worse results when adding the generic attribute anchors compared to standard soft prompt tuning.
Figures
read the original abstract
Visual and textual soft prompt tuning can effectively improve the adaptability of Vision-Language Models (VLMs) in downstream tasks. However, fine-tuning on video tasks impairs the model's generalization ability to unseen classes. Existing methods attempt to mitigate this forgetting effect by regularizing the gap between hand-crafted prompts and soft prompts, but this also weakens the learning ability of soft prompts. To address this challenge, we propose a plug-and-play coupling prompt learning framework to optimize the generalization performance of V-L models in video tasks, with the core motivation of mitigating semantic space narrowing during fine-tuning by introducing an externally supervised prompt. Specifically, for textual prompts, we introduce pre-trained prompts from other datasets as hard prompt tokens. These are concatenated with soft prompt tokens and coupled via a learnable mapping layer. This competitive prompting approach prevents the semantic space from overfitting to supervised categories. In addition, we introduce a set of well-designed irrelevant video sets and negative prompts as generic attribute anchors to maintain the generic relevance of the attributes in the pre-trained semantic space, thus preserving the generalization ability. Experiments on video tasks demonstrate that our method significantly outperforms state-of-the-art prompt tuning approaches across generalization benchmarks, particularly on base-to-new class prediction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes GA2-CLIP, a plug-and-play coupling prompt learning framework for adapting vision-language models to video tasks. It concatenates pre-trained prompts from other datasets with soft prompt tokens via a learnable mapping layer for textual prompts, and introduces irrelevant video sets plus negative prompts as generic attribute anchors. The core motivation is to mitigate semantic space narrowing and forgetting during fine-tuning while preserving generalization, with the central empirical claim being significant outperformance over state-of-the-art prompt tuning methods on video generalization benchmarks, especially base-to-new class prediction.
Significance. If the empirical results hold under rigorous controls, the approach offers a practical, efficient way to improve generalization in VLM prompt tuning for videos without the typical trade-off of weakened learning ability from regularization. It could advance few-shot video understanding by leveraging external anchors to maintain pre-trained semantic breadth, with potential applicability to other multimodal tasks.
major comments (2)
- [Abstract] Abstract: The central claim that the method 'significantly outperforms state-of-the-art prompt tuning approaches across generalization benchmarks' is stated without any quantitative results, specific metrics, baseline comparisons, ablation studies, or error bars. This makes the empirical contribution difficult to evaluate and risks resting on post-hoc choices, as the soundness assessment notes.
- [Abstract] Abstract (and §3 Method, per the description of anchors): The load-bearing assumption that 'irrelevant video sets and negative prompts' serve as stable generic attribute anchors that 'preserve the generalization ability' without introducing biases or overlapping in motion/object/scene statistics with target tasks is not accompanied by selection criteria, distribution analysis, or ablations. If these sets share latent attributes with evaluation videos, the competitive prompting could reinforce rather than counteract overfitting, directly undermining the base-to-new generalization claim.
minor comments (2)
- [Abstract] The title expands GA2-CLIP as 'Generic Attribute Anchor' but the abstract does not explicitly restate this or clarify how the learnable mapping layer parameters are initialized or regularized.
- [Method] Notation for the coupling of hard and soft prompts via the mapping layer could be formalized with an equation for clarity in the method description.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help us improve the clarity and rigor of the presentation. We address each major comment below and will revise the manuscript to incorporate the suggested changes where appropriate.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that the method 'significantly outperforms state-of-the-art prompt tuning approaches across generalization benchmarks' is stated without any quantitative results, specific metrics, baseline comparisons, ablation studies, or error bars. This makes the empirical contribution difficult to evaluate and risks resting on post-hoc choices, as the soundness assessment notes.
Authors: We agree that the abstract would be strengthened by including key quantitative results to support the central claim. In the revised version, we will add specific metrics (e.g., average accuracy gains on base-to-new class prediction across video benchmarks such as UCF101 and HMDB51), explicit comparisons to baselines including CoOp, CoCoOp, and MaPLe, and reference to standard deviations from multiple runs. This will make the empirical contribution more transparent while remaining within abstract length constraints. revision: yes
-
Referee: [Abstract] Abstract (and §3 Method, per the description of anchors): The load-bearing assumption that 'irrelevant video sets and negative prompts' serve as stable generic attribute anchors that 'preserve the generalization ability' without introducing biases or overlapping in motion/object/scene statistics with target tasks is not accompanied by selection criteria, distribution analysis, or ablations. If these sets share latent attributes with evaluation videos, the competitive prompting could reinforce rather than counteract overfitting, directly undermining the base-to-new generalization claim.
Authors: We acknowledge this important point about potential biases and overlaps. The manuscript describes the sets as 'well-designed' to avoid direct class overlap with target tasks, but we agree that explicit selection criteria, distribution analysis, and additional ablations would better substantiate the claim. In the revision, we will expand Section 3 and add an appendix detailing the selection process (e.g., sourcing from broad action datasets while excluding base/new classes of the target benchmarks), include qualitative distribution comparisons, and report ablations on anchor set variations. Our existing results show that removing the anchors degrades base-to-new performance, supporting their utility, but we will provide further evidence to address the risk of latent reinforcement of overfitting. revision: partial
Circularity Check
No significant circularity; empirical method with external validation
full rationale
The paper describes an engineering contribution: a coupling prompt framework that concatenates pre-trained hard prompts with soft prompts via a learnable layer and adds irrelevant video sets plus negative prompts as anchors. No mathematical derivation, uniqueness theorem, or fitted-parameter prediction is claimed that reduces to its own inputs by construction. The central claims rest on experimental results across generalization benchmarks rather than self-referential definitions or load-bearing self-citations. The approach is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
free parameters (1)
- learnable mapping layer parameters
axioms (2)
- domain assumption Pre-trained prompts from other datasets remain semantically relevant when transferred to the target video task.
- ad hoc to paper Irrelevant video sets and negative prompts act as stable anchors that preserve generic attribute relevance without harming task-specific learning.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
we introduce pre-trained prompts from other datasets as hard prompt tokens. These are concatenated with soft prompt tokens and coupled via a learnable mapping layer... irrelevant video sets and negative prompts as generic attribute anchors
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Experiments on video tasks demonstrate that our method significantly outperforms state-of-the-art prompt tuning approaches across generalization benchmarks
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Vivit: A video vision transformer
Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lu ˇci´c, and Cordelia Schmid. Vivit: A video vision transformer. InProceedings of the IEEE/CVF inter- national conference on computer vision, pages 6836–6846, 2021
work page 2021
-
[3]
Is space-time attention all you need for video understanding? InICML, page 4, 2021
Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? InICML, page 4, 2021
work page 2021
-
[4]
Rethinking zero-shot video classification: End-to-end training for realistic applications
Biagio Brattoli, Joseph Tighe, Fedor Zhdanov, Pietro Per- ona, and Krzysztof Chalupka. Rethinking zero-shot video classification: End-to-end training for realistic applications. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 4613–4623, 2020
work page 2020
-
[5]
Elaborative rehearsal for zero-shot action recognition
Shizhe Chen and Dong Huang. Elaborative rehearsal for zero-shot action recognition. InProceedings of the IEEE/CVF international conference on computer vision, pages 13638–13647, 2021
work page 2021
-
[6]
Tree of attributes prompt learning for vision-language models.arXiv preprint arXiv:2410.11201, 2024
Tong Ding, Wanhua Li, Zhongqi Miao, and Hanspeter Pfis- ter. Tree of attributes prompt learning for vision-language models.arXiv preprint arXiv:2410.11201, 2024
-
[7]
Learning to prompt for open-vocabulary ob- ject detection with vision-language model
Yu Du, Fangyun Wei, Zihe Zhang, Miaojing Shi, Yue Gao, and Guoqi Li. Learning to prompt for open-vocabulary ob- ject detection with vision-language model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14084–14093, 2022
work page 2022
-
[8]
Multiscale vision transformers
Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichten- hofer. Multiscale vision transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 6824–6835, 2021
work page 2021
-
[9]
X3d: Expanding architectures for efficient video recognition
Christoph Feichtenhofer. X3d: Expanding architectures for efficient video recognition. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 203–213, 2020
work page 2020
-
[10]
The” something something” video database for learning and evaluating visual common sense
Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michal- ski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The” something something” video database for learning and evaluating visual common sense. InProceedings of the IEEE international conference on com- puter vision, pages 5842...
work page 2017
-
[11]
Scaling up visual and vision-language representa- tion learning with noisy text supervision
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representa- tion learning with noisy text supervision. InInternational conference on machine learning, pages 4904–4916. PMLR, 2021
work page 2021
-
[12]
Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Vi- sual prompt tuning. InEuropean Conference on Computer Vision, pages 709–727. Springer, 2022
work page 2022
-
[13]
Prompting visual-language models for efficient video understanding
Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, and Weidi Xie. Prompting visual-language models for efficient video understanding. InEuropean Conference on Computer Vi- sion, pages 105–124. Springer, 2022
work page 2022
-
[14]
The Kinetics Human Action Video Dataset
Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics hu- man action video dataset.arXiv preprint arXiv:1705.06950, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[15]
Maple: Multi-modal prompt learning
Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19113–19122, 2023
work page 2023
-
[16]
Self-regulating prompts: Foundational model adaptation without forgetting
Muhammad Uzair Khattak, Syed Talal Wasim, Muzam- mal Naseer, Salman Khan, Ming-Hsuan Yang, and Fa- had Shahbaz Khan. Self-regulating prompts: Foundational model adaptation without forgetting. InProceedings of the IEEE/CVF international conference on computer vision, pages 15190–15200, 2023
work page 2023
-
[17]
Aapl: Adding attributes to prompt learning for vision-language models
Gahyeon Kim, Sohee Kim, and Seokju Lee. Aapl: Adding attributes to prompt learning for vision-language models. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 1572–1582, 2024
work page 2024
-
[18]
Hmdb: a large video database for human motion recognition
Hildegard Kuehne, Hueihan Jhuang, Est ´ıbaliz Garrote, Tomaso Poggio, and Thomas Serre. Hmdb: a large video database for human motion recognition. In2011 Inter- national conference on computer vision, pages 2556–2563. IEEE, 2011
work page 2011
-
[19]
Dpc: Dual-prompt collaboration for tuning vision-language models
Haoyang Li, Liang Wang, Chao Wang, Jing Jiang, Yan Peng, and Guodong Long. Dpc: Dual-prompt collaboration for tuning vision-language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 25623–25632, 2025
work page 2025
-
[20]
Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, and Lei Zhang. Tokenpacker: Efficient visual projector for multimodal llm.International Journal of Computer Vision, pages 1–19, 2025
work page 2025
-
[21]
Tea: Temporal excitation and aggregation for action recognition
Yan Li, Bin Ji, Xintian Shi, Jianguo Zhang, Bin Kang, and Limin Wang. Tea: Temporal excitation and aggregation for action recognition. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 909–918, 2020
work page 2020
-
[22]
Promptkd: Unsupervised prompt distillation for vision-language models
Zheng Li, Xiang Li, Xinyi Fu, Xin Zhang, Weiqiang Wang, Shuo Chen, and Jian Yang. Promptkd: Unsupervised prompt distillation for vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26617–26626, 2024
work page 2024
-
[23]
Atprompt: Textual prompt learning with embedded attributes.arXiv preprint arXiv:2412.09442, 2024
Zheng Li, Yibing Song, Penghai Zhao, Ming-Ming Cheng, Xiang Li, and Jian Yang. Atprompt: Textual prompt learning with embedded attributes.arXiv preprint arXiv:2412.09442, 2024
-
[24]
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual rep- resentation by alignment before projection.arXiv preprint arXiv:2311.10122, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[25]
Tsm: Temporal shift module for efficient video understanding
Ji Lin, Chuang Gan, and Song Han. Tsm: Temporal shift module for efficient video understanding. InProceedings of the IEEE/CVF international conference on computer vision, pages 7083–7093, 2019
work page 2019
-
[26]
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[27]
Revisiting temporal modeling for clip-based image-to-video knowledge transferring
Ruyang Liu, Jingjia Huang, Ge Li, Jiashi Feng, Xinglong Wu, and Thomas H Li. Revisiting temporal modeling for clip-based image-to-video knowledge transferring. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6555–6564, 2023
work page 2023
-
[28]
Llava-plus: Learning to use tools for creating multi- modal agents
Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, et al. Llava-plus: Learning to use tools for creating multi- modal agents. InEuropean Conference on Computer Vision, pages 126–142. Springer, 2024
work page 2024
-
[29]
Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video swin transformer. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3202–3211, 2022
work page 2022
-
[30]
Visualizing data using t-sne.Journal of machine learning research, 9 (Nov):2579–2605, 2008
Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne.Journal of machine learning research, 9 (Nov):2579–2605, 2008
work page 2008
-
[31]
Daniel Neimark, Omri Bar, Maya Zohar, and Dotan As- selmann. Video transformer network. InProceedings of the IEEE/CVF international conference on computer vision, pages 3163–3172, 2021
work page 2021
-
[32]
Expanding language-image pretrained models for gen- eral video recognition
Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, and Haibin Ling. Expanding language-image pretrained models for gen- eral video recognition. InEuropean Conference on Com- puter Vision, pages 1–18. Springer, 2022
work page 2022
-
[33]
Junting Pan, Ziyi Lin, Xiatian Zhu, Jing Shao, and Hong- sheng Li. St-adapter: Parameter-efficient image-to-video transfer learning.Advances in Neural Information Process- ing Systems, 35:26462–26477, 2022
work page 2022
-
[34]
Zero-shot action recogni- tion with error-correcting output codes
Jie Qin, Li Liu, Ling Shao, Fumin Shen, Bingbing Ni, Ji- axin Chen, and Yunhong Wang. Zero-shot action recogni- tion with error-correcting output codes. InProceedings of the IEEE conference on computer vision and pattern recog- nition, pages 2833–2842, 2017
work page 2017
-
[35]
Disentangling spatial and temporal learning for efficient image-to-video transfer learning
Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Yingya Zhang, Changxin Gao, Deli Zhao, and Nong Sang. Disentangling spatial and temporal learning for efficient image-to-video transfer learning. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, pages 13934–13944, 2023
work page 2023
-
[36]
Learning spatio- temporal representation with pseudo-3d residual networks
Zhaofan Qiu, Ting Yao, and Tao Mei. Learning spatio- temporal representation with pseudo-3d residual networks. Inproceedings of the IEEE International Conference on Computer Vision, pages 5533–5541, 2017
work page 2017
-
[37]
Learning transferable visual models from natural language supervi- sion
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021
work page 2021
-
[38]
Learning transferable visual models from natural language supervi- sion
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021
work page 2021
-
[39]
Self-supervised video transformer
Kanchana Ranasinghe, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, and Michael S Ryoo. Self-supervised video transformer. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 2874–2884, 2022
work page 2022
-
[40]
Denseclip: Language-guided dense prediction with context- aware prompting
Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. Denseclip: Language-guided dense prediction with context- aware prompting. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 18082–18091, 2022
work page 2022
-
[41]
Fine-tuned clip models are efficient video learners
Hanoona Rasheed, Muhammad Uzair Khattak, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Fine-tuned clip models are efficient video learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6545–6554, 2023
work page 2023
-
[42]
Margin maximiz- ing loss functions.Advances in neural information process- ing systems, 16, 2003
Saharon Rosset, Ji Zhu, and Trevor Hastie. Margin maximiz- ing loss functions.Advances in neural information process- ing systems, 16, 2003
work page 2003
-
[43]
A closer look at the few-shot adaptation of large vision-language models
Julio Silva-Rodriguez, Sina Hajimiri, Ismail Ben Ayed, and Jose Dolz. A closer look at the few-shot adaptation of large vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23681–23690, 2024
work page 2024
-
[44]
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild.arXiv preprint arXiv:1212.0402, 2012
work page internal anchor Pith review Pith/arXiv arXiv 2012
-
[45]
Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. Lst: Lad- der side-tuning for parameter and memory efficient transfer learning.Advances in Neural Information Processing Sys- tems, 35:12991–13005, 2022
work page 2022
-
[46]
Argue: Attribute-guided prompt tuning for vision-language models
Xinyu Tian, Shu Zou, Zhaoyuan Yang, and Jing Zhang. Argue: Attribute-guided prompt tuning for vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28578– 28587, 2024
work page 2024
-
[47]
A closer look at spatiotemporal convolutions for action recognition
Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. InProceedings of the IEEE conference on Computer Vision and Pattern Recogni- tion, pages 6450–6459, 2018
work page 2018
-
[48]
Tds-clip: Temporal dif- ference side network for image-to-video transfer learning
Bin Wang and Wenqian Wang. Tds-clip: Temporal dif- ference side network for image-to-video transfer learning. arXiv preprint arXiv:2408.10688, 2024
-
[49]
Hao Wang, Fang Liu, Licheng Jiao, Jiahao Wang, Ze- hua Hao, Shuo Li, Lingling Li, Puhua Chen, and Xu Liu. Vilt-clip: Video and language tuning clip with multimodal prompt learning and scenario-guided optimization. InPro- ceedings of the AAAI Conference on Artificial Intelligence, pages 5390–5400, 2024
work page 2024
-
[50]
Tdn: Temporal difference networks for efficient action recog- nition
Limin Wang, Zhan Tong, Bin Ji, and Gangshan Wu. Tdn: Temporal difference networks for efficient action recog- nition. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1895–1904, 2021
work page 1904
-
[51]
Mengmeng Wang, Jiazheng Xing, Jianbiao Mei, Yong Liu, and Yunliang Jiang. Actionclip: Adapting language-image pretrained models for video action recognition.IEEE Trans- actions on Neural Networks and Learning Systems, 2023
work page 2023
-
[52]
A multimodal, multi-task adapting frame- work for video action recognition
Mengmeng Wang, Jiazheng Xing, Boyuan Jiang, Jun Chen, Jianbiao Mei, Xingxing Zuo, Guang Dai, Jingdong Wang, and Yong Liu. A multimodal, multi-task adapting frame- work for video action recognition. InProceedings of the AAAI Conference on Artificial Intelligence, pages 5517– 5525, 2024
work page 2024
-
[53]
Vita-clip: Video and text adaptive clip via multimodal prompting
Syed Talal Wasim, Muzammal Naseer, Salman Khan, Fa- had Shahbaz Khan, and Mubarak Shah. Vita-clip: Video and text adaptive clip via multimodal prompting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 23034–23044, 2023
work page 2023
-
[54]
Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, and Wanli Ouyang. Bidirectional cross- modal knowledge exploration for video recognition with pre-trained vision-language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6620–6630, 2023
work page 2023
-
[55]
Textrefiner: Internal visual feature as efficient refiner for vision-language models prompt tuning
Jingjing Xie, Yuxin Zhang, Jun Peng, Zhaohong Huang, and Liujuan Cao. Textrefiner: Internal visual feature as efficient refiner for vision-language models prompt tuning. InPro- ceedings of the AAAI Conference on Artificial Intelligence, pages 8718–8726, 2025
work page 2025
-
[56]
Pointllm: Empowering large language models to understand point clouds
Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiang- miao Pang, and Dahua Lin. Pointllm: Empowering large language models to understand point clouds. InEuropean Conference on Computer Vision, pages 131–147. Springer, 2024
work page 2024
-
[57]
Taojiannan Yang, Yi Zhu, Yusheng Xie, Aston Zhang, Chen Chen, and Mu Li. Aim: Adapting image models for efficient video action recognition.arXiv preprint arXiv:2302.03024, 2023
-
[58]
Florence: A New Foundation Model for Computer Vision
Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision.arXiv preprint arXiv:2111.11432, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[59]
Lit: Zero-shot transfer with locked-image text tuning
Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 18123–18133, 2022
work page 2022
-
[60]
Side-tuning: a baseline for net- work adaptation via additive side networks
Jeffrey O Zhang, Alexander Sax, Amir Zamir, Leonidas Guibas, and Jitendra Malik. Side-tuning: a baseline for net- work adaptation via additive side networks. InComputer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, pages 698–714. Springer, 2020
work page 2020
-
[61]
Renrui Zhang, Rongyao Fang, Wei Zhang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-adapter: Training-free clip-adapter for better vision- language modeling.arXiv preprint arXiv:2111.03930, 2021
-
[62]
Conditional prompt learning for vision-language mod- els
Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language mod- els. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 16816–16825, 2022
work page 2022
-
[63]
Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models.In- ternational Journal of Computer Vision, 130(9):2337–2348, 2022
work page 2022
-
[64]
Prompt-aligned gradient for prompt tuning
Beier Zhu, Yulei Niu, Yucheng Han, Yue Wu, and Hanwang Zhang. Prompt-aligned gradient for prompt tuning. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 15659–15669, 2023
work page 2023
-
[65]
Towards universal representation for unseen action recognition
Yi Zhu, Yang Long, Yu Guan, Shawn Newsam, and Ling Shao. Towards universal representation for unseen action recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 9436–9445, 2018
work page 2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.