arxiv: 2511.22125 · v2 · submitted 2025-11-27 · 💻 cs.CV

GA2-CLIP: Generic Attribute Anchor for Efficient Prompt Tuningin Video-Language Models

Bin Wang , Ruotong Hu , Wentong Li , Wenqian Wang , Mingliang Gao , Runmin Cong , Wei Zhang , Xudong Jiang This is my paper

Pith reviewed 2026-05-17 05:13 UTC · model grok-4.3

classification 💻 cs.CV

keywords prompt tuningvideo-language modelsgeneralizationvision-language modelsgeneric anchorsbase-to-new predictionCLIP

0 comments

The pith

Generic attribute anchors from pre-trained prompts and negative videos allow video prompt tuning without losing generalization to new classes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard soft prompt tuning on video tasks causes the model to overfit to training classes and lose ability on new ones. To fix this without weakening the prompts, they add hard prompt tokens from pre-trained models on other data, link them to soft tokens with a mapping layer, and use irrelevant videos plus negative prompts as anchors. This keeps the semantic space wide. A reader would care because it allows adapting VLMs to videos while keeping them useful for everything else.

Core claim

By concatenating pre-trained prompts from other datasets as hard tokens with soft prompt tokens and coupling them via a learnable mapping layer, along with introducing generic attribute anchors consisting of irrelevant video sets and negative prompts, the method prevents the semantic space from narrowing and overfitting to supervised categories during fine-tuning on video tasks.

What carries the argument

Generic attribute anchor using pre-trained hard prompts coupled to soft prompts and negative video prompts to maintain broad semantic relevance.

Load-bearing premise

That externally added pre-trained prompts and negative prompts from irrelevant videos will maintain generic relevance in the semantic space without introducing biases or impairing soft prompt learning.

What would settle it

If experiments on base-to-new video class prediction show no improvement or worse results when adding the generic attribute anchors compared to standard soft prompt tuning.

Figures

Figures reproduced from arXiv: 2511.22125 by Bin Wang, Mingliang Gao, Runmin Cong, Ruotong Hu, Wei Zhang, Wenqian Wang, Wentong Li, Xudong Jiang.

**Figure 2.** Figure 2: Comparison of existing video prompt tuning architectures. (a) ViFi-CLIP inputs multiple learnable soft tokens combined with [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Architecture of the Generic Attribute Anchors CLIP (GA [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Comparison of different hard and soft prompt token cou [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: The effect of different settings on base to novel. (a) The [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 6.** Figure 6: t-SNE visualizations for HMDB-51 and UCF-101 datasets. We perform t-SNE visualizations of the final visual embedded [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗

read the original abstract

Visual and textual soft prompt tuning can effectively improve the adaptability of Vision-Language Models (VLMs) in downstream tasks. However, fine-tuning on video tasks impairs the model's generalization ability to unseen classes. Existing methods attempt to mitigate this forgetting effect by regularizing the gap between hand-crafted prompts and soft prompts, but this also weakens the learning ability of soft prompts. To address this challenge, we propose a plug-and-play coupling prompt learning framework to optimize the generalization performance of V-L models in video tasks, with the core motivation of mitigating semantic space narrowing during fine-tuning by introducing an externally supervised prompt. Specifically, for textual prompts, we introduce pre-trained prompts from other datasets as hard prompt tokens. These are concatenated with soft prompt tokens and coupled via a learnable mapping layer. This competitive prompting approach prevents the semantic space from overfitting to supervised categories. In addition, we introduce a set of well-designed irrelevant video sets and negative prompts as generic attribute anchors to maintain the generic relevance of the attributes in the pre-trained semantic space, thus preserving the generalization ability. Experiments on video tasks demonstrate that our method significantly outperforms state-of-the-art prompt tuning approaches across generalization benchmarks, particularly on base-to-new class prediction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GA2-CLIP couples hard prompts from other datasets with soft prompts via a learnable layer and adds irrelevant-video anchors to limit semantic narrowing in video prompt tuning.

read the letter

The paper's main move is to stop soft prompt tuning on video data from collapsing the model's semantic space around the training classes. They do this by pulling in pre-trained hard prompt tokens from other datasets, concatenating them with the learnable soft tokens, and tying the two together with a small mapping layer. On top of that they add a set of irrelevant videos paired with negative prompts that act as generic attribute anchors during training. The claim is that this competitive setup keeps the model from overfitting while still letting the soft prompts learn the target task, which shows up as better base-to-new generalization on video benchmarks. That framing of external hard prompts plus explicit anchors is not exactly how prior prompt-tuning papers handled the forgetting problem, so the combination counts as new engineering. The approach is also presented as plug-and-play, which is a practical plus if the gains turn out to be stable across datasets. The soft spots are straightforward. The abstract states clear outperformance but supplies no numbers, no ablation tables, no baseline details, and no description of how the irrelevant video sets were actually selected or verified to be non-overlapping. Without those controls the central result is hard to judge. The stress-test worry about anchor overlap is also live: if the chosen irrelevant clips share motion statistics or object categories with the evaluation videos, the anchors could reinforce rather than counteract narrowing. The paper does not appear to contain formal proofs or released code, so reproducibility will depend on the experimental section once it is examined. This is the kind of incremental adaptation paper that researchers who actually fine-tune VLMs on video data would want to read and test. It is not foundational, but the motivation is honest and the method is simple enough to implement quickly. I would send it to peer review so the experiments and anchor choices can be checked properly.

Referee Report

2 major / 2 minor

Summary. The paper proposes GA2-CLIP, a plug-and-play coupling prompt learning framework for adapting vision-language models to video tasks. It concatenates pre-trained prompts from other datasets with soft prompt tokens via a learnable mapping layer for textual prompts, and introduces irrelevant video sets plus negative prompts as generic attribute anchors. The core motivation is to mitigate semantic space narrowing and forgetting during fine-tuning while preserving generalization, with the central empirical claim being significant outperformance over state-of-the-art prompt tuning methods on video generalization benchmarks, especially base-to-new class prediction.

Significance. If the empirical results hold under rigorous controls, the approach offers a practical, efficient way to improve generalization in VLM prompt tuning for videos without the typical trade-off of weakened learning ability from regularization. It could advance few-shot video understanding by leveraging external anchors to maintain pre-trained semantic breadth, with potential applicability to other multimodal tasks.

major comments (2)

[Abstract] Abstract: The central claim that the method 'significantly outperforms state-of-the-art prompt tuning approaches across generalization benchmarks' is stated without any quantitative results, specific metrics, baseline comparisons, ablation studies, or error bars. This makes the empirical contribution difficult to evaluate and risks resting on post-hoc choices, as the soundness assessment notes.
[Abstract] Abstract (and §3 Method, per the description of anchors): The load-bearing assumption that 'irrelevant video sets and negative prompts' serve as stable generic attribute anchors that 'preserve the generalization ability' without introducing biases or overlapping in motion/object/scene statistics with target tasks is not accompanied by selection criteria, distribution analysis, or ablations. If these sets share latent attributes with evaluation videos, the competitive prompting could reinforce rather than counteract overfitting, directly undermining the base-to-new generalization claim.

minor comments (2)

[Abstract] The title expands GA2-CLIP as 'Generic Attribute Anchor' but the abstract does not explicitly restate this or clarify how the learnable mapping layer parameters are initialized or regularized.
[Method] Notation for the coupling of hard and soft prompts via the mapping layer could be formalized with an equation for clarity in the method description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help us improve the clarity and rigor of the presentation. We address each major comment below and will revise the manuscript to incorporate the suggested changes where appropriate.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that the method 'significantly outperforms state-of-the-art prompt tuning approaches across generalization benchmarks' is stated without any quantitative results, specific metrics, baseline comparisons, ablation studies, or error bars. This makes the empirical contribution difficult to evaluate and risks resting on post-hoc choices, as the soundness assessment notes.

Authors: We agree that the abstract would be strengthened by including key quantitative results to support the central claim. In the revised version, we will add specific metrics (e.g., average accuracy gains on base-to-new class prediction across video benchmarks such as UCF101 and HMDB51), explicit comparisons to baselines including CoOp, CoCoOp, and MaPLe, and reference to standard deviations from multiple runs. This will make the empirical contribution more transparent while remaining within abstract length constraints. revision: yes
Referee: [Abstract] Abstract (and §3 Method, per the description of anchors): The load-bearing assumption that 'irrelevant video sets and negative prompts' serve as stable generic attribute anchors that 'preserve the generalization ability' without introducing biases or overlapping in motion/object/scene statistics with target tasks is not accompanied by selection criteria, distribution analysis, or ablations. If these sets share latent attributes with evaluation videos, the competitive prompting could reinforce rather than counteract overfitting, directly undermining the base-to-new generalization claim.

Authors: We acknowledge this important point about potential biases and overlaps. The manuscript describes the sets as 'well-designed' to avoid direct class overlap with target tasks, but we agree that explicit selection criteria, distribution analysis, and additional ablations would better substantiate the claim. In the revision, we will expand Section 3 and add an appendix detailing the selection process (e.g., sourcing from broad action datasets while excluding base/new classes of the target benchmarks), include qualitative distribution comparisons, and report ablations on anchor set variations. Our existing results show that removing the anchors degrades base-to-new performance, supporting their utility, but we will provide further evidence to address the risk of latent reinforcement of overfitting. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical method with external validation

full rationale

The paper describes an engineering contribution: a coupling prompt framework that concatenates pre-trained hard prompts with soft prompts via a learnable layer and adds irrelevant video sets plus negative prompts as anchors. No mathematical derivation, uniqueness theorem, or fitted-parameter prediction is claimed that reduces to its own inputs by construction. The central claims rest on experimental results across generalization benchmarks rather than self-referential definitions or load-bearing self-citations. The approach is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

Review performed on abstract only; full methods, assumptions, and experimental design are unavailable so ledger entries are necessarily incomplete and conservative.

free parameters (1)

learnable mapping layer parameters
The coupling between hard and soft prompt tokens is performed by a learnable mapping layer whose weights are fitted during training.

axioms (2)

domain assumption Pre-trained prompts from other datasets remain semantically relevant when transferred to the target video task.
Invoked when the paper states that these prompts are introduced as hard tokens to prevent overfitting.
ad hoc to paper Irrelevant video sets and negative prompts act as stable anchors that preserve generic attribute relevance without harming task-specific learning.
Core motivation stated in the abstract for maintaining generalization.

pith-pipeline@v0.9.0 · 5533 in / 1427 out tokens · 32008 ms · 2026-05-17T05:13:27.506259+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we introduce pre-trained prompts from other datasets as hard prompt tokens. These are concatenated with soft prompt tokens and coupled via a learnable mapping layer... irrelevant video sets and negative prompts as generic attribute anchors
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Experiments on video tasks demonstrate that our method significantly outperforms state-of-the-art prompt tuning approaches across generalization benchmarks

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 6 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Vivit: A video vision transformer

Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lu ˇci´c, and Cordelia Schmid. Vivit: A video vision transformer. InProceedings of the IEEE/CVF inter- national conference on computer vision, pages 6836–6846, 2021

work page 2021
[3]

Is space-time attention all you need for video understanding? InICML, page 4, 2021

Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? InICML, page 4, 2021

work page 2021
[4]

Rethinking zero-shot video classification: End-to-end training for realistic applications

Biagio Brattoli, Joseph Tighe, Fedor Zhdanov, Pietro Per- ona, and Krzysztof Chalupka. Rethinking zero-shot video classification: End-to-end training for realistic applications. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 4613–4623, 2020

work page 2020
[5]

Elaborative rehearsal for zero-shot action recognition

Shizhe Chen and Dong Huang. Elaborative rehearsal for zero-shot action recognition. InProceedings of the IEEE/CVF international conference on computer vision, pages 13638–13647, 2021

work page 2021
[6]

Tree of attributes prompt learning for vision-language models.arXiv preprint arXiv:2410.11201, 2024

Tong Ding, Wanhua Li, Zhongqi Miao, and Hanspeter Pfis- ter. Tree of attributes prompt learning for vision-language models.arXiv preprint arXiv:2410.11201, 2024

work page arXiv 2024
[7]

Learning to prompt for open-vocabulary ob- ject detection with vision-language model

Yu Du, Fangyun Wei, Zihe Zhang, Miaojing Shi, Yue Gao, and Guoqi Li. Learning to prompt for open-vocabulary ob- ject detection with vision-language model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14084–14093, 2022

work page 2022
[8]

Multiscale vision transformers

Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichten- hofer. Multiscale vision transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 6824–6835, 2021

work page 2021
[9]

X3d: Expanding architectures for efficient video recognition

Christoph Feichtenhofer. X3d: Expanding architectures for efficient video recognition. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 203–213, 2020

work page 2020
[10]

The” something something” video database for learning and evaluating visual common sense

Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michal- ski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The” something something” video database for learning and evaluating visual common sense. InProceedings of the IEEE international conference on com- puter vision, pages 5842...

work page 2017
[11]

Scaling up visual and vision-language representa- tion learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representa- tion learning with noisy text supervision. InInternational conference on machine learning, pages 4904–4916. PMLR, 2021

work page 2021
[12]

Vi- sual prompt tuning

Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Vi- sual prompt tuning. InEuropean Conference on Computer Vision, pages 709–727. Springer, 2022

work page 2022
[13]

Prompting visual-language models for efficient video understanding

Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, and Weidi Xie. Prompting visual-language models for efficient video understanding. InEuropean Conference on Computer Vi- sion, pages 105–124. Springer, 2022

work page 2022
[14]

The Kinetics Human Action Video Dataset

Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics hu- man action video dataset.arXiv preprint arXiv:1705.06950, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[15]

Maple: Multi-modal prompt learning

Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19113–19122, 2023

work page 2023
[16]

Self-regulating prompts: Foundational model adaptation without forgetting

Muhammad Uzair Khattak, Syed Talal Wasim, Muzam- mal Naseer, Salman Khan, Ming-Hsuan Yang, and Fa- had Shahbaz Khan. Self-regulating prompts: Foundational model adaptation without forgetting. InProceedings of the IEEE/CVF international conference on computer vision, pages 15190–15200, 2023

work page 2023
[17]

Aapl: Adding attributes to prompt learning for vision-language models

Gahyeon Kim, Sohee Kim, and Seokju Lee. Aapl: Adding attributes to prompt learning for vision-language models. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 1572–1582, 2024

work page 2024
[18]

Hmdb: a large video database for human motion recognition

Hildegard Kuehne, Hueihan Jhuang, Est ´ıbaliz Garrote, Tomaso Poggio, and Thomas Serre. Hmdb: a large video database for human motion recognition. In2011 Inter- national conference on computer vision, pages 2556–2563. IEEE, 2011

work page 2011
[19]

Dpc: Dual-prompt collaboration for tuning vision-language models

Haoyang Li, Liang Wang, Chao Wang, Jing Jiang, Yan Peng, and Guodong Long. Dpc: Dual-prompt collaboration for tuning vision-language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 25623–25632, 2025

work page 2025
[20]

Tokenpacker: Efficient visual projector for multimodal llm.International Journal of Computer Vision, pages 1–19, 2025

Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, and Lei Zhang. Tokenpacker: Efficient visual projector for multimodal llm.International Journal of Computer Vision, pages 1–19, 2025

work page 2025
[21]

Tea: Temporal excitation and aggregation for action recognition

Yan Li, Bin Ji, Xintian Shi, Jianguo Zhang, Bin Kang, and Limin Wang. Tea: Temporal excitation and aggregation for action recognition. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 909–918, 2020

work page 2020
[22]

Promptkd: Unsupervised prompt distillation for vision-language models

Zheng Li, Xiang Li, Xinyi Fu, Xin Zhang, Weiqiang Wang, Shuo Chen, and Jian Yang. Promptkd: Unsupervised prompt distillation for vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26617–26626, 2024

work page 2024
[23]

Atprompt: Textual prompt learning with embedded attributes.arXiv preprint arXiv:2412.09442, 2024

Zheng Li, Yibing Song, Penghai Zhao, Ming-Ming Cheng, Xiang Li, and Jian Yang. Atprompt: Textual prompt learning with embedded attributes.arXiv preprint arXiv:2412.09442, 2024

work page arXiv 2024
[24]

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual rep- resentation by alignment before projection.arXiv preprint arXiv:2311.10122, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

Tsm: Temporal shift module for efficient video understanding

Ji Lin, Chuang Gan, and Song Han. Tsm: Temporal shift module for efficient video understanding. InProceedings of the IEEE/CVF international conference on computer vision, pages 7083–7093, 2019

work page 2019
[26]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Revisiting temporal modeling for clip-based image-to-video knowledge transferring

Ruyang Liu, Jingjia Huang, Ge Li, Jiashi Feng, Xinglong Wu, and Thomas H Li. Revisiting temporal modeling for clip-based image-to-video knowledge transferring. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6555–6564, 2023

work page 2023
[28]

Llava-plus: Learning to use tools for creating multi- modal agents

Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, et al. Llava-plus: Learning to use tools for creating multi- modal agents. InEuropean Conference on Computer Vision, pages 126–142. Springer, 2024

work page 2024
[29]

Video swin transformer

Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video swin transformer. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3202–3211, 2022

work page 2022
[30]

Visualizing data using t-sne.Journal of machine learning research, 9 (Nov):2579–2605, 2008

Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne.Journal of machine learning research, 9 (Nov):2579–2605, 2008

work page 2008
[31]

Video transformer network

Daniel Neimark, Omri Bar, Maya Zohar, and Dotan As- selmann. Video transformer network. InProceedings of the IEEE/CVF international conference on computer vision, pages 3163–3172, 2021

work page 2021
[32]

Expanding language-image pretrained models for gen- eral video recognition

Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, and Haibin Ling. Expanding language-image pretrained models for gen- eral video recognition. InEuropean Conference on Com- puter Vision, pages 1–18. Springer, 2022

work page 2022
[33]

St-adapter: Parameter-efficient image-to-video transfer learning.Advances in Neural Information Process- ing Systems, 35:26462–26477, 2022

Junting Pan, Ziyi Lin, Xiatian Zhu, Jing Shao, and Hong- sheng Li. St-adapter: Parameter-efficient image-to-video transfer learning.Advances in Neural Information Process- ing Systems, 35:26462–26477, 2022

work page 2022
[34]

Zero-shot action recogni- tion with error-correcting output codes

Jie Qin, Li Liu, Ling Shao, Fumin Shen, Bingbing Ni, Ji- axin Chen, and Yunhong Wang. Zero-shot action recogni- tion with error-correcting output codes. InProceedings of the IEEE conference on computer vision and pattern recog- nition, pages 2833–2842, 2017

work page 2017
[35]

Disentangling spatial and temporal learning for efficient image-to-video transfer learning

Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Yingya Zhang, Changxin Gao, Deli Zhao, and Nong Sang. Disentangling spatial and temporal learning for efficient image-to-video transfer learning. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, pages 13934–13944, 2023

work page 2023
[36]

Learning spatio- temporal representation with pseudo-3d residual networks

Zhaofan Qiu, Ting Yao, and Tao Mei. Learning spatio- temporal representation with pseudo-3d residual networks. Inproceedings of the IEEE International Conference on Computer Vision, pages 5533–5541, 2017

work page 2017
[37]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021

work page 2021
[38]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

work page 2021
[39]

Self-supervised video transformer

Kanchana Ranasinghe, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, and Michael S Ryoo. Self-supervised video transformer. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 2874–2884, 2022

work page 2022
[40]

Denseclip: Language-guided dense prediction with context- aware prompting

Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. Denseclip: Language-guided dense prediction with context- aware prompting. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 18082–18091, 2022

work page 2022
[41]

Fine-tuned clip models are efficient video learners

Hanoona Rasheed, Muhammad Uzair Khattak, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Fine-tuned clip models are efficient video learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6545–6554, 2023

work page 2023
[42]

Margin maximiz- ing loss functions.Advances in neural information process- ing systems, 16, 2003

Saharon Rosset, Ji Zhu, and Trevor Hastie. Margin maximiz- ing loss functions.Advances in neural information process- ing systems, 16, 2003

work page 2003
[43]

A closer look at the few-shot adaptation of large vision-language models

Julio Silva-Rodriguez, Sina Hajimiri, Ismail Ben Ayed, and Jose Dolz. A closer look at the few-shot adaptation of large vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23681–23690, 2024

work page 2024
[44]

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild.arXiv preprint arXiv:1212.0402, 2012

work page internal anchor Pith review Pith/arXiv arXiv 2012
[45]

Lst: Lad- der side-tuning for parameter and memory efficient transfer learning.Advances in Neural Information Processing Sys- tems, 35:12991–13005, 2022

Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. Lst: Lad- der side-tuning for parameter and memory efficient transfer learning.Advances in Neural Information Processing Sys- tems, 35:12991–13005, 2022

work page 2022
[46]

Argue: Attribute-guided prompt tuning for vision-language models

Xinyu Tian, Shu Zou, Zhaoyuan Yang, and Jing Zhang. Argue: Attribute-guided prompt tuning for vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28578– 28587, 2024

work page 2024
[47]

A closer look at spatiotemporal convolutions for action recognition

Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. InProceedings of the IEEE conference on Computer Vision and Pattern Recogni- tion, pages 6450–6459, 2018

work page 2018
[48]

Tds-clip: Temporal dif- ference side network for image-to-video transfer learning

Bin Wang and Wenqian Wang. Tds-clip: Temporal dif- ference side network for image-to-video transfer learning. arXiv preprint arXiv:2408.10688, 2024

work page arXiv 2024
[49]

Vilt-clip: Video and language tuning clip with multimodal prompt learning and scenario-guided optimization

Hao Wang, Fang Liu, Licheng Jiao, Jiahao Wang, Ze- hua Hao, Shuo Li, Lingling Li, Puhua Chen, and Xu Liu. Vilt-clip: Video and language tuning clip with multimodal prompt learning and scenario-guided optimization. InPro- ceedings of the AAAI Conference on Artificial Intelligence, pages 5390–5400, 2024

work page 2024
[50]

Tdn: Temporal difference networks for efficient action recog- nition

Limin Wang, Zhan Tong, Bin Ji, and Gangshan Wu. Tdn: Temporal difference networks for efficient action recog- nition. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1895–1904, 2021

work page 1904
[51]

Actionclip: Adapting language-image pretrained models for video action recognition.IEEE Trans- actions on Neural Networks and Learning Systems, 2023

Mengmeng Wang, Jiazheng Xing, Jianbiao Mei, Yong Liu, and Yunliang Jiang. Actionclip: Adapting language-image pretrained models for video action recognition.IEEE Trans- actions on Neural Networks and Learning Systems, 2023

work page 2023
[52]

A multimodal, multi-task adapting frame- work for video action recognition

Mengmeng Wang, Jiazheng Xing, Boyuan Jiang, Jun Chen, Jianbiao Mei, Xingxing Zuo, Guang Dai, Jingdong Wang, and Yong Liu. A multimodal, multi-task adapting frame- work for video action recognition. InProceedings of the AAAI Conference on Artificial Intelligence, pages 5517– 5525, 2024

work page 2024
[53]

Vita-clip: Video and text adaptive clip via multimodal prompting

Syed Talal Wasim, Muzammal Naseer, Salman Khan, Fa- had Shahbaz Khan, and Mubarak Shah. Vita-clip: Video and text adaptive clip via multimodal prompting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 23034–23044, 2023

work page 2023
[54]

Bidirectional cross- modal knowledge exploration for video recognition with pre-trained vision-language models

Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, and Wanli Ouyang. Bidirectional cross- modal knowledge exploration for video recognition with pre-trained vision-language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6620–6630, 2023

work page 2023
[55]

Textrefiner: Internal visual feature as efficient refiner for vision-language models prompt tuning

Jingjing Xie, Yuxin Zhang, Jun Peng, Zhaohong Huang, and Liujuan Cao. Textrefiner: Internal visual feature as efficient refiner for vision-language models prompt tuning. InPro- ceedings of the AAAI Conference on Artificial Intelligence, pages 8718–8726, 2025

work page 2025
[56]

Pointllm: Empowering large language models to understand point clouds

Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiang- miao Pang, and Dahua Lin. Pointllm: Empowering large language models to understand point clouds. InEuropean Conference on Computer Vision, pages 131–147. Springer, 2024

work page 2024
[57]

Aim: Adapting image models for efficient video action recognition.arXiv preprint arXiv:2302.03024, 2023

Taojiannan Yang, Yi Zhu, Yusheng Xie, Aston Zhang, Chen Chen, and Mu Li. Aim: Adapting image models for efficient video action recognition.arXiv preprint arXiv:2302.03024, 2023

work page arXiv 2023
[58]

Florence: A New Foundation Model for Computer Vision

Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision.arXiv preprint arXiv:2111.11432, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[59]

Lit: Zero-shot transfer with locked-image text tuning

Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 18123–18133, 2022

work page 2022
[60]

Side-tuning: a baseline for net- work adaptation via additive side networks

Jeffrey O Zhang, Alexander Sax, Amir Zamir, Leonidas Guibas, and Jitendra Malik. Side-tuning: a baseline for net- work adaptation via additive side networks. InComputer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, pages 698–714. Springer, 2020

work page 2020
[61]

Tip-adapter: Training-free clip-adapter for better vision- language modeling.arXiv preprint arXiv:2111.03930, 2021

Renrui Zhang, Rongyao Fang, Wei Zhang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-adapter: Training-free clip-adapter for better vision- language modeling.arXiv preprint arXiv:2111.03930, 2021

work page arXiv 2021
[62]

Conditional prompt learning for vision-language mod- els

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language mod- els. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 16816–16825, 2022

work page 2022
[63]

Learning to prompt for vision-language models.In- ternational Journal of Computer Vision, 130(9):2337–2348, 2022

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models.In- ternational Journal of Computer Vision, 130(9):2337–2348, 2022

work page 2022
[64]

Prompt-aligned gradient for prompt tuning

Beier Zhu, Yulei Niu, Yucheng Han, Yue Wu, and Hanwang Zhang. Prompt-aligned gradient for prompt tuning. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 15659–15669, 2023

work page 2023
[65]

Towards universal representation for unseen action recognition

Yi Zhu, Yang Long, Yu Guan, Shawn Newsam, and Ling Shao. Towards universal representation for unseen action recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 9436–9445, 2018

work page 2018