Motion-Guided Semantic Alignment with Negative Prompts for Zero-Shot Video Action Recognition
Pith reviewed 2026-05-10 06:37 UTC · model grok-4.3
The pith
Separating motion features from static content and aligning videos with negative text prompts lets CLIP recognize actions never seen in training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present a novel framework that enhances CLIP with disentangled embeddings and semantic-guided interaction. A Motion Separation Module (MSM) separates motion-sensitive and global-static features, while a Motion Aggregation Block (MAB) employs gated cross-attention to refine motion representation without re-coupling redundant information. To facilitate generalization to unseen categories, we enforce semantic alignment between video features and textual representations by aligning projected embeddings with positive textual prompts, while leveraging negative prompts to explicitly model non-class semantics.
What carries the argument
The Motion Separation Module (MSM) that isolates motion-sensitive features, the Motion Aggregation Block (MAB) that performs gated cross-attention on motion, and the dual use of positive and negative textual prompts to enforce semantic alignment.
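The abstract names these modules without their internals. The sketch below is one plausible reading, not the authors' implementation: it assumes the MSM obtains motion-sensitive features from temporal differences and global-static features from a temporal mean, and that the MAB's gate is a learned sigmoid over the cross-attended context. Every layer, dimension, and name beyond "MSM" and "MAB" is an assumption.

```python
import torch
import torch.nn as nn

class MotionSeparationModule(nn.Module):
    """Hypothetical MSM: splits frame-level CLIP features into a
    motion-sensitive stream (temporal differences) and a global-static
    stream (temporal mean). The paper names the module; internals assumed."""
    def __init__(self, dim: int):
        super().__init__()
        self.motion_proj = nn.Linear(dim, dim)
        self.static_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor):
        # x: (B, T, D) per-frame features from a frozen CLIP visual encoder.
        diff = x[:, 1:] - x[:, :-1]                    # motion cue
        diff = torch.cat([diff, diff[:, -1:]], dim=1)  # pad back to T frames
        motion = self.motion_proj(diff)
        static = self.static_proj(x.mean(dim=1, keepdim=True))
        return motion, static.expand_as(motion)

class MotionAggregationBlock(nn.Module):
    """Hypothetical MAB: motion queries attend to static context, and a
    learned sigmoid gate limits how much static content flows back in."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, motion: torch.Tensor, static: torch.Tensor):
        ctx, _ = self.attn(motion, static, static)       # cross-attention
        g = self.gate(torch.cat([motion, ctx], dim=-1))  # per-channel gate in [0, 1]
        return self.norm(motion + g * ctx)               # gated residual update

# Shape check: 2 clips, 8 frames, ViT-B/16 width 512.
# m, s = MotionSeparationModule(512)(torch.randn(2, 8, 512))
# out = MotionAggregationBlock(512)(m, s)  # -> (2, 8, 512)
```

On this reading, the sigmoid gate is what "without re-coupling redundant information" most plausibly refers to: static context enters the motion stream only where the gate opens.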
If this is right
- The framework produces consistent gains over prior CLIP-based zero-shot methods on standard coarse and fine-grained action benchmarks.
- Negative prompts allow the model to represent what an action is not, supporting better transfer to classes absent from training.
- Gated cross-attention in the aggregation block keeps motion features clean without reintroducing static redundancy.
- The same alignment strategy works across both broad category sets and detailed action distinctions.
Where Pith is reading between the lines
- The motion-disentanglement idea could transfer to other video-text tasks such as dense captioning or temporal localization.
- Negative prompts might reduce overconfident predictions on ambiguous or out-of-distribution video clips.
- Independent ablations of the separation and aggregation steps would clarify which part drives most of the reported improvement.
- The approach points toward lightweight adaptation techniques that avoid full model retraining when new action classes appear.
Load-bearing premise
That isolating motion features and aligning them with negative prompts will reliably shrink the semantic gap to unseen actions without creating new errors on real video distributions.
What would settle it
Running the method on a fresh fine-grained action dataset where it fails to beat the strongest prior CLIP zero-shot baseline would falsify the central claim.
Original abstract
Zero-shot action recognition is challenging due to the semantic gap between seen and unseen classes. We present a novel framework that enhances CLIP with disentangled embeddings and semantic-guided interaction. A Motion Separation Module (MSM) separates motion-sensitive and global-static features, while a Motion Aggregation Block (MAB) employs gated cross-attention to refine motion representation without re-coupling redundant information. To facilitate generalization to unseen categories, we enforce semantic alignment between video features and textual representations by aligning projected embeddings with positive textual prompts, while leveraging negative prompts to explicitly model "non-class" semantics. Experiments on standard benchmarks demonstrate that our method consistently outperforms prior CLIP-based approaches, achieving robust zero-shot action recognition across both coarse and fine-grained datasets.
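Read literally, the alignment objective pairs a standard CLIP-style positive-prompt classification term with a penalty on similarity to a matched negative prompt. A minimal PyTorch sketch under that reading follows; the temperature `tau`, the hinge form of the negative term, and the weight `lam` are assumptions, since the abstract does not specify the loss.

```python
import torch
import torch.nn.functional as F

def alignment_loss(video_emb, labels, pos_text_emb, neg_text_emb, tau=0.07, lam=1.0):
    """Hypothetical objective: video_emb (B, D) pooled video features,
    labels (B,) class indices, pos/neg_text_emb (C, D) CLIP text embeddings
    of the positive class prompts and their matched negative prompts."""
    v = F.normalize(video_emb, dim=-1)
    tp = F.normalize(pos_text_emb, dim=-1)
    tn = F.normalize(neg_text_emb, dim=-1)

    # Positive term: standard CLIP-style classification over class prompts.
    loss_pos = F.cross_entropy(v @ tp.t() / tau, labels)

    # Negative term: penalize any positive cosine similarity between a video
    # and its own class's "not <class>" prompt (hinge at zero; assumed form).
    neg_sim = (v * tn[labels]).sum(dim=-1)
    loss_neg = F.relu(neg_sim).mean()
    return loss_pos + lam * loss_neg
```

The positive term is exactly CLIP's contrastive classification loss; only the hinge on the matched negative prompt is new here, and its form is a guess at what "explicitly model 'non-class' semantics" could mean in practice.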
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a framework for zero-shot video action recognition that augments CLIP with a Motion Separation Module (MSM) to disentangle motion-sensitive from global-static features and a Motion Aggregation Block (MAB) that uses gated cross-attention to refine motion representations. Semantic alignment is enforced by projecting video embeddings to match positive textual prompts while using negative prompts to explicitly capture non-class semantics, with the goal of closing the semantic gap for unseen action classes. The central claim is that this yields consistent outperformance over prior CLIP-based methods on standard benchmarks for both coarse- and fine-grained datasets.
Significance. If the empirical results hold after proper validation, the work offers a concrete mechanism for improving generalization in zero-shot settings by combining motion disentanglement with explicit negative-prompt modeling of non-class semantics. This could be useful for fine-grained actions where motion overlap is high. The approach is grounded in existing CLIP architectures and does not appear to introduce new free parameters beyond standard training, which is a positive attribute. However, the absence of any quantitative metrics, baselines, or ablations in the abstract limits immediate assessment of impact.
Major comments (2)
- [Abstract] The claim that the method 'consistently outperforms prior CLIP-based approaches' is presented without any accuracy numbers, dataset names, splits, error bars, or comparison tables. This is load-bearing for the central empirical claim; without these elements the soundness of the outperformance assertion cannot be evaluated.
- [Methods] Negative prompt construction: The procedure for generating negative prompts for truly unseen classes is not specified (e.g., fixed 'not [class]' templates, vocabulary drawn from seen classes, or learned). This detail is critical because reliance on training-class priors would violate standard zero-shot protocols and could produce non-generalizable gains, especially on fine-grained datasets where motion overlap makes 'non-class' semantics ambiguous.
Minor comments (1)
- [Abstract] The phrase 'disentangled embeddings and semantic-guided interaction' is used without a one-sentence pointer to the MSM/MAB modules or the alignment objective; a brief clarification would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the changes we will make in the revision.
Point-by-point responses
-
Referee: [Abstract] The claim that the method 'consistently outperforms prior CLIP-based approaches' is presented without any accuracy numbers, dataset names, splits, error bars, or comparison tables. This is load-bearing for the central empirical claim; without these elements the soundness of the outperformance assertion cannot be evaluated.
Authors: We agree that the abstract would benefit from greater specificity to support the central empirical claim. In the revised manuscript, we will update the abstract to explicitly name the standard benchmarks (UCF101, HMDB51, and a fine-grained dataset such as Something-Something V2) and the zero-shot splits used, and to point directly to the quantitative comparisons and tables in the experiments section. This will allow immediate evaluation of the outperformance claim while respecting abstract length constraints. revision: yes
-
Referee: [Methods] Negative prompt construction: The procedure for generating negative prompts for truly unseen classes is not specified (e.g., fixed 'not [class]' templates, vocabulary drawn from seen classes, or learned). This detail is critical because reliance on training-class priors would violate standard zero-shot protocols and could produce non-generalizable gains, especially on fine-grained datasets where motion overlap makes 'non-class' semantics ambiguous.
Authors: We thank the referee for this important clarification request. The negative prompts are constructed via the fixed template 'not [class]' using the name of the target (unseen) action class at evaluation time. This is standard in zero-shot settings where test class names are provided for prompt construction, and no vocabulary or information from the training classes is used. No learned components are involved. We will add an explicit paragraph in the Methods section describing this construction process, including an example, to confirm adherence to zero-shot protocols. revision: yes
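Taken at face value, the construction the rebuttal describes fits in a few lines. A sketch assuming OpenAI's open-source `clip` package and a simple score fusion in which similarity to the 'not [class]' prompt is subtracted at inference; the positive template wording and the fusion weight `alpha` are assumptions, as the rebuttal specifies only the negative template.

```python
import torch
import clip  # OpenAI's open-source CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

def class_text_embeddings(class_names):
    """Negative prompts use the fixed 'not <class>' template from the
    rebuttal; the positive template wording here is an assumption."""
    pos_tokens = clip.tokenize([f"a video of a person {c}" for c in class_names]).to(device)
    neg_tokens = clip.tokenize([f"not {c}" for c in class_names]).to(device)
    with torch.no_grad():
        tp = model.encode_text(pos_tokens).float()
        tn = model.encode_text(neg_tokens).float()
    return (tp / tp.norm(dim=-1, keepdim=True),
            tn / tn.norm(dim=-1, keepdim=True))

def zero_shot_scores(video_emb, tp, tn, alpha=0.5):
    # video_emb: (B, D) L2-normalized pooled video features.
    # Reward similarity to each class prompt, penalize similarity
    # to its negation; alpha is an assumed fusion weight.
    return video_emb @ tp.t() - alpha * (video_emb @ tn.t())

# Usage sketch:
# tp, tn = class_text_embeddings(["archery", "juggling", "surfing"])
# preds = zero_shot_scores(v, tp, tn).argmax(dim=-1)
```

Note that only test-class names enter the templates, which is consistent with the rebuttal's claim that no training-class vocabulary is used.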
Circularity Check
No circularity: empirical method validated on benchmarks without self-referential derivations.
Full rationale
The paper proposes architectural components (MSM for motion separation, MAB for gated cross-attention, and positive/negative prompt alignment) to address the semantic gap in zero-shot action recognition. All load-bearing claims reduce to experimental outperformance on standard coarse and fine-grained datasets rather than any first-principles derivation, fitted-parameter prediction, or self-citation chain. No equations appear in the abstract, and the described framework introduces new modules whose effectiveness is measured externally against prior CLIP baselines; nothing reduces to its own inputs by construction. This is a standard empirical contribution whose validity rests on benchmark results, not internal tautology.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
INTRODUCTION In recent years, large-scale vision–language pre-trained models such as CLIP [1] have shown remarkable success in cross-modal learning, driving significant advances in zero-shot learning (ZSL) [2]. Extending this paradigm to the video domain, zero-shot action recognition (ZSAR) [3] seeks to classify unseen actions by transferring knowledge...
-
[2]
Motion-Guided Semantic Alignment with Negative Prompts for Zero-Shot Video Action Recognition
PROPOSED METHOD 2.1. CLIP-based Video-Text Representation As shown in Fig. 1, given an input video $F_v \in \mathbb{R}^{T \times C \times H \times W}$ with $T$ frames, we first extract visual representations using a frozen CLIP visual encoder. To adapt CLIP to the video domain while preserving zero-shot generalization, we introduce a lightweight Dual Adapter (DA) [17], which is shared across both...
2026
-
[3]
All results are reported in accuracy (%).

| Method | Publication | HMDB-51 | UCF-101 | K-600 |
| --- | --- | --- | --- | --- |
| Methods with Vision Training | | | | |
| ER-ZSAR [3] | ICCV'21 | 35.3±4.6 | 51.8±2.9 | 42.1±1.4 |
| JigSawNet [4] | TIP'19 | 39.3±3.9 | 56.8±2.8 | — |
| Methods with Vision-Language Training | | | | |
| A5 [5] | ECCV'22 | 44.3±2.2 | 69.3±4.2 | — |
| X-CLIP [20] | ECCV'22 | 46.3±0.6 | 70.3±2.3 | 67.1±1.0 |
| Vita-CLIP [7] | CVPR'23 | 48.6±0.6 | 75.0±0… | |
-
[4]
Accuracy in %
HM = harmonic mean of Base and Novel. Accuracy in %.

| Method | Kinetics-400 Base | Novel | HM | HMDB-51 Base | Novel | HM |
| --- | --- | --- | --- | --- | --- | --- |
| Vanilla CLIP B/16 [1] | 53.3 | 46.8 | 49.8 | 53.3 | 46.8 | 49.8 |
| ActionCLIP B/16 [6] | 69.0 | 57.2 | 62.6 | 69.1 | 37.3 | 48.5 |
| XCLIP B/16 [20] | 74.1 | 56.4 | 64.0 | 69.4 | 45.5 | 55.0 |
| A5 [5] | 74.1 | 56.4 | 64.0 | 46.2 | 16.0 | 23.8 |
| ViFi-CLIP B/16 [21] | 76.4 | 61.1 | 67.9 | 73.8 | 53.3 | 61.9 |
| ZAR B/16 [11] | 75… | | | | | |
-
[5]
EXPERIMENTS 3.1. Experimental Setup To evaluate our method, we conduct experiments on five widely used benchmarks: Kinetics-400 [22], Kinetics-600 [23], HMDB51 [24], UCF101 [25], and Something-Something V2 (SSv2) [26]. Kinetics-400 serves as the training set, while HMDB51, UCF101, SSv2, and Kinetics-600 (excluding overlaps with K400) are used for zero-shot...
-
[6]
These designs jointly enhance the model's ability to generalize from base to novel classes, providing a principled step toward zero-shot video action recognition
CONCLUSION In conclusion, our motion-guided framework effectively disentangles motion and global cues, integrates them into semantically aligned representations, and leverages negative prompts for robust learning. These designs jointly enhance the model's ability to generalize from base to novel classes, providing a principled step toward zero-shot video...
-
[7]
Learning transferable visual models from natural language supervision,
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763
2021
-
[8]
Zero-shot learning with semantic output codes,
Mark Palatucci, Dean Pomerleau, Geoffrey E Hinton, and Tom M Mitchell, “Zero-shot learning with semantic output codes,” Advances in Neural Information Processing Systems, vol. 22, 2009
2009
-
[9]
Elaborative rehearsal for zero-shot action recognition,
Shizhe Chen and Dong Huang, “Elaborative rehearsal for zero-shot action recognition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 13638–13647
2021
-
[10]
Jigsawnet: Shredded image reassembly using convolutional neural network and loop-based composition,
Canyu Le and Xin Li, “Jigsawnet: Shredded image reassembly using convolutional neural network and loop-based composition,” IEEE Transactions on Image Processing, vol. 28, no. 8, pp. 4000–4015, 2019
2019
-
[11]
Prompting visual-language models for efficient video understanding,
Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, and Weidi Xie, “Prompting visual-language models for efficient video understanding,” in European Conference on Computer Vision. Springer, 2022, pp. 105–124
2022
-
[12]
Actionclip: Adapting language-image pre-trained models for video action recognition,
Mengmeng Wang, Jiazheng Xing, Jianbiao Mei, Yong Liu, and Yunliang Jiang, “Actionclip: Adapting language-image pre-trained models for video action recognition,” IEEE Transactions on Neural Networks and Learning Systems, vol. 36, no. 1, pp. 625–637, 2023
2023
-
[13]
Vita-clip: Video and text adaptive clip via multimodal prompting,
Syed Talal Wasim, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, and Mubarak Shah, “Vita-clip: Video and text adaptive clip via multimodal prompting,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 23034–23044
2023
-
[14]
Ez-clip: Efficient zeroshot video action recognition,
Shahzad Ahmad, Sukalpa Chanda, and Yogesh S Rawat, “Ez-clip: Efficient zeroshot video action recognition,” arXiv preprint arXiv:2312.08010, 2023
2023
-
[15]
Is temporal prompting all we need for limited labeled action recognition?,
Shreyank Gowda, Boyan Gao, Xiao Gu, and Xiabo Jin, “Is temporal prompting all we need for limited labeled action recognition?,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 682–692
2025
-
[16]
Kronecker mask and interpretive prompts are language-action video learners,
Yang JingYi, Zitong YU, Nixiuming, He Jia, and Hui Li, “Kronecker mask and interpretive prompts are language-action video learners,” in The Thirteenth International Conference on Learning Representations, 2025
2025
-
[17]
Zar: Zero-shot action recognition with dynamic prompt tuning,
Qiyue Liang, Cheng Lu, Chun Tao, and Jan P Allebach, “Zar: Zero-shot action recognition with dynamic prompt tuning,” Electronic Imaging, vol. 37, pp. 1–10, 2025
2025
-
[18]
Match, expand and improve: Unsupervised finetuning for zero-shot action recognition with language knowledge,
Wei Lin, Leonid Karlinsky, Nina Shvetsova, Horst Possegger, Mateusz Kozinski, Rameswar Panda, Rogerio Feris, Hilde Kuehne, and Horst Bischof, “Match, expand and improve: Unsupervised finetuning for zero-shot action recognition with language knowledge,” in ICCV, 2023
2023
-
[19]
Telling stories for common sense zero-shot action recognition,
Shreyank N Gowda and Laura Sevilla-Lara, “Telling stories for common sense zero-shot action recognition,” in Proceedings of the Asian Conference on Computer Vision, 2024, pp. 4577–4594
2024
-
[20]
Building a multi-modal spatiotemporal expert for zero-shot action recognition with clip,
Yating Yu, Congqi Cao, Yueran Zhang, Qinyi Lv, Lingtong Min, and Yanning Zhang, “Building a multi-modal spatiotemporal expert for zero-shot action recognition with clip,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2025, vol. 39, pp. 9689–9697
2025
-
[21]
FROSTER: Frozen CLIP is a strong teacher for open-vocabulary action recognition,
Xiaohu Huang, Hao Zhou, Kun Yao, and Kai Han, “FROSTER: Frozen CLIP is a strong teacher for open-vocabulary action recognition,” in The Twelfth International Conference on Learning Representations, 2024
2024
-
[22]
Continual learning improves zero-shot action recognition,
Shreyank N Gowda, Davide Moltisanti, and Laura Sevilla-Lara, “Continual learning improves zero-shot action recognition,” in Proceedings of the Asian Conference on Computer Vision, 2024, pp. 3239–3256
2024
-
[23]
Adapterhub: A framework for adapting transformers,
Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder, Kyunghyun Cho, and Iryna Gurevych, “Adapterhub: A framework for adapting transformers,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020, pp. 46–54
2020
-
[24]
Understanding the impact of negative prompts: When and how do they take effect?,
Yuanhao Ban, Ruochen Wang, Tianyi Zhou, Minhao Cheng, Boqing Gong, and Cho-Jui Hsieh, “Understanding the impact of negative prompts: When and how do they take effect?,” in European Conference on Computer Vision. Springer, 2024, pp. 190–206
2024
-
[25]
Deep adaptive wavelet network,
Maria Ximena Bastidas Rodriguez, Adrien Gruson, Luisa Polania, Shin Fujieda, Flavio Prieto, Kohei Takayama, and Toshiya Hachisuka, “Deep adaptive wavelet network,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, pp. 3111–3119
2020
-
[26]
Expanding language-image pretrained models for general video recognition,
Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, and Haibin Ling, “Expanding language-image pretrained models for general video recognition,” in European Conference on Computer Vision. Springer, 2022, pp. 1–18
2022
-
[27]
Fine-tuned clip models are efficient video learners,
Hanoona Rasheed, Muhammad Uzair Khattak, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan, “Fine-tuned clip models are efficient video learners,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6545–6554
2023
-
[28]
The kinetics human action video dataset,
Andrew Zisserman, Joao Carreira, Karen Simonyan, Will Kay, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, et al., “The kinetics human action video dataset,” arXiv preprint arXiv:1705.06950, 2017
2017
-
[29]
-
[30]
Hmdb: a large video database for human motion recognition,
Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre, “Hmdb: a large video database for human motion recognition,” in 2011 International Conference on Computer Vision. IEEE, 2011, pp. 2556–2563
2011
-
[31]
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012
2012
-
[32]
The “something something” video database for learning and evaluating visual common sense,
Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al., “The “something something” video database for learning and evaluating visual common sense,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5842–5850
2017