Motion-Guided Semantic Alignment with Negative Prompts for Zero-Shot Video Action Recognition
Pith reviewed 2026-05-10 06:37 UTC · model grok-4.3
The pith
Separating motion features from static content and aligning videos with negative text prompts lets CLIP recognize actions never seen in training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present a novel framework that enhances CLIP with disentangled embeddings and semantic-guided interaction. A Motion Separation Module (MSM) separates motion-sensitive and global-static features, while a Motion Aggregation Block (MAB) employs gated cross-attention to refine motion representation without re-coupling redundant information. To facilitate generalization to unseen categories, we enforce semantic alignment between video features and textual representations by aligning projected embeddings with positive textual prompts, while leveraging negative prompts to explicitly model non-class semantics.
What carries the argument
The Motion Separation Module (MSM) that isolates motion-sensitive features, the Motion Aggregation Block (MAB) that performs gated cross-attention on motion, and the dual use of positive and negative textual prompts to enforce semantic alignment.
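The abstract names these modules without their internals. The sketch below is one plausible reading, not the authors' implementation: it assumes the MSM obtains motion-sensitive features from temporal differences and global-static features from a temporal mean, and that the MAB's gate is a learned sigmoid over the cross-attended context. Every layer, dimension, and name beyond "MSM" and "MAB" is an assumption.

```python
import torch
import torch.nn as nn

class MotionSeparationModule(nn.Module):
    """Hypothetical MSM: splits frame-level CLIP features into a
    motion-sensitive stream (temporal differences) and a global-static
    stream (temporal mean). The paper names the module; internals assumed."""
    def __init__(self, dim: int):
        super().__init__()
        self.motion_proj = nn.Linear(dim, dim)
        self.static_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor):
        # x: (B, T, D) per-frame features from a frozen CLIP visual encoder.
        diff = x[:, 1:] - x[:, :-1]                    # motion cue
        diff = torch.cat([diff, diff[:, -1:]], dim=1)  # pad back to T frames
        motion = self.motion_proj(diff)
        static = self.static_proj(x.mean(dim=1, keepdim=True))
        return motion, static.expand_as(motion)

class MotionAggregationBlock(nn.Module):
    """Hypothetical MAB: motion queries attend to static context, and a
    learned sigmoid gate limits how much static content flows back in."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, motion: torch.Tensor, static: torch.Tensor):
        ctx, _ = self.attn(motion, static, static)       # cross-attention
        g = self.gate(torch.cat([motion, ctx], dim=-1))  # per-channel gate in [0, 1]
        return self.norm(motion + g * ctx)               # gated residual update

# Shape check: 2 clips, 8 frames, ViT-B/16 width 512.
# m, s = MotionSeparationModule(512)(torch.randn(2, 8, 512))
# out = MotionAggregationBlock(512)(m, s)  # -> (2, 8, 512)
```

On this reading, the sigmoid gate is what "without re-coupling redundant information" most plausibly refers to: static context enters the motion stream only where the gate opens.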
If this is right
- The framework produces consistent gains over prior CLIP-based zero-shot methods on standard coarse and fine-grained action benchmarks.
- Negative prompts allow the model to represent what an action is not, supporting better transfer to classes absent from training.
- Gated cross-attention in the aggregation block keeps motion features clean without reintroducing static redundancy.
- The same alignment strategy works across both broad category sets and detailed action distinctions.
Where Pith is reading between the lines
- The motion-disentanglement idea could transfer to other video-text tasks such as dense captioning or temporal localization.
- Negative prompts might reduce overconfident predictions on ambiguous or out-of-distribution video clips.
- Independent ablations of the separation and aggregation steps would clarify which part drives most of the reported improvement.
- The approach points toward lightweight adaptation techniques that avoid full model retraining when new action classes appear.
Load-bearing premise
That isolating motion features and aligning them with negative prompts will reliably shrink the semantic gap to unseen actions without creating new errors on real video distributions.
What would settle it
Running the method on a fresh fine-grained action dataset where it fails to beat the strongest prior CLIP zero-shot baseline would falsify the central claim.
Original abstract
Zero-shot action recognition is challenging due to the semantic gap between seen and unseen classes. We present a novel framework that enhances CLIP with disentangled embeddings and semantic-guided interaction. A Motion Separation Module (MSM) separates motion-sensitive and global-static features, while a Motion Aggregation Block (MAB) employs gated cross-attention to refine motion representation without re-coupling redundant information. To facilitate generalization to unseen categories, we enforce semantic alignment between video features and textual representations by aligning projected embeddings with positive textual prompts, while leveraging negative prompts to explicitly model "non-class" semantics. Experiments on standard benchmarks demonstrate that our method consistently outperforms prior CLIP-based approaches, achieving robust zero-shot action recognition across both coarse and fine-grained datasets.
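Read literally, the alignment objective pairs a standard CLIP-style positive-prompt classification term with a penalty on similarity to a matched negative prompt. A minimal PyTorch sketch under that reading follows; the temperature `tau`, the hinge form of the negative term, and the weight `lam` are assumptions, since the abstract does not specify the loss.

```python
import torch
import torch.nn.functional as F

def alignment_loss(video_emb, labels, pos_text_emb, neg_text_emb, tau=0.07, lam=1.0):
    """Hypothetical objective: video_emb (B, D) pooled video features,
    labels (B,) class indices, pos/neg_text_emb (C, D) CLIP text embeddings
    of the positive class prompts and their matched negative prompts."""
    v = F.normalize(video_emb, dim=-1)
    tp = F.normalize(pos_text_emb, dim=-1)
    tn = F.normalize(neg_text_emb, dim=-1)

    # Positive term: standard CLIP-style classification over class prompts.
    loss_pos = F.cross_entropy(v @ tp.t() / tau, labels)

    # Negative term: penalize any positive cosine similarity between a video
    # and its own class's "not <class>" prompt (hinge at zero; assumed form).
    neg_sim = (v * tn[labels]).sum(dim=-1)
    loss_neg = F.relu(neg_sim).mean()
    return loss_pos + lam * loss_neg
```

The positive term is exactly CLIP's contrastive classification loss; only the hinge on the matched negative prompt is new here, and its form is a guess at what "explicitly model 'non-class' semantics" could mean in practice.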
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a framework for zero-shot video action recognition that augments CLIP with a Motion Separation Module (MSM) to disentangle motion-sensitive from global-static features and a Motion Aggregation Block (MAB) that uses gated cross-attention to refine motion representations. Semantic alignment is enforced by projecting video embeddings to match positive textual prompts while using negative prompts to explicitly capture non-class semantics, with the goal of closing the semantic gap for unseen action classes. The central claim is that this yields consistent outperformance over prior CLIP-based methods on standard benchmarks for both coarse- and fine-grained datasets.
Significance. If the empirical results hold after proper validation, the work offers a concrete mechanism for improving generalization in zero-shot settings by combining motion disentanglement with explicit negative-prompt modeling of non-class semantics. This could be useful for fine-grained actions where motion overlap is high. The approach is grounded in existing CLIP architectures and does not appear to introduce new free parameters beyond standard training, which is a positive attribute. However, the absence of any quantitative metrics, baselines, or ablations in the abstract limits immediate assessment of impact.
Major comments (2)
- [Abstract] The claim that the method 'consistently outperforms prior CLIP-based approaches' is presented without any accuracy numbers, dataset names, splits, error bars, or comparison tables. This is load-bearing for the central empirical claim; without these elements the soundness of the outperformance assertion cannot be evaluated.
- [Methods] Negative prompt construction: The procedure for generating negative prompts for truly unseen classes is not specified (e.g., fixed 'not [class]' templates, vocabulary drawn from seen classes, or learned). This detail is critical because reliance on training-class priors would violate standard zero-shot protocols and could produce non-generalizable gains, especially on fine-grained datasets where motion overlap makes 'non-class' semantics ambiguous.
Minor comments (1)
- [Abstract] The phrase 'disentangled embeddings and semantic-guided interaction' is used without a one-sentence pointer to the MSM/MAB modules or the alignment objective; a brief clarification would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the changes we will make in the revision.
Point-by-point responses
-
Referee: [Abstract] The claim that the method 'consistently outperforms prior CLIP-based approaches' is presented without any accuracy numbers, dataset names, splits, error bars, or comparison tables. This is load-bearing for the central empirical claim; without these elements the soundness of the outperformance assertion cannot be evaluated.
Authors: We agree that the abstract would benefit from greater specificity to support the central empirical claim. In the revised manuscript, we will update the abstract to explicitly name the standard benchmarks (UCF101, HMDB51, and a fine-grained dataset such as Something-Something V2) and the zero-shot splits used, and to point directly to the quantitative comparisons and tables in the experiments section. This will allow immediate evaluation of the outperformance claim while respecting abstract length constraints. revision: yes
-
Referee: [Methods] Negative prompt construction: The procedure for generating negative prompts for truly unseen classes is not specified (e.g., fixed 'not [class]' templates, vocabulary drawn from seen classes, or learned). This detail is critical because reliance on training-class priors would violate standard zero-shot protocols and could produce non-generalizable gains, especially on fine-grained datasets where motion overlap makes 'non-class' semantics ambiguous.
Authors: We thank the referee for this important clarification request. The negative prompts are constructed via the fixed template 'not [class]' using the name of the target (unseen) action class at evaluation time. This is standard in zero-shot settings where test class names are provided for prompt construction, and no vocabulary or information from the training classes is used. No learned components are involved. We will add an explicit paragraph in the Methods section describing this construction process, including an example, to confirm adherence to zero-shot protocols. revision: yes
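Taken at face value, the construction the rebuttal describes fits in a few lines. A sketch assuming OpenAI's open-source `clip` package and a simple score fusion in which similarity to the 'not [class]' prompt is subtracted at inference; the positive template wording and the fusion weight `alpha` are assumptions, as the rebuttal specifies only the negative template.

```python
import torch
import clip  # OpenAI's open-source CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

def class_text_embeddings(class_names):
    """Negative prompts use the fixed 'not <class>' template from the
    rebuttal; the positive template wording here is an assumption."""
    pos_tokens = clip.tokenize([f"a video of a person {c}" for c in class_names]).to(device)
    neg_tokens = clip.tokenize([f"not {c}" for c in class_names]).to(device)
    with torch.no_grad():
        tp = model.encode_text(pos_tokens).float()
        tn = model.encode_text(neg_tokens).float()
    return (tp / tp.norm(dim=-1, keepdim=True),
            tn / tn.norm(dim=-1, keepdim=True))

def zero_shot_scores(video_emb, tp, tn, alpha=0.5):
    # video_emb: (B, D) L2-normalized pooled video features.
    # Reward similarity to each class prompt, penalize similarity
    # to its negation; alpha is an assumed fusion weight.
    return video_emb @ tp.t() - alpha * (video_emb @ tn.t())

# Usage sketch:
# tp, tn = class_text_embeddings(["archery", "juggling", "surfing"])
# preds = zero_shot_scores(v, tp, tn).argmax(dim=-1)
```

Note that only test-class names enter the templates, which is consistent with the rebuttal's claim that no training-class vocabulary is used.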
Circularity Check
No circularity: empirical method validated on benchmarks without self-referential derivations.
Full rationale
The paper proposes architectural components (MSM for motion separation, MAB for gated cross-attention, and positive/negative prompt alignment) to address the semantic gap in zero-shot action recognition. All load-bearing claims reduce to experimental outperformance on standard coarse and fine-grained datasets rather than any first-principles derivation, fitted-parameter prediction, or self-citation chain. No equations appear in the abstract, and the described framework introduces new modules whose effectiveness is measured externally against prior CLIP baselines; nothing reduces to its own inputs by construction. This is a standard empirical contribution whose validity rests on benchmark results, not internal tautology.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
INTRODUCTION In recent years, large-scale vision–language pre-trained models such as CLIP [1] have shown remarkable success in cross-modal learning, driving significant advances in zero-shot learning (ZSL) [2]. Extending this paradigm to the video domain, zero-shot action recognition (ZSAR) [3] seeks to classify unseen actions by transferring knowledge...
-
[2]
Motion-Guided Semantic Alignment with Negative Prompts for Zero-Shot Video Action Recognition
PROPOSED METHOD 2.1. CLIP-based Video-Text Representation As shown in Fig. 1, given an input video $F_v \in \mathbb{R}^{T \times C \times H \times W}$ with $T$ frames, we first extract visual representations using a frozen CLIP visual encoder. To adapt CLIP to the video domain while preserving zero-shot generalization, we introduce a lightweight Dual Adapter (DA) [17], which is shared across both...
2026
-
[3]
All results are reported in accuracy (%).

| Method | Publication | HMDB-51 | UCF-101 | K-600 |
| --- | --- | --- | --- | --- |
| Methods with Vision Training | | | | |
| ER-ZSAR [3] | ICCV'21 | 35.3±4.6 | 51.8±2.9 | 42.1±1.4 |
| JigSawNet [4] | TIP'19 | 39.3±3.9 | 56.8±2.8 | — |
| Methods with Vision-Language Training | | | | |
| A5 [5] | ECCV'22 | 44.3±2.2 | 69.3±4.2 | — |
| X-CLIP [20] | ECCV'22 | 46.3±0.6 | 70.3±2.3 | 67.1±1.0 |
| Vita-CLIP [7] | CVPR'23 | 48.6±0.6 | 75.0±0… | |
-
[4]
Accuracy in %
HM = harmonic mean of Base and Novel. Accuracy in %.

| Method | Kinetics-400 Base | Novel | HM | HMDB-51 Base | Novel | HM |
| --- | --- | --- | --- | --- | --- | --- |
| Vanilla CLIP B/16 [1] | 53.3 | 46.8 | 49.8 | 53.3 | 46.8 | 49.8 |
| ActionCLIP B/16 [6] | 69.0 | 57.2 | 62.6 | 69.1 | 37.3 | 48.5 |
| XCLIP B/16 [20] | 74.1 | 56.4 | 64.0 | 69.4 | 45.5 | 55.0 |
| A5 [5] | 74.1 | 56.4 | 64.0 | 46.2 | 16.0 | 23.8 |
| ViFi-CLIP B/16 [21] | 76.4 | 61.1 | 67.9 | 73.8 | 53.3 | 61.9 |
| ZAR B/16 [11] | 75… | | | | | |
-
[5]
EXPERIMENTS 3.1. Experimental Setup To evaluate our method, we conduct experiments on five widely used benchmarks: Kinetics-400 [22], Kinetics-600 [23], HMDB51 [24], UCF101 [25], and Something-Something V2 (SSv2) [26]. Kinetics-400 serves as the training set, while HMDB51, UCF101, SSv2, and Kinetics-600 (excluding overlaps with K400) are used for zero-shot...
-
[6]
These designs jointly enhance the model's ability to generalize from base to novel classes, providing a principled step toward zero-shot video action recognition
CONCLUSION In conclusion, our motion-guided framework effectively disentangles motion and global cues, integrates them into semantically aligned representations, and leverages negative prompts for robust learning. These designs jointly enhance the model's ability to generalize from base to novel classes, providing a principled step toward zero-shot video...
-
[7]
Learning transferable visual models from natural language supervision,
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763
2021
-
[8]
Zero-shot learning with semantic output codes,
Mark Palatucci, Dean Pomerleau, Geoffrey E Hinton, and Tom M Mitchell, “Zero-shot learning with semantic output codes,” Advances in Neural Information Processing Systems, vol. 22, 2009
2009
-
[9]
Elaborative rehearsal for zero-shot action recognition,
Shizhe Chen and Dong Huang, “Elaborative rehearsal for zero-shot action recognition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 13638–13647
2021
-
[10]
Jigsawnet: Shredded image reassembly using convolutional neural network and loop-based composition,
Canyu Le and Xin Li, “Jigsawnet: Shredded image reassembly using convolutional neural network and loop-based composition,” IEEE Transactions on Image Processing, vol. 28, no. 8, pp. 4000–4015, 2019
2019
-
[11]
Prompting visual-language models for efficient video understanding,
Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, and Weidi Xie, “Prompting visual-language models for efficient video understanding,” in European Conference on Computer Vision. Springer, 2022, pp. 105–124
2022
-
[12]
Actionclip: Adapting language-image pre-trained models for video action recognition,
Mengmeng Wang, Jiazheng Xing, Jianbiao Mei, Yong Liu, and Yunliang Jiang, “Actionclip: Adapting language-image pre-trained models for video action recognition,” IEEE Transactions on Neural Networks and Learning Systems, vol. 36, no. 1, pp. 625–637, 2023
2023
-
[13]
Vita-clip: Video and text adaptive clip via multimodal prompting,
Syed Talal Wasim, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, and Mubarak Shah, “Vita-clip: Video and text adaptive clip via multimodal prompting,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 23034–23044
2023
-
[14]
Ez-clip: Efficient zeroshot video action recognition,
Shahzad Ahmad, Sukalpa Chanda, and Yogesh S Rawat, “Ez-clip: Efficient zeroshot video action recognition,” arXiv preprint arXiv:2312.08010, 2023
2023
-
[15]
Is temporal prompting all we need for limited labeled action recognition?,
Shreyank Gowda, Boyan Gao, Xiao Gu, and Xiabo Jin, “Is temporal prompting all we need for limited labeled action recognition?,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 682–692
2025
-
[16]
Kronecker mask and interpretive prompts are language-action video learners,
Yang JingYi, Zitong YU, Nixiuming, He Jia, and Hui Li, “Kronecker mask and interpretive prompts are language-action video learners,” in The Thirteenth International Conference on Learning Representations, 2025
2025
-
[17]
Zar: Zero-shot action recognition with dynamic prompt tuning,
Qiyue Liang, Cheng Lu, Chun Tao, and Jan P Allebach, “Zar: Zero-shot action recognition with dynamic prompt tuning,” Electronic Imaging, vol. 37, pp. 1–10, 2025
2025
-
[18]
Match, expand and improve: Unsupervised finetuning for zero-shot action recognition with language knowledge,
Wei Lin, Leonid Karlinsky, Nina Shvetsova, Horst Possegger, Mateusz Kozinski, Rameswar Panda, Rogerio Feris, Hilde Kuehne, and Horst Bischof, “Match, expand and improve: Unsupervised finetuning for zero-shot action recognition with language knowledge,” in ICCV, 2023
2023
-
[19]
Telling stories for common sense zero-shot action recognition,
Shreyank N Gowda and Laura Sevilla-Lara, “Telling stories for common sense zero-shot action recognition,” in Proceedings of the Asian Conference on Computer Vision, 2024, pp. 4577–4594
2024
-
[20]
Building a multi-modal spatiotemporal expert for zero-shot action recognition with clip,
Yating Yu, Congqi Cao, Yueran Zhang, Qinyi Lv, Lingtong Min, and Yanning Zhang, “Building a multi-modal spatiotemporal expert for zero-shot action recognition with clip,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2025, vol. 39, pp. 9689–9697
2025
-
[21]
FROSTER: Frozen CLIP is a strong teacher for open-vocabulary action recognition,
Xiaohu Huang, Hao Zhou, Kun Yao, and Kai Han, “FROSTER: Frozen CLIP is a strong teacher for open-vocabulary action recognition,” in The Twelfth International Conference on Learning Representations, 2024
2024
-
[22]
Continual learning improves zero-shot action recognition,
Shreyank N Gowda, Davide Moltisanti, and Laura Sevilla-Lara, “Continual learning improves zero-shot action recognition,” in Proceedings of the Asian Conference on Computer Vision, 2024, pp. 3239–3256
2024
-
[23]
Adapterhub: A framework for adapting transformers,
Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder, Kyunghyun Cho, and Iryna Gurevych, “Adapterhub: A framework for adapting transformers,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020, pp. 46–54
2020
-
[24]
Understanding the impact of negative prompts: When and how do they take effect?,
Yuanhao Ban, Ruochen Wang, Tianyi Zhou, Minhao Cheng, Boqing Gong, and Cho-Jui Hsieh, “Understanding the impact of negative prompts: When and how do they take effect?,” in European Conference on Computer Vision. Springer, 2024, pp. 190–206
2024
-
[25]
Deep adaptive wavelet network,
Maria Ximena Bastidas Rodriguez, Adrien Gruson, Luisa Polania, Shin Fujieda, Flavio Prieto, Kohei Takayama, and Toshiya Hachisuka, “Deep adaptive wavelet network,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, pp. 3111–3119
2020
-
[26]
Expanding language-image pretrained models for general video recognition,
Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, and Haibin Ling, “Expanding language-image pretrained models for general video recognition,” in European Conference on Computer Vision. Springer, 2022, pp. 1–18
2022
-
[27]
Fine-tuned clip models are efficient video learners,
Hanoona Rasheed, Muhammad Uzair Khattak, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan, “Fine-tuned clip models are efficient video learners,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6545–6554
2023
-
[28]
The kinetics human action video dataset,
Andrew Zisserman, Joao Carreira, Karen Simonyan, Will Kay, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, et al., “The kinetics human action video dataset,” arXiv preprint arXiv:1705.06950, 2017
2017
-
[29]
-
[30]
Hmdb: a large video database for human motion recognition,
Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre, “Hmdb: a large video database for human motion recognition,” in 2011 International Conference on Computer Vision. IEEE, 2011, pp. 2556–2563
2011
-
[31]
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012
2012
-
[32]
The “something something” video database for learning and evaluating visual common sense,
Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al., “The “something something” video database for learning and evaluating visual common sense,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5842–5850
2017