CEZSAR: A Contrastive Embedding Method for Zero-Shot Action Recognition
Pith reviewed 2026-05-09 18:46 UTC · model grok-4.3
The pith
A contrastive method aligns videos with text descriptions in a joint space to recognize actions never seen in training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a contrastive model encoding both videos and sentences into a shared embedding space, trained by pulling matching video-description pairs together and pushing apart automatically generated unpaired examples, produces representations that generalize to action classes absent from the training set. The model is reported to reach state-of-the-art accuracy on the UCF-101 and Kinetics-400 datasets across multiple zero-shot splits.
What carries the argument
The joint video-text embedding space trained with contrastive loss and automatic negative sampling to produce unpaired visual and textual examples.
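The abstract names a contrastive objective over matched and unpaired video-description examples but does not spell out the loss. As a point of reference, a symmetric InfoNCE-style batch loss is a common way to realize this kind of alignment; the sketch below is an assumed illustration of that family, not the paper's exact formulation (the temperature value and the use of in-batch negatives are assumptions).

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(x, axis):
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss: row i of each matrix is a matching
    video/description pair; every other row in the batch acts as a negative."""
    v = l2_normalize(video_emb)                 # (B, D) video embeddings
    t = l2_normalize(text_emb)                  # (B, D) text embeddings
    logits = (v @ t.T) / temperature            # (B, B) scaled cosine similarities
    diag = np.arange(len(logits))
    loss_v2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # video -> text
    loss_t2v = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> video
    return 0.5 * (loss_v2t + loss_t2v)
```

Under this objective, perfectly aligned pairs yield a lower loss than mismatched ones, which is what drives matching videos and descriptions toward each other in the joint space.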
If this is right
- State-of-the-art zero-shot accuracy on UCF-101 and Kinetics-400 under several training-test splits.
- Action classification at test time using only a textual description of the target class.
- Reduced dependence on collecting labeled video examples for every possible action class.
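The second bullet, classification from text alone, reduces at inference time to a nearest-neighbor search in the joint space. A minimal sketch, assuming cosine similarity is the matching score (the paper's exact scoring rule is not stated here):

```python
import numpy as np

def classify_zero_shot(video_emb, class_text_embs):
    """Assign the unseen class whose description embedding lies closest
    (by cosine similarity) to the video embedding in the joint space."""
    v = video_emb / np.linalg.norm(video_emb)
    t = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    return int(np.argmax(t @ v))
```

No video of the target class is needed; only its textual description must be embedded, which is what removes the requirement for labeled examples of every class.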
Where Pith is reading between the lines
- The same contrastive alignment procedure could be reused for zero-shot recognition of objects or events once suitable text descriptions exist.
- Automatic negative sampling may cut the cost of curating large vision-language training sets for other multimodal tasks.
- Applying the method to datasets with larger visual or linguistic differences would test how far the joint space generalizes.
Load-bearing premise
The alignment learned from seen actions and their descriptions will still match unseen actions to their textual descriptions despite differences in appearance and wording.
What would settle it
A controlled test on a fresh zero-shot split of a standard action dataset in which removing the automatic negative sampling step causes accuracy to drop below prior non-contrastive baselines.
Figures
Original abstract
This paper proposes a novel Zero-Shot Action Recognition~(ZSAR) method based on contrastive learning. In ZSAR, we aim to classify examples from classes that were missing during training. Two well-known problems remain in ZSAR: the semantic gap and the domain shift. A semantic gap occurs because label representations come from the textual domain (i.e., language models) and must be associated with visual representations (i.e., CNNs, RNNs, transformer-based). This multimodal nature implies that the semantic properties of the two spaces are not identical. On the other hand, the domain shift arises from differences between the training and test sets and is inherent to ZSAR once the test set is unknown. One of the most promising methods to address both issues is learning joint embedding spaces. Therefore, we propose a new model that encodes videos and sentences in a joint embedding space, trained by aligning videos with their natural-language descriptions. We design an automatic negative sampling procedure to augment the training dataset and generate unpaired data, i.e., visual appearance and unrelated descriptions. Our results are state-of-the-art on the UCF-101 and Kinetics-400 datasets under several split configurations. Our code is available at https://github.com/valterlej/cezsar.
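The abstract describes an automatic negative sampling procedure that generates unpaired data, i.e., a video's visual appearance combined with an unrelated description. The details of that procedure are not given in the abstract; the sketch below shows one simple assumed realization, pairing each video with descriptions drawn from other videos (the labels and sampling scheme are illustrative, not the authors' method).

```python
import random

def make_unpaired_examples(pairs, negatives_per_video=1, seed=0):
    """Given (video_id, description) training pairs, draw descriptions
    belonging to *other* videos and emit them as unpaired examples
    (label 0), augmenting the original positives (label 1)."""
    rng = random.Random(seed)
    augmented = [(v, d, 1) for v, d in pairs]
    for i, (video, _) in enumerate(pairs):
        # candidate descriptions are every description except this video's own
        others = [d for j, (_, d) in enumerate(pairs) if j != i]
        for desc in rng.sample(others, min(negatives_per_video, len(others))):
            augmented.append((video, desc, 0))
    return augmented
```

In practice such a scheme would need to avoid sampling descriptions from semantically identical classes, which is exactly the concern raised in the referee's second minor comment below.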
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes CEZSAR, a contrastive embedding method for zero-shot action recognition (ZSAR). Videos and natural-language descriptions are encoded into a shared embedding space and aligned via contrastive learning; an automatic negative sampling procedure generates unpaired video-description pairs to augment training. The approach targets the semantic gap between visual and textual modalities as well as domain shift between seen and unseen action classes. The authors report state-of-the-art results on UCF-101 and Kinetics-400 under multiple train/test splits and release the code.
Significance. If the reported gains hold under rigorous evaluation, the work supplies a straightforward, reproducible contrastive baseline for ZSAR that avoids elaborate architectures while directly addressing modality alignment and negative sampling. Public code release strengthens the contribution by enabling direct verification and extension.
minor comments (3)
- §4 (Experiments): the exact train/test splits, number of runs, and full baseline comparisons (including recent contrastive and generative ZSAR methods) should be tabulated with mean and standard deviation to support the SOTA claim.
- §3.2 (Negative sampling): clarify whether the automatic procedure can inadvertently sample from classes that appear in the test set under any of the reported splits; a short ablation would strengthen the domain-shift argument.
- Figure 2 and §3.1: the joint embedding diagram would benefit from explicit notation for the video and text encoders and the temperature parameter used in the contrastive loss.
Simulated Author's Rebuttal
We thank the referee for the positive summary, the recognition of our straightforward contrastive baseline, and the recommendation for minor revision. We are grateful for the emphasis on reproducibility and code release.
Circularity Check
No significant circularity in the derivation chain
full rationale
The paper presents a contrastive embedding method for zero-shot action recognition that trains a joint video-text space by aligning videos with their natural-language descriptions and using automatic negative sampling to generate unpaired data. This follows standard contrastive learning objectives without any self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations that collapse the central claims to tautologies. The SOTA empirical results on UCF-101 and Kinetics-400 under multiple splits are independent evaluations outside the training construction, with no equations or steps that reduce by construction to the inputs. The approach is self-contained as a direct application of multimodal alignment to address semantic gap and domain shift.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
[1] Balntas, V., Riba, E., Ponsa, D., Mikolajczyk, K.: Learning local feature descriptors with triplets and shallow convolutional neural networks. In: British Machine Vision Conference (BMVC). pp. 1–11 (2016). https://doi.org/10.5244/C.30.119
[2] Brattoli, B., Tighe, J., Zhdanov, F., Perona, P., Chalupka, K.: Rethinking zero-shot video classification: End-to-end training for realistic applications. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4613–4623 (Jun 2020). https://doi.org/10.1109/CVPR42600.2020.00467
[3] Bretti, C., Mettes, P.: Zero-shot action recognition from diverse object-scene compositions. In: British Machine Vision Conference (BMVC). pp. 1–14 (Nov 2021)
[4] Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4724–4733 (Jul 2017). https://doi.org/10.1109/CVPR.2017.502
[5] Chen, S., Huang, D.: Elaborative rehearsal for zero-shot action recognition. In: IEEE/CVF International Conference on Computer Vision (ICCV). pp. 13638–13647 (Oct 2021)
[6] Chen, X., et al.: AnyDoor: Zero-shot object-level image customization. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6593–6602 (2024). https://doi.org/10.1109/CVPR52733.2024.00630
[7] Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: IEEE Computer Vision and Pattern Recognition (CVPR). pp. 539–546 (2005). https://doi.org/10.1109/CVPR.2005.202
[8] Doshi, K., et al.: A multimodal benchmark and improved architecture for zero-shot learning. In: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 2010–2019 (2024). https://doi.org/10.1109/WACV57701.2024.00202
[9] Estevam, V., Laroca, R., Pedrini, H., Menotti, D.: Global semantic descriptors for zero-shot action recognition. IEEE Signal Processing Letters 29, 1843–1847 (2022). https://doi.org/10.1109/LSP.2022.3200605
[10] Estevam, V., Laroca, R., Pedrini, H., Menotti, D.: Dense video captioning using unsupervised semantic information. Journal of Visual Communication and Image Representation 107, 104385 (2025). https://doi.org/10.1016/j.jvcir.2024.104385
[11] Estevam, V., Pedrini, H., Menotti, D.: Zero-shot action recognition in videos: A survey. Neurocomputing 439, 159–175 (2021). https://doi.org/10.1016/j.neucom.2021.01.036
[12] Estevam, V., et al.: Tell me what you see: A zero-shot action recognition method based on natural language descriptions. Multimedia Tools and Applications 83, 28147–28173 (2024). https://doi.org/10.1007/s11042-023-16566-5
[13] Gowda, S.N.: Synthetic sample selection for generalized zero-shot learning. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). pp. 58–67 (2023). https://doi.org/10.1109/CVPRW59228.2023.00011
[14] Gowda, S.N., Moltisanti, D., Sevilla-Lara, L.: Continual learning improves zero-shot action recognition. In: Asian Conference on Computer Vision (ACCV). pp. 403–421 (2024). https://doi.org/10.1007/978-981-96-0908-6_23
[15] Gowda, S.N., Sevilla-Lara, L., Keller, F., Rohrbach, M.: CLASTER: Clustering with reinforcement learning for zero-shot action recognition. In: European Conference on Computer Vision (ECCV). pp. 187–203 (2022)
[16] Han, Z., Fu, Z., Chen, S., Yang, J.: Contrastive embedding for generalized zero-shot learning. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2371–2381 (2021). https://doi.org/10.1109/CVPR46437.2021.00240
[17] Huang, K., Miralles-Pechuán, L., Mckeever, S.: Combining text and image knowledge with GANs for zero-shot action recognition in videos. In: International Conference on Computer Vision Theory and Applications (VISAPP). pp. 623–631 (2022). https://doi.org/10.5220/0010903100003124
[18] Huang, K., Miralles-Pechuán, L., Mckeever, S.: Enhancing zero-shot action recognition in videos by combining GANs with text and images. SN Computer Science 4(4), 375 (2023). https://doi.org/10.1007/s42979-023-01803-3
[19] Kerrigan, A., Duarte, K., Rawat, Y., Shah, M.: Reformulating zero-shot action recognition for multi-label actions. In: International Conference on Neural Information Processing Systems (NeurIPS). vol. 34, pp. 25566–25577 (2021)
[20] Kim, T.S., et al.: DASZL: Dynamic action signatures for zero-shot learning. In: AAAI Conference on Artificial Intelligence (2021)
[21] Krishna, R., Hata, K., Ren, F., Fei-Fei, L., Niebles, J.C.: Dense-captioning events in videos. In: IEEE International Conference on Computer Vision (ICCV). pp. 706–715 (2017). https://doi.org/10.1109/ICCV.2017.83
[22] Lee, J.C., Lee, D.G.: ESC-ZSAR: Expanded semantics from categories with cross-attention for zero-shot action recognition. Expert Systems with Applications 255, 124786 (2024). https://doi.org/10.1016/j.eswa.2024.124786
[23] Li, X., Yang, X., Wei, K., Deng, C., Yang, M.: Siamese contrastive embedding network for compositional zero-shot learning. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 9326–9335 (Jun 2022)
[24] Lin, C.C., et al.: Cross-modal representation learning for zero-shot action recognition. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19978–19988 (Jun 2022)
[25] Liu, J., Kuipers, B., Savarese, S.: Recognizing human actions by attributes. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3337–3344 (2011). https://doi.org/10.1109/CVPR.2011.5995353
[26] Ma, P., Lu, H., Yang, B., Ran, W.: GAN-MVAE: A discriminative latent feature generation framework for generalized zero-shot learning. Pattern Recognition Letters 155, 77–83 (2022). https://doi.org/10.1016/j.patrec.2022.02.002
[27] Mettes, P., Thong, W., Snoek, C.: Object priors for classifying and localizing unseen actions. International Journal of Computer Vision 129, 1954–1971 (2021). https://doi.org/10.1007/s11263-021-01454-y
[28] Mettes, P., Snoek, C.G.M.: Spatial-aware object embeddings for zero-shot localization and classification of actions. In: IEEE International Conference on Computer Vision (ICCV). pp. 4453–4462 (2017). https://doi.org/10.1109/ICCV.2017.476
[29] Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning (ICML). vol. 139, pp. 8748–8763 (2021)
[30] Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using siamese BERT-networks. In: Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 3982–3992 (2019)
[31] Soomro, K., Zamir, A.R., Shah, M.: UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 1–6 (2012)
[32] Sun, S., et al.: CLIP as RNN: Segment countless visual concepts without training endeavor. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 13171–13182 (2024)
[33] Vaswani, A., et al.: Attention is all you need. In: International Conference on Neural Information Processing Systems (NeurIPS). pp. 6000–6010 (2017)
[34] Wang, Q., Chen, K.: Zero-shot visual recognition via bidirectional latent embedding. International Journal of Computer Vision 124(3), 356–383 (2017). https://doi.org/10.1007/s11263-017-1027-5
[35] Wu, W., et al.: Bidirectional cross-modal knowledge exploration for video recognition with pre-trained vision-language models. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6620–6630 (2023). https://doi.org/10.1109/CVPR52729.2023.00640
[36] Xu, H., et al.: VideoCLIP: Contrastive pre-training for zero-shot video-text understanding. In: Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 6787–6800 (Nov 2021). https://doi.org/10.18653/v1/2021.emnlp-main.544
[37] Xue, Y., Whitecross, K., Mirzasoleiman, B.: Investigating why contrastive learning benefits robustness against label noise. In: International Conference on Machine Learning (ICML). vol. 162, pp. 24851–24871 (2022)