Zero-Shot Sign Language Recognition: Can Textual Data Uncover Sign Languages?
Pith reviewed 2026-05-24 17:06 UTC · model grok-4.3
The pith
Textual dictionary descriptions enable zero-shot recognition of unseen sign language signs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By leveraging the descriptive text embeddings along with these spatio-temporal representations within a zero-shot learning framework, we show that textual data can indeed be useful in uncovering sign languages.
What carries the argument
Zero-shot learning framework that aligns spatio-temporal visual features from 3D-CNNs and bidirectional LSTMs with embeddings of dictionary textual descriptions.
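As an illustration of what such an alignment can look like at prediction time, here is a minimal sketch, not the paper's implementation. It assumes a linear map `W` from the visual feature space into the text embedding space and cosine similarity as the matching score; the function names, the toy dimensions, and the class names `SIGN_A` and `SIGN_B` are all hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v] if norm > 0 else list(v)


def project(W, x):
    """Map a visual feature x (length d_v) into text space using W (d_t rows, d_v columns)."""
    return [dot(row, x) for row in W]


def predict_unseen(x, W, class_text_embeddings):
    """Pick the class whose text embedding is most similar to the projected video feature."""
    z = normalize(project(W, x))
    scores = {name: dot(z, normalize(e)) for name, e in class_text_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy example: 3-dimensional visual feature, 2-dimensional text embeddings (assumed sizes).
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
video_feature = [0.9, 0.1, 0.5]
unseen_classes = {"SIGN_A": [1.0, 0.0], "SIGN_B": [0.0, 1.0]}

predicted, scores = predict_unseen(video_feature, W, unseen_classes)
```

Because the unseen classes enter only through their text embeddings, new signs can be added to `unseen_classes` without any video of them, which is the property the zero-shot setting relies on.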
If this is right
- Sign language recognition can extend to new classes using existing dictionary texts instead of new video labeling.
- The framework handles datasets where many classes have few training examples.
- Semantic alignment between text and visual features supports transfer to unseen sign classes.
- The ASL-Text dataset and approach establish a starting point for zero-shot sign language work.
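The consequences listed above presuppose that the classes used for training and those used for zero-shot evaluation do not overlap. A minimal sketch of such a class-disjoint split follows. The figure of 250 classes comes from the abstract; the number of unseen classes (50), the seed, and the class and video identifiers are placeholders chosen for illustration, not values reported here.

```python
import random


def split_classes(class_names, num_unseen, seed=0):
    """Partition class names into disjoint seen and unseen sets."""
    rng = random.Random(seed)
    shuffled = list(class_names)
    rng.shuffle(shuffled)
    unseen = sorted(shuffled[:num_unseen])
    seen = sorted(shuffled[num_unseen:])
    return seen, unseen


def filter_by_class(samples, allowed):
    """Keep only (video_id, label) pairs whose label is in the allowed set."""
    allowed = set(allowed)
    return [(video_id, label) for video_id, label in samples if label in allowed]


# 250 classes as stated in the abstract; 50 unseen classes is a hypothetical choice.
all_classes = [f"sign_{i:03d}" for i in range(250)]
seen, unseen = split_classes(all_classes, num_unseen=50)

# Hypothetical samples: training may only draw videos from seen classes.
samples = [("vid_0", "sign_000"), ("vid_1", "sign_001"), ("vid_2", "sign_249")]
train_samples = filter_by_class(samples, seen)
```

Keeping the split at the class level, rather than the video level, is what makes the evaluation zero-shot: no video of an unseen class is available during training.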
Where Pith is reading between the lines
- Existing dictionary resources could lower the cost of expanding sign language vocabularies in recognition systems.
- Text-based bridging may generalize to other gesture or action domains that carry descriptive metadata.
- The method suggests dictionary-style text captures enough shared structure for visual transfer in sequential gesture tasks.
Load-bearing premise
Textual descriptions from sign language dictionaries provide a sufficiently aligned semantic representation to enable effective knowledge transfer from seen to unseen visual sign classes.
What would settle it
If accuracy on unseen signs with text embeddings is no better than chance, or no better than a baseline that omits them, the claim that textual data aids recognition would not hold.
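The check described above can be made concrete with a few lines of code. This is a sketch under stated assumptions: the helper names, the `margin` parameter, and the toy predictions are hypothetical, and chance level is taken to be uniform guessing over the unseen classes.

```python
def top1_accuracy(predictions, labels):
    """Fraction of predictions that exactly match the ground-truth labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equally long")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def chance_level(num_unseen_classes):
    """Expected top-1 accuracy of uniform random guessing over the unseen classes."""
    return 1.0 / num_unseen_classes


def text_helps(acc_with_text, acc_baseline, num_unseen_classes, margin=0.0):
    """True only if the text-based model beats both chance and the baseline by the margin."""
    floor = max(acc_baseline, chance_level(num_unseen_classes))
    return acc_with_text > floor + margin


# Toy predictions on four unseen-class test videos (made-up labels).
accuracy = top1_accuracy(["a", "b", "c", "a"], ["a", "b", "a", "a"])
```

In practice the comparison would also need a significance test or confidence intervals, since a small test set can exceed chance by luck.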
Original abstract
We introduce the problem of zero-shot sign language recognition (ZSSLR), where the goal is to leverage models learned over the seen sign class examples to recognize the instances of unseen signs. To this end, we propose to utilize the readily available descriptions in sign language dictionaries as an intermediate-level semantic representation for knowledge transfer. We introduce a new benchmark dataset called ASL-Text that consists of 250 sign language classes and their accompanying textual descriptions. Compared to the ZSL datasets in other domains (such as object recognition), our dataset consists of a limited number of training examples for a large number of classes, which imposes a significant challenge. We propose a framework that operates over the body and hand regions by means of 3D-CNNs, and models longer temporal relationships via bidirectional LSTMs. By leveraging the descriptive text embeddings along with these spatio-temporal representations within a zero-shot learning framework, we show that textual data can indeed be useful in uncovering sign languages. We anticipate that the introduced approach and the accompanying dataset will provide a basis for further exploration of this new zero-shot learning problem.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the zero-shot sign language recognition (ZSSLR) problem, where models learned on seen sign classes are used to recognize unseen signs. It proposes leveraging textual descriptions from sign language dictionaries as intermediate semantic representations for knowledge transfer, introduces the ASL-Text benchmark dataset with 250 classes and accompanying text, and describes a visual pipeline using 3D-CNNs on body/hand regions plus bidirectional LSTMs for temporal modeling. The central claim is that combining these spatio-temporal visual features with text embeddings in a ZSL framework demonstrates the utility of textual data for uncovering sign languages.
Significance. If the experimental results support the claim, the work would be significant for defining a new ZSL task in sign language with the realistic constraint of limited examples per class, for releasing the ASL-Text dataset as a community benchmark, and for exploring dictionary text as a semantic bridge in a domain where visual data collection for new classes is costly. The choice of 3D-CNN + biLSTM for spatio-temporal features is a standard and appropriate architectural decision for the visual side.
Major comments (2)
- [Abstract] The abstract asserts that the proposed framework shows textual data is useful ('we show that textual data can indeed be useful in uncovering sign languages'), but supplies no metrics, baselines, implementation details, or error analysis; the central claim cannot be verified from the available text.
- [Abstract / Proposed framework] The assumption that textual descriptions from sign language dictionaries provide a sufficiently aligned semantic representation for effective ZSL transfer is load-bearing, yet the abstract gives no indication of how the text embeddings are aligned to the 3D-CNN + biLSTM features or whether misalignment due to high-level dictionary text (vs. fine-grained visual details) was diagnosed.
Simulated Author's Rebuttal
We thank the referee for their comments on our manuscript. We address each major comment below with references to the full paper where relevant.
- Referee: [Abstract] The abstract asserts that the proposed framework shows textual data is useful ('we show that textual data can indeed be useful in uncovering sign languages'), but supplies no metrics, baselines, implementation details, or error analysis; the central claim cannot be verified from the available text.
  Authors: We agree that the abstract, constrained by length, states the outcome at a high level without quantitative support. The full manuscript reports the experimental results, including accuracy metrics on seen and unseen classes, comparisons against baselines that omit textual embeddings, and analysis of the limited-examples-per-class setting. To make the central claim more verifiable from the abstract itself, we will revise it to include a representative performance figure.
  Revision: yes
- Referee: [Abstract / Proposed framework] The assumption that textual descriptions from sign language dictionaries provide a sufficiently aligned semantic representation for effective ZSL transfer is load-bearing, yet the abstract gives no indication of how the text embeddings are aligned to the 3D-CNN + biLSTM features or whether misalignment due to high-level dictionary text (vs. fine-grained visual details) was diagnosed.
  Authors: The alignment occurs by mapping the spatio-temporal visual features (3D-CNN on body/hand regions followed by a biLSTM) into the text embedding space via a learned compatibility function inside the zero-shot framework, as described in the methods. The ASL-Text experiments demonstrate positive transfer to unseen signs, which serves as empirical evidence that dictionary text provides usable semantic bridging despite its higher-level character. The paper explicitly notes the challenge of fine-grained visual details versus dictionary text in the dataset and results sections. The abstract does not detail this mechanism due to space limits; we can add a brief clause on semantic embedding alignment if the editor permits.
  Revision: partial
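To make the idea of mapping visual features into the text embedding space more tangible, the following sketch fits a linear map with plain stochastic gradient descent on a squared-error objective. This is an assumption made for illustration: the rebuttal names only "a learned compatibility function" and does not state its form or training loss, so the objective, the hyperparameters, and the toy data below are all hypothetical.

```python
def train_linear_map(features, targets, lr=0.1, epochs=200, weight_decay=0.0):
    """Fit W (d_t x d_v) so that W @ x approximates the text embedding of x's class.

    features: list of visual feature vectors from seen-class videos
    targets:  list of text embeddings of the corresponding classes
    """
    d_v, d_t = len(features[0]), len(targets[0])
    W = [[0.0] * d_v for _ in range(d_t)]  # deterministic zero initialisation
    for _ in range(epochs):
        for x, e in zip(features, targets):
            pred = [sum(w * a for w, a in zip(row, x)) for row in W]
            err = [p - t for p, t in zip(pred, e)]
            # Gradient step on 0.5 * ||W x - e||^2 plus optional L2 weight decay.
            for i in range(d_t):
                for j in range(d_v):
                    W[i][j] -= lr * (err[i] * x[j] + weight_decay * W[i][j])
    return W


# Toy seen-class data: two visual features and their (made-up) class text embeddings.
seen_features = [[1.0, 0.0], [0.0, 1.0]]
seen_text_embeddings = [[0.0, 1.0], [1.0, 0.0]]

W = train_linear_map(seen_features, seen_text_embeddings)
```

Once `W` is fitted on seen classes only, an unseen sign can be scored by projecting its video feature with `W` and comparing the result against the text embeddings of the unseen classes.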
Circularity Check
No circularity: empirical ZSL framework with external dataset evaluation
The paper defines a new ZSSLR task, releases the ASL-Text dataset with 250 classes and dictionary text, extracts visual features via 3D-CNN + biLSTM, and applies standard ZSL with text embeddings. No equations, fitted parameters, or self-citations are shown that reduce any claimed prediction to an input by construction. The central result is an empirical demonstration on held-out unseen classes, which is externally falsifiable and does not rely on self-definitional mappings, uniqueness theorems from the same authors, or renaming of prior results. This is a self-contained empirical contribution.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Textual descriptions from sign language dictionaries provide an effective intermediate semantic representation for zero-shot visual transfer.