pith. sign in

arxiv: 1907.10292 · v1 · pith:DBD2OPPRnew · submitted 2019-07-24 · 💻 cs.CV

Zero-Shot Sign Language Recognition: Can Textual Data Uncover Sign Languages?

Pith reviewed 2026-05-24 17:06 UTC · model grok-4.3

classification 💻 cs.CV
keywords zero-shot sign language recognitiontextual embeddingsASL-Text dataset3D-CNNbidirectional LSTMknowledge transfersign language dictionaries
0
0 comments X

The pith

Textual dictionary descriptions enable zero-shot recognition of unseen sign language signs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces zero-shot sign language recognition, a setting where models trained on some signs must identify new ones with no video examples available. It proposes using textual descriptions from sign language dictionaries as an intermediate semantic representation to transfer knowledge across classes. A new benchmark dataset called ASL-Text supplies 250 sign classes along with their dictionary descriptions to test the idea under conditions of limited training examples per class. Visual features are extracted from body and hand regions with 3D-CNNs and modeled over time with bidirectional LSTMs, then aligned with text embeddings inside a zero-shot framework. The central result is that this combination demonstrates textual data can support recognition of signs never seen in training videos.

Core claim

By leveraging the descriptive text embeddings along with these spatio-temporal representations within a zero-shot learning framework, we show that textual data can indeed be useful in uncovering sign languages.

What carries the argument

Zero-shot learning framework that aligns spatio-temporal visual features from 3D-CNNs and bidirectional LSTMs with embeddings of dictionary textual descriptions.

If this is right

  • Sign language recognition can extend to new classes using existing dictionary texts instead of new video labeling.
  • The framework handles datasets where many classes have few training examples.
  • Semantic alignment between text and visual features supports transfer to unseen sign classes.
  • The ASL-Text dataset and approach establish a starting point for zero-shot sign language work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Existing dictionary resources could lower the cost of expanding sign language vocabularies in recognition systems.
  • Text-based bridging may generalize to other gesture or action domains that carry descriptive metadata.
  • The method suggests dictionary-style text captures enough shared structure for visual transfer in sequential gesture tasks.

Load-bearing premise

Textual descriptions from sign language dictionaries provide a sufficiently aligned semantic representation to enable effective knowledge transfer from seen to unseen visual sign classes.

What would settle it

If accuracy on unseen signs drops to chance level when text embeddings are included compared to a visual-only baseline, the claim that textual data aids recognition would not hold.

Figures

Figures reproduced from arXiv: 1907.10292 by Nazli Ikizler-Cinbis, Ramazan Gokberk Cinbis, Yunus Can Bilge.

Figure 1
Figure 1. Figure 1: Overview of the proposed zero-shot sign language recognition (ZSSLR) approach. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Example sequences and corresponding textual descriptions from the ASL-Text [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: t-SNE visualization of sign descriptions using BERT-[ [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Example predictions of our proposed model. The first four rows show examples [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
read the original abstract

We introduce the problem of zero-shot sign language recognition (ZSSLR), where the goal is to leverage models learned over the seen sign class examples to recognize the instances of unseen signs. To this end, we propose to utilize the readily available descriptions in sign language dictionaries as an intermediate-level semantic representation for knowledge transfer. We introduce a new benchmark dataset called ASL-Text that consists of 250 sign language classes and their accompanying textual descriptions. Compared to the ZSL datasets in other domains (such as object recognition), our dataset consists of limited number of training examples for a large number of classes, which imposes a significant challenge. We propose a framework that operates over the body and hand regions by means of 3D-CNNs, and models longer temporal relationships via bidirectional LSTMs. By leveraging the descriptive text embeddings along with these spatio-temporal representations within a zero-shot learning framework, we show that textual data can indeed be useful in uncovering sign languages. We anticipate that the introduced approach and the accompanying dataset will provide a basis for further exploration of this new zero-shot learning problem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces the zero-shot sign language recognition (ZSSLR) problem, where models learned on seen sign classes are used to recognize unseen signs. It proposes leveraging textual descriptions from sign language dictionaries as intermediate semantic representations for knowledge transfer, introduces the ASL-Text benchmark dataset with 250 classes and accompanying text, and describes a visual pipeline using 3D-CNNs on body/hand regions plus bidirectional LSTMs for temporal modeling. The central claim is that combining these spatio-temporal visual features with text embeddings in a ZSL framework demonstrates the utility of textual data for uncovering sign languages.

Significance. If the experimental results support the claim, the work would be significant for defining a new ZSL task in sign language with the realistic constraint of limited examples per class, for releasing the ASL-Text dataset as a community benchmark, and for exploring dictionary text as a semantic bridge in a domain where visual data collection for new classes is costly. The choice of 3D-CNN + biLSTM for spatio-temporal features is a standard and appropriate architectural decision for the visual side.

major comments (2)
  1. [Abstract] Abstract: The abstract asserts that the proposed framework shows textual data is useful ('we show that textual data can indeed be useful in uncovering sign languages'), but supplies no metrics, baselines, implementation details, or error analysis; the central claim cannot be verified from the available text.
  2. [Abstract] Abstract / Proposed framework: The assumption that textual descriptions from sign language dictionaries provide a sufficiently aligned semantic representation for effective ZSL transfer is load-bearing, yet the abstract gives no indication of how the text embeddings are aligned to the 3D-CNN + biLSTM features or whether misalignment due to high-level dictionary text (vs. fine-grained visual details) was diagnosed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their comments on our manuscript. We address each major comment below with references to the full paper where relevant.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The abstract asserts that the proposed framework shows textual data is useful ('we show that textual data can indeed be useful in uncovering sign languages'), but supplies no metrics, baselines, implementation details, or error analysis; the central claim cannot be verified from the available text.

    Authors: We agree that the abstract, constrained by length, states the outcome at a high level without quantitative support. The full manuscript reports the experimental results, including accuracy metrics on seen and unseen classes, comparisons against baselines that omit textual embeddings, and analysis of the limited-examples-per-class setting. To make the central claim more verifiable from the abstract itself, we will revise it to include a representative performance figure. revision: yes

  2. Referee: [Abstract] Abstract / Proposed framework: The assumption that textual descriptions from sign language dictionaries provide a sufficiently aligned semantic representation for effective ZSL transfer is load-bearing, yet the abstract gives no indication of how the text embeddings are aligned to the 3D-CNN + biLSTM features or whether misalignment due to high-level dictionary text (vs. fine-grained visual details) was diagnosed.

    Authors: The alignment occurs by mapping the spatio-temporal visual features (3D-CNN on body/hand regions followed by bi-LSTM) into the text embedding space via a learned compatibility function inside the zero-shot framework, as described in the methods. The ASL-Text experiments demonstrate positive transfer to unseen signs, which serves as empirical evidence that dictionary text provides usable semantic bridging despite its higher-level character. The paper explicitly notes the challenge of fine-grained visual details versus dictionary text in the dataset and results sections. The abstract does not detail this mechanism due to space limits; we can add a brief clause on semantic embedding alignment if the editor permits. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical ZSL framework with external dataset evaluation

full rationale

The paper defines a new ZSSLR task, releases the ASL-Text dataset with 250 classes and dictionary text, extracts visual features via 3D-CNN + biLSTM, and applies standard ZSL with text embeddings. No equations, fitted parameters, or self-citations are shown that reduce any claimed prediction to an input by construction. The central result is an empirical demonstration on held-out unseen classes, which is externally falsifiable and does not rely on self-definitional mappings, uniqueness theorems from the same authors, or renaming of prior results. This is a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that dictionary text forms a usable semantic bridge; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption Textual descriptions from sign language dictionaries provide an effective intermediate semantic representation for zero-shot visual transfer
    This premise is required for the knowledge transfer mechanism to function.

pith-pipeline@v0.9.0 · 5731 in / 1019 out tokens · 23746 ms · 2026-05-24T17:06:35.081788+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 2 internal anchors

  1. [1]

    Label-embedding for attribute-based classification

    Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. Label-embedding for attribute-based classification. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 819–826, 2013

  2. [2]

    Learning sign language by watching tv (using weakly aligned subtitles)

    Patrick Buehler, Andrew Zisserman, and Mark Everingham. Learning sign language by watching tv (using weakly aligned subtitles). In Proc. IEEE Conf. Comput. Vis. Pattern Recog. , pages 2961–2968. IEEE, 2009

  3. [3]

    Subunets: End-to- end hand shape and continuous sign language recognition

    Necati Cihan Camgoz, Simon Hadfield, Oscar Koller, and Richard Bowden. Subunets: End-to- end hand shape and continuous sign language recognition. In Proc. IEEE Int. Conf. on Computer Vision, pages 3075–3084. IEEE, 2017. BILGE ET AL.: ZERO-SHOT SIGN LANGUAGE RECOGNITION 11

  4. [4]

    OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields

    Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. OpenPose: realtime multi-person 2D pose estimation using Part Affinity Fields. InarXiv preprint arXiv:1812.08008, 2018

  5. [5]

    Quo vadis, action recognition? a new model and the kinetics dataset

    Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 6299–6308, 2017

  6. [6]

    Learning phrase representations using rnn encoder-decoder for statistical machine translation

    Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. EMNLP, 2014

  7. [7]

    Neu- ral sign language translation

    Necati Cihan Camgoz, Simon Hadfield, Oscar Koller, Hermann Ney, and Richard Bowden. Neu- ral sign language translation. InProc. IEEE Conf. Comput. Vis. Pattern Recog., pages 7784–7793, 2018

  8. [8]

    Random House Webster’s Concise American Sign Language Dictionary

    Elaine Costello. Random House Webster’s Concise American Sign Language Dictionary . Ran- dom House, 1999

  9. [9]

    Recurrent convolutional neural networks for con- tinuous sign language recognition by staged optimization

    Runpeng Cui, Hu Liu, and Changshui Zhang. Recurrent convolutional neural networks for con- tinuous sign language recognition by staged optimization. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 7361–7369, 2017

  10. [10]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

  11. [11]

    Write a classifier: Zero-shot learning using purely textual descriptions

    Mohamed Elhoseiny, Babak Saleh, and Ahmed Elgammal. Write a classifier: Zero-shot learning using purely textual descriptions. InProc. IEEE Int. Conf. on Computer Vision, pages 2584–2591, 2013

  12. [12]

    Aligning asl for statistical translation using a discriminative word model

    Ali Farhadi and David Forsyth. Aligning asl for statistical translation using a discriminative word model. In Proc. IEEE Conf. Comput. Vis. Pattern Recog. , volume 2, pages 1471–1476. IEEE, 2006

  13. [13]

    Transfer learning in sign language

    Ali Farhadi, David Forsyth, and Ryan White. Transfer learning in sign language. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 1–8. IEEE, 2007

  14. [14]

    Describing objects by their attributes

    Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. Describing objects by their attributes. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 1778–1785. IEEE, 2009

  15. [15]

    Learning visual attributes

    Vittorio Ferrari and Andrew Zisserman. Learning visual attributes. In Proc. Adv. Neural Inf. Process. Syst., pages 433–440, 2008

  16. [16]

    Devise: A deep visual-semantic embedding model

    Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. Devise: A deep visual-semantic embedding model. In Proc. Adv. Neural Inf. Process. Syst. , pages 2121–2129, 2013

  17. [17]

    Transductive multi-view embedding for zero-shot recognition and annotation

    Yanwei Fu, Timothy M Hospedales, Tao Xiang, Zhenyong Fu, and Shaogang Gong. Transductive multi-view embedding for zero-shot recognition and annotation. In Proc. European Conf. on Computer Vision, pages 584–599. Springer, 2014

  18. [18]

    Learning multimodal latent attributes

    Yanwei Fu, Timothy M Hospedales, Tao Xiang, and Shaogang Gong. Learning multimodal latent attributes. IEEE transactions on pattern analysis and machine intelligence, 36(2):303–316, 2014

  19. [19]

    Framewise phoneme classification with bidirectional lstm and other neural network architectures

    Alex Graves and Jürgen Schmidhuber. Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks, 18(5-6):602–610, 2005. 12 BILGE ET AL.: ZERO-SHOT SIGN LANGUAGE RECOGNITION

  20. [20]

    Isolated sign language recognition using hidden markov mod- els

    Kirsti Grobel and Marcell Assan. Isolated sign language recognition using hidden markov mod- els. In IEEE International Conference on Systems, Man, and Cybernetics. Computational Cyber- netics and Simulation, volume 1, pages 162–167. IEEE, 1997

  21. [21]

    Video2vec embeddings recog- nize events when examples are scarce

    Amirhossein Habibian, Thomas Mensink, and Cees GM Snoek. Video2vec embeddings recog- nize events when examples are scarce. IEEE Trans. Pattern Anal. Mach. Intell. , 39(10):2089– 2103, 2017

  22. [22]

    Long short-term memory

    Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8): 1735–1780, 1997

  23. [23]

    Sign language recognition using model-based tracking and a 3d hopfield neural network

    Chung-Lin Huang and Wen-Yi Huang. Sign language recognition using model-based tracking and a 3d hopfield neural network. Machine vision and applications, 10(5-6):292–307, 1998

  24. [24]

    Sign language recognition using 3d convolutional neural networks

    Jie Huang, Wengang Zhou, Houqiang Li, and Weiping Li. Sign language recognition using 3d convolutional neural networks. In IEEE Int. Conf. on Multimedia and Expo (ICME), pages 1–6. IEEE, 2015

  25. [25]

    Objects2action: Classi- fying and localizing actions without any video example

    Mihir Jain, Jan C van Gemert, Thomas Mensink, and Cees GM Snoek. Objects2action: Classi- fying and localizing actions without any video example. In Proc. IEEE Int. Conf. on Computer Vision, pages 4588–4596, 2015

  26. [26]

    Weakly supervised training of a sign lan- guage recognition system using multiple instance learning density matrices

    Daniel Kelly, John Mc Donald, and Charles Markham. Weakly supervised training of a sign lan- guage recognition system using multiple instance learning density matrices. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 41(2):526–541, 2011

  27. [27]

    Semantic autoencoder for zero-shot learning

    Elyor Kodirov, Tao Xiang, and Shaogang Gong. Semantic autoencoder for zero-shot learning. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 3174–3183, 2017

  28. [28]

    Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers

    Oscar Koller, Jens Forster, and Hermann Ney. Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers. Comput. Vis. Image Understand., 141:108–125, 2015

  29. [29]

    Deep sign: hybrid cnn-hmm for continuous sign language recognition

    Oscar Koller, O Zargaran, Hermann Ney, and Richard Bowden. Deep sign: hybrid cnn-hmm for continuous sign language recognition. In British Machine Vision Conference, 2016

  30. [30]

    Learning to detect unseen object classes by between-class attribute transfer

    Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In Proc. IEEE Conf. Comput. Vis. Pattern Recog. , pages 951–958. IEEE, 2009

  31. [31]

    Attribute-based classification for zero-shot visual object categorization

    Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):453–465, 2014

  32. [32]

    Predicting deep zero-shot convolutional neural networks using textual descriptions

    Jimmy Lei Ba, Kevin Swersky, Sanja Fidler, et al. Predicting deep zero-shot convolutional neural networks using textual descriptions. In Proc. IEEE Int. Conf. on Computer Vision, pages 4247– 4255, 2015

  33. [33]

    Recognizing human actions by attributes

    Jingen Liu, Benjamin Kuipers, and Silvio Savarese. Recognizing human actions by attributes. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 3337–3344. IEEE, 2011

  34. [34]

    Hard zero shot learning for gesture recognition

    Naveen Madapana and Juan P Wachs. Hard zero shot learning for gesture recognition. In IAPR International Conference on Pattern Recognition, pages 3574–3579. IEEE, 2018

  35. [35]

    Costa: Co-occurrence statistics for zero-shot classification

    Thomas Mensink, Efstratios Gavves, and Cees GM Snoek. Costa: Co-occurrence statistics for zero-shot classification. In Proc. IEEE Conf. Comput. Vis. Pattern Recog. , pages 2441–2448, 2014. BILGE ET AL.: ZERO-SHOT SIGN LANGUAGE RECOGNITION 13

  36. [36]

    Distributed represen- tations of words and phrases and their compositionality

    Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed represen- tations of words and phrases and their compositionality. In Proc. Adv. Neural Inf. Process. Syst., pages 3111–3119, 2013

  37. [37]

    Online detection and classification of dynamic hand gestures with recurrent 3d convolutional neural network

    Pavlo Molchanov, Xiaodong Yang, Shalini Gupta, Kihwan Kim, Stephen Tyree, and Jan Kautz. Online detection and classification of dynamic hand gestures with recurrent 3d convolutional neural network. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 4207–4215, 2016

  38. [38]

    Gesture recognition: Focus on the hands

    Pradyumna Narayana, Ross Beveridge, and Bruce A Draper. Gesture recognition: Focus on the hands. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 5235–5244, 2018

  39. [39]

    Automated extraction of signs from con- tinuous sign language sentences using iterated conditional modes

    Sunita Nayak, Sudeep Sarkar, and Barbara Loeding. Automated extraction of signs from con- tinuous sign language sentences using iterated conditional modes. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 2583–2590. IEEE, 2009

  40. [40]

    Challenges in development of the american sign language lexicon video dataset (asllvd) corpus

    Carol Neidle, Ashwin Thangali, and Stan Sclaroff. Challenges in development of the american sign language lexicon video dataset (asllvd) corpus. In Proc. 5th Workshop on the Representa- tion and Processing of Sign Languages: Interactions between Corpus and Lexicon, Language Resources and Evaluation Conference (LREC) 2012, 2012

  41. [41]

    Zero-shot learning by convex combination of semantic embeddings

    Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg S Corrado, and Jeffrey Dean. Zero-shot learning by convex combination of semantic embeddings. Proc. Int. Conf. Learn. Represent., 2014

  42. [42]

    Relative attributes

    Devi Parikh and Kristen Grauman. Relative attributes. In Proc. IEEE Int. Conf. on Computer Vision, pages 503–510. IEEE, 2011

  43. [43]

    Sun attribute database: Discovering, annotating, and rec- ognizing scene attributes

    Genevieve Patterson and James Hays. Sun attribute database: Discovering, annotating, and rec- ognizing scene attributes. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 2751–2758. IEEE, 2012

  44. [44]

    Glove: Global vectors for word representation

    Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proc. of conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014

  45. [45]

    Large-scale learning of sign language by watching tv (using co-occurrences)

    Tomas Pfister, James Charles, and Andrew Zisserman. Large-scale learning of sign language by watching tv (using co-occurrences). In British Machine Vision Conference, 2013

  46. [46]

    Domain-adaptive discriminative one-shot learning of gestures

    Tomas Pfister, James Charles, and Andrew Zisserman. Domain-adaptive discriminative one-shot learning of gestures. In Proc. European Conf. on Computer Vision , pages 814–829. Springer, 2014

  47. [47]

    Sign classification in sign language corpora with deep neural networks

    Lionel Pigou, Mieke Van Herreweghe, and Joni Dambre. Sign classification in sign language corpora with deep neural networks. In International Conference on Language Resources and Evaluation (LREC) Workshop, pages 175–178, 2016

  48. [48]

    Zero- shot action recognition with error-correcting output codes

    Jie Qin, Li Liu, Ling Shao, Fumin Shen, Bingbing Ni, Jiaxin Chen, and Yunhong Wang. Zero- shot action recognition with error-correcting output codes. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 2833–2842, 2017

  49. [49]

    Evaluating knowledge transfer and zero- shot learning in a large-scale setting

    Marcus Rohrbach, Michael Stark, and Bernt Schiele. Evaluating knowledge transfer and zero- shot learning in a large-scale setting. In Proc. IEEE Conf. Comput. Vis. Pattern Recog. , pages 1641–1648. IEEE, 2011

  50. [50]

    An embarrassingly simple approach to zero-shot learning

    Bernardino Romera-Paredes and Philip Torr. An embarrassingly simple approach to zero-shot learning. In Proc. Int. Conf. Mach. Learn., pages 2152–2161, 2015. 14 BILGE ET AL.: ZERO-SHOT SIGN LANGUAGE RECOGNITION

  51. [51]

    Zero-shot learning through cross-modal transfer

    Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. Zero-shot learning through cross-modal transfer. In Proc. Adv. Neural Inf. Process. Syst., pages 935–943, 2013

  52. [52]

    Sign language structure: An outline of the visual communication systems of the american deaf

    William C Stokoe Jr. Sign language structure: An outline of the visual communication systems of the american deaf. Journal of deaf studies and deaf education, 10(1):3–37, 2005

  53. [53]

    Sign language production using neural machine translation and generative adversarial networks

    Stephanie Stoll, Necati Cihan Camgoz, Simon Hadfield, and Richard Bowden. Sign language production using neural machine translation and generative adversarial networks. In British Ma- chine Vision Conference. British Machine Vision Association, 2018

  54. [54]

    Fine-grained object recognition and zero-shot learning in remote sensing imagery.IEEE Transactions on Geoscience and Remote Sensing, 56(2):770–779, 2018

    Gencer Sumbul, Ramazan Gokberk Cinbis, and Selim Aksoy. Fine-grained object recognition and zero-shot learning in remote sensing imagery.IEEE Transactions on Geoscience and Remote Sensing, 56(2):770–779, 2018

  55. [55]

    Going deeper with convolutions

    Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 1–9, 2015

  56. [56]

    Recognition of sign language motion images

    Shinichi Tamura and Shingo Kawasaki. Recognition of sign language motion images. Pattern recognition, 21(4):343–353, 1988

  57. [57]

    Recognizing unfamiliar gestures for human-robot interac- tion through zero-shot learning

    Wil Thomason and Ross A Knepper. Recognizing unfamiliar gestures for human-robot interac- tion through zero-shot learning. In International Symposium on Experimental Robotics , pages 841–852. Springer, 2016

  58. [58]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proc. Adv. Neural Inf. Process. Syst., pages 5998–6008, 2017

  59. [59]

    The caltech- ucsd birds-200-2011 dataset

    Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech- ucsd birds-200-2011 dataset. 2011

  60. [60]

    Isolated sign language recognition with grassmann covariance matrices

    Hanjie Wang, Xiujuan Chai, Xiaopeng Hong, Guoying Zhao, and Xilin Chen. Isolated sign language recognition with grassmann covariance matrices. ACM Transactions on Accessible Computing (TACCESS), 8(4):14, 2016

  61. [61]

    Alternative semantic representations for zero-shot human action recognition

    Qian Wang and Ke Chen. Alternative semantic representations for zero-shot human action recognition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 87–102. Springer, 2017

  62. [62]

    Large scale image annotation: learning to rank with joint word-image embeddings

    Jason Weston, Samy Bengio, and Nicolas Usunier. Large scale image annotation: learning to rank with joint word-image embeddings. Machine learning, 81(1):21–35, 2010

  63. [63]

    Vision-based gesture recognition: A review

    Ying Wu and Thomas S Huang. Vision-based gesture recognition: A review. In International Gesture Workshop, pages 103–115. Springer, 1999

  64. [64]

    Zero-shot learning-the good, the bad and the ugly

    Yongqin Xian, Bernt Schiele, and Zeynep Akata. Zero-shot learning-the good, the bad and the ugly. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 4582–4591, 2017

  65. [65]

    Hospedales, and Shaogang Gong

    Xun Xu, Timothy M. Hospedales, and Shaogang Gong. Semantic embedding space for zero- shot action recognition. 2015 IEEE International Conference on Image Processing (ICIP), pages 63–67, 2015

  66. [66]

    Transductive zero-shot action recognition by word-vector embedding

    Xun Xu, Timothy Hospedales, and Shaogang Gong. Transductive zero-shot action recognition by word-vector embedding. International Journal of Computer Vision, 123(3):309–333, 2017

  67. [67]

    Towards universal representation for unseen action recognition

    Yi Zhu, Yang Long, Yu Guan, Shawn Newsam, and Ling Shao. Towards universal representation for unseen action recognition. In Proc. IEEE Conf. Comput. Vis. Pattern Recog. , pages 9436– 9445, 2018