Zero-Shot Sign Language Recognition: Can Textual Data Uncover Sign Languages?

Nazli Ikizler-Cinbis; Ramazan Gokberk Cinbis; Yunus Can Bilge

arxiv: 1907.10292 · v1 · pith:DBD2OPPRnew · submitted 2019-07-24 · 💻 cs.CV

Pith reviewed 2026-05-24 17:06 UTC · model grok-4.3

keywords zero-shot sign language recognition · textual embeddings · ASL-Text dataset · 3D-CNN · bidirectional LSTM · knowledge transfer · sign language dictionaries

The pith

Textual dictionary descriptions enable zero-shot recognition of unseen sign language signs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces zero-shot sign language recognition, a setting where models trained on some signs must identify new ones with no video examples available. It proposes using textual descriptions from sign language dictionaries as an intermediate semantic representation to transfer knowledge across classes. A new benchmark dataset called ASL-Text supplies 250 sign classes along with their dictionary descriptions to test the idea under conditions of limited training examples per class. Visual features are extracted from body and hand regions with 3D-CNNs and modeled over time with bidirectional LSTMs, then aligned with text embeddings inside a zero-shot framework. The central result is that textual data can support recognition of signs never seen in training videos.
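To make the zero-shot setting concrete, here is a minimal sketch of how the classes might be partitioned so that no unseen class contributes training videos. The 250-class total comes from the summary above; the 200/50 split, the seed, and the helper name `split_classes` are hypothetical choices for illustration, not the split used in the paper.

```python
import random


def split_classes(class_ids, num_unseen, seed=0):
    """Partition class ids into disjoint seen and unseen sets.

    Seen classes supply training videos; unseen classes are represented
    only by their dictionary text at training time.
    """
    ids = list(class_ids)
    if not 0 < num_unseen < len(ids):
        raise ValueError("num_unseen must be between 1 and len(class_ids) - 1")
    random.Random(seed).shuffle(ids)
    unseen = sorted(ids[:num_unseen])
    seen = sorted(ids[num_unseen:])
    return seen, unseen


# 250 classes as in ASL-Text; the 50 held-out classes are an assumed number.
seen, unseen = split_classes(range(250), num_unseen=50)
print(len(seen), len(unseen))
```

The property that matters is disjointness: any video from an unseen class leaking into training would turn the evaluation into ordinary supervised recognition.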

Core claim

By leveraging the descriptive text embeddings along with these spatio-temporal representations within a zero-shot learning framework, we show that textual data can indeed be useful in uncovering sign languages.

What carries the argument

Zero-shot learning framework that aligns spatio-temporal visual features from 3D-CNNs and bidirectional LSTMs with embeddings of dictionary textual descriptions.

If this is right

Sign language recognition can extend to new classes using existing dictionary texts instead of new video labeling.
The framework handles datasets where many classes have few training examples.
Semantic alignment between text and visual features supports transfer to unseen sign classes.
The ASL-Text dataset and approach establish a starting point for zero-shot sign language work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Existing dictionary resources could lower the cost of expanding sign language vocabularies in recognition systems.
Text-based bridging may generalize to other gesture or action domains that carry descriptive metadata.
The method suggests dictionary-style text captures enough shared structure for visual transfer in sequential gesture tasks.

Load-bearing premise

Textual descriptions from sign language dictionaries provide a sufficiently aligned semantic representation to enable effective knowledge transfer from seen to unseen visual sign classes.

What would settle it

If accuracy on unseen signs drops to chance level when text embeddings are included compared to a visual-only baseline, the claim that textual data aids recognition would not hold.
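A rough sketch of that check, assuming top-1 accuracy over unseen classes as the metric and uniform guessing as the chance level. The function names, the number of unseen classes, and the accuracy values below are made up for illustration and are not results from the paper.

```python
def top1_accuracy(predictions, labels):
    """Fraction of examples whose predicted class equals the true class."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def chance_level(num_unseen_classes):
    """Expected top-1 accuracy of uniform random guessing."""
    return 1.0 / num_unseen_classes


def text_helps(acc_with_text, acc_visual_only, num_unseen_classes):
    """True if the text-based model beats both chance and the visual-only baseline."""
    chance = chance_level(num_unseen_classes)
    return acc_with_text > chance and acc_with_text > acc_visual_only


# Toy predictions on four unseen-class test clips (illustrative only).
toy_accuracy = top1_accuracy([0, 1, 2, 2], [0, 1, 1, 2])
print(toy_accuracy, chance_level(50))
```

A real comparison would also need confidence intervals, since with few test clips per class small accuracy gaps can arise by chance.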

Figures

Figures reproduced from arXiv: 1907.10292 by Nazli Ikizler-Cinbis, Ramazan Gokberk Cinbis, Yunus Can Bilge.

**Figure 2.** Example sequences and corresponding textual descriptions from the ASL-Text …

**Figure 3.** t-SNE visualization of sign descriptions using BERT-[…]

**Figure 4.** Example predictions of our proposed model. The first four rows show examples …

Original abstract

We introduce the problem of zero-shot sign language recognition (ZSSLR), where the goal is to leverage models learned over the seen sign class examples to recognize the instances of unseen signs. To this end, we propose to utilize the readily available descriptions in sign language dictionaries as an intermediate-level semantic representation for knowledge transfer. We introduce a new benchmark dataset called ASL-Text that consists of 250 sign language classes and their accompanying textual descriptions. Compared to the ZSL datasets in other domains (such as object recognition), our dataset consists of limited number of training examples for a large number of classes, which imposes a significant challenge. We propose a framework that operates over the body and hand regions by means of 3D-CNNs, and models longer temporal relationships via bidirectional LSTMs. By leveraging the descriptive text embeddings along with these spatio-temporal representations within a zero-shot learning framework, we show that textual data can indeed be useful in uncovering sign languages. We anticipate that the introduced approach and the accompanying dataset will provide a basis for further exploration of this new zero-shot learning problem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The new ZSSLR task and the ASL-Text dataset are the main contributions, but the abstract gives no numbers, so the claim that text helps remains unverified.

The paper introduces zero-shot sign language recognition as a distinct problem and releases ASL-Text, a benchmark with 250 classes plus dictionary descriptions. That framing and the dataset look like the real additions relative to prior ZSL work on objects or actions. The visual side uses 3D-CNNs on body and hand regions plus biLSTMs, which fits the spatio-temporal nature of signs, and the idea of pulling semantic embeddings from readily available dictionary text is straightforward to implement. The authors also flag the practical difficulty of having few examples per class, which matches real sign-language data constraints. Those pieces are useful for anyone working on accessibility or extending ZSL to new domains.

The central claim that textual data uncovers sign languages rests on an untested assumption in the provided text: that dictionary descriptions align closely enough with visual sign semantics to support reliable transfer. Dictionary entries are often high-level and miss fine handshape or movement distinctions, while the visual features are low-level and detailed. Without any reported accuracy numbers, baselines, ablation on the text component, or error analysis, it is impossible to tell whether the method actually works or whether any gains come from spurious correlations. The dataset size is modest by design, which makes this verification step even more necessary.

This paper is mainly for researchers already active in zero-shot learning or sign-language recognition who need a new benchmark to test ideas on. It deserves a serious referee because the task formulation and data release are concrete and address a real gap, even if the experiments need to be shown and scrutinized before the conclusions can be accepted.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces the zero-shot sign language recognition (ZSSLR) problem, where models learned on seen sign classes are used to recognize unseen signs. It proposes leveraging textual descriptions from sign language dictionaries as intermediate semantic representations for knowledge transfer, introduces the ASL-Text benchmark dataset with 250 classes and accompanying text, and describes a visual pipeline using 3D-CNNs on body/hand regions plus bidirectional LSTMs for temporal modeling. The central claim is that combining these spatio-temporal visual features with text embeddings in a ZSL framework demonstrates the utility of textual data for uncovering sign languages.

Significance. If the experimental results support the claim, the work would be significant for defining a new ZSL task in sign language with the realistic constraint of limited examples per class, for releasing the ASL-Text dataset as a community benchmark, and for exploring dictionary text as a semantic bridge in a domain where visual data collection for new classes is costly. The choice of 3D-CNN + biLSTM for spatio-temporal features is a standard and appropriate architectural decision for the visual side.

major comments (2)

[Abstract] Abstract: The abstract asserts that the proposed framework shows textual data is useful ('we show that textual data can indeed be useful in uncovering sign languages'), but supplies no metrics, baselines, implementation details, or error analysis; the central claim cannot be verified from the available text.
[Abstract] Abstract / Proposed framework: The assumption that textual descriptions from sign language dictionaries provide a sufficiently aligned semantic representation for effective ZSL transfer is load-bearing, yet the abstract gives no indication of how the text embeddings are aligned to the 3D-CNN + biLSTM features or whether misalignment due to high-level dictionary text (vs. fine-grained visual details) was diagnosed.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their comments on our manuscript. We address each major comment below with references to the full paper where relevant.

Referee: [Abstract] Abstract: The abstract asserts that the proposed framework shows textual data is useful ('we show that textual data can indeed be useful in uncovering sign languages'), but supplies no metrics, baselines, implementation details, or error analysis; the central claim cannot be verified from the available text.

Authors: We agree that the abstract, constrained by length, states the outcome at a high level without quantitative support. The full manuscript reports the experimental results, including accuracy metrics on seen and unseen classes, comparisons against baselines that omit textual embeddings, and analysis of the limited-examples-per-class setting. To make the central claim more verifiable from the abstract itself, we will revise it to include a representative performance figure.
revision: yes

Referee: [Abstract] Abstract / Proposed framework: The assumption that textual descriptions from sign language dictionaries provide a sufficiently aligned semantic representation for effective ZSL transfer is load-bearing, yet the abstract gives no indication of how the text embeddings are aligned to the 3D-CNN + biLSTM features or whether misalignment due to high-level dictionary text (vs. fine-grained visual details) was diagnosed.

Authors: The alignment occurs by mapping the spatio-temporal visual features (3D-CNN on body/hand regions followed by biLSTM) into the text embedding space via a learned compatibility function inside the zero-shot framework, as described in the methods. The ASL-Text experiments demonstrate positive transfer to unseen signs, which serves as empirical evidence that dictionary text provides usable semantic bridging despite its higher-level character. The paper explicitly notes the challenge of fine-grained visual details versus dictionary text in the dataset and results sections. The abstract does not detail this mechanism due to space limits; we can add a brief clause on semantic embedding alignment if the editor permits.
revision: partial
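
The mapping described in this response can be illustrated with a small sketch. It assumes a bilinear compatibility score, score(x, c) = (W x) · t_c, where x is a visual feature, W a learned projection into the text embedding space, and t_c the text embedding of class c. That form is a common choice in zero-shot learning but is an assumption here; the text above does not state the exact function the paper uses. All names and numbers below are hypothetical toy values.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def predict_unseen(visual_feature, projection, class_text_embeddings):
    """Score a visual feature against each class's text embedding.

    The visual feature is projected into the text space, then compared
    to every candidate class by inner product (assumed bilinear form).
    Returns the best-scoring class name and the full score table.
    """
    projected = matvec(projection, visual_feature)
    scores = {
        name: dot(projected, embedding)
        for name, embedding in class_text_embeddings.items()
    }
    return max(scores, key=scores.get), scores


# Toy setup: 3-d visual features, 2-d text embeddings, two unseen classes.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 1.0]]
text_embeddings = {"sign_a": [1.0, 0.0], "sign_b": [0.0, 1.0]}
label, scores = predict_unseen([0.2, 0.5, 0.4], W, text_embeddings)
print(label, scores)
```

In a trained system, W would be fit on seen classes only, and the candidate set at test time would contain the unseen classes' dictionary-text embeddings.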

Circularity Check

0 steps flagged

No circularity: empirical ZSL framework with external dataset evaluation

The paper defines a new ZSSLR task, releases the ASL-Text dataset with 250 classes and dictionary text, extracts visual features via 3D-CNN + biLSTM, and applies standard ZSL with text embeddings. No equations, fitted parameters, or self-citations are shown that reduce any claimed prediction to an input by construction. The central result is an empirical demonstration on held-out unseen classes, which is externally falsifiable and does not rely on self-definitional mappings, uniqueness theorems from the same authors, or renaming of prior results. This is a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that dictionary text forms a usable semantic bridge; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption: Textual descriptions from sign language dictionaries provide an effective intermediate semantic representation for zero-shot visual transfer.
This premise is required for the knowledge transfer mechanism to function.


[1] Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. Label-embedding for attribute-based classification. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 819–826, 2013.

[2] Patrick Buehler, Andrew Zisserman, and Mark Everingham. Learning sign language by watching tv (using weakly aligned subtitles). In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 2961–2968. IEEE, 2009.

[3] Necati Cihan Camgoz, Simon Hadfield, Oscar Koller, and Richard Bowden. Subunets: End-to-end hand shape and continuous sign language recognition. In Proc. IEEE Int. Conf. on Computer Vision, pages 3075–3084. IEEE, 2017.

[4] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. OpenPose: realtime multi-person 2D pose estimation using Part Affinity Fields. In arXiv preprint arXiv:1812.08008, 2018.

[5] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 6299–6308, 2017.

[6] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. EMNLP, 2014.

[7] Necati Cihan Camgoz, Simon Hadfield, Oscar Koller, Hermann Ney, and Richard Bowden. Neural sign language translation. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 7784–7793, 2018.

[8] Elaine Costello. Random House Webster’s Concise American Sign Language Dictionary. Random House, 1999.

[9] Runpeng Cui, Hu Liu, and Changshui Zhang. Recurrent convolutional neural networks for continuous sign language recognition by staged optimization. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 7361–7369, 2017.

[10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[11] Mohamed Elhoseiny, Babak Saleh, and Ahmed Elgammal. Write a classifier: Zero-shot learning using purely textual descriptions. In Proc. IEEE Int. Conf. on Computer Vision, pages 2584–2591, 2013.

[12] Ali Farhadi and David Forsyth. Aligning asl for statistical translation using a discriminative word model. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., volume 2, pages 1471–1476. IEEE, 2006.

[13] Ali Farhadi, David Forsyth, and Ryan White. Transfer learning in sign language. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 1–8. IEEE, 2007.

[14] Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. Describing objects by their attributes. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 1778–1785. IEEE, 2009.

[15] Vittorio Ferrari and Andrew Zisserman. Learning visual attributes. In Proc. Adv. Neural Inf. Process. Syst., pages 433–440, 2008.

[16] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. Devise: A deep visual-semantic embedding model. In Proc. Adv. Neural Inf. Process. Syst., pages 2121–2129, 2013.

[17] Yanwei Fu, Timothy M Hospedales, Tao Xiang, Zhenyong Fu, and Shaogang Gong. Transductive multi-view embedding for zero-shot recognition and annotation. In Proc. European Conf. on Computer Vision, pages 584–599. Springer, 2014.

[18] Yanwei Fu, Timothy M Hospedales, Tao Xiang, and Shaogang Gong. Learning multimodal latent attributes. IEEE transactions on pattern analysis and machine intelligence, 36(2):303–316, 2014.

[19] Alex Graves and Jürgen Schmidhuber. Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks, 18(5-6):602–610, 2005.

[20] Kirsti Grobel and Marcell Assan. Isolated sign language recognition using hidden markov models. In IEEE International Conference on Systems, Man, and Cybernetics. Computational Cybernetics and Simulation, volume 1, pages 162–167. IEEE, 1997.

[21] Amirhossein Habibian, Thomas Mensink, and Cees GM Snoek. Video2vec embeddings recognize events when examples are scarce. IEEE Trans. Pattern Anal. Mach. Intell., 39(10):2089–2103, 2017.

[22] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

[23] Chung-Lin Huang and Wen-Yi Huang. Sign language recognition using model-based tracking and a 3d hopfield neural network. Machine vision and applications, 10(5-6):292–307, 1998.

[24] Jie Huang, Wengang Zhou, Houqiang Li, and Weiping Li. Sign language recognition using 3d convolutional neural networks. In IEEE Int. Conf. on Multimedia and Expo (ICME), pages 1–6. IEEE, 2015.

[25] Mihir Jain, Jan C van Gemert, Thomas Mensink, and Cees GM Snoek. Objects2action: Classifying and localizing actions without any video example. In Proc. IEEE Int. Conf. on Computer Vision, pages 4588–4596, 2015.

[26] Daniel Kelly, John Mc Donald, and Charles Markham. Weakly supervised training of a sign language recognition system using multiple instance learning density matrices. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 41(2):526–541, 2011.

[27] Elyor Kodirov, Tao Xiang, and Shaogang Gong. Semantic autoencoder for zero-shot learning. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 3174–3183, 2017.

[28] Oscar Koller, Jens Forster, and Hermann Ney. Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers. Comput. Vis. Image Understand., 141:108–125, 2015.

[29] Oscar Koller, O Zargaran, Hermann Ney, and Richard Bowden. Deep sign: hybrid cnn-hmm for continuous sign language recognition. In British Machine Vision Conference, 2016.

[30] Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 951–958. IEEE, 2009.

[31] Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):453–465, 2014.

[32] Jimmy Lei Ba, Kevin Swersky, Sanja Fidler, et al. Predicting deep zero-shot convolutional neural networks using textual descriptions. In Proc. IEEE Int. Conf. on Computer Vision, pages 4247–4255, 2015.

[33] Jingen Liu, Benjamin Kuipers, and Silvio Savarese. Recognizing human actions by attributes. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 3337–3344. IEEE, 2011.

[34] Naveen Madapana and Juan P Wachs. Hard zero shot learning for gesture recognition. In IAPR International Conference on Pattern Recognition, pages 3574–3579. IEEE, 2018.

[35] Thomas Mensink, Efstratios Gavves, and Cees GM Snoek. Costa: Co-occurrence statistics for zero-shot classification. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 2441–2448, 2014.

[36] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Proc. Adv. Neural Inf. Process. Syst., pages 3111–3119, 2013.

[37] Pavlo Molchanov, Xiaodong Yang, Shalini Gupta, Kihwan Kim, Stephen Tyree, and Jan Kautz. Online detection and classification of dynamic hand gestures with recurrent 3d convolutional neural network. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 4207–4215, 2016.

[38] Pradyumna Narayana, Ross Beveridge, and Bruce A Draper. Gesture recognition: Focus on the hands. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 5235–5244, 2018.

[39] Sunita Nayak, Sudeep Sarkar, and Barbara Loeding. Automated extraction of signs from continuous sign language sentences using iterated conditional modes. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 2583–2590. IEEE, 2009.

[40] Carol Neidle, Ashwin Thangali, and Stan Sclaroff. Challenges in development of the american sign language lexicon video dataset (asllvd) corpus. In Proc. 5th Workshop on the Representation and Processing of Sign Languages: Interactions between Corpus and Lexicon, Language Resources and Evaluation Conference (LREC) 2012, 2012.

[41] Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg S Corrado, and Jeffrey Dean. Zero-shot learning by convex combination of semantic embeddings. Proc. Int. Conf. Learn. Represent., 2014.

[42] Devi Parikh and Kristen Grauman. Relative attributes. In Proc. IEEE Int. Conf. on Computer Vision, pages 503–510. IEEE, 2011.

[43] Genevieve Patterson and James Hays. Sun attribute database: Discovering, annotating, and recognizing scene attributes. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 2751–2758. IEEE, 2012.

[44] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proc. of conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.

[45] Tomas Pfister, James Charles, and Andrew Zisserman. Large-scale learning of sign language by watching tv (using co-occurrences). In British Machine Vision Conference, 2013.

[46] Tomas Pfister, James Charles, and Andrew Zisserman. Domain-adaptive discriminative one-shot learning of gestures. In Proc. European Conf. on Computer Vision, pages 814–829. Springer, 2014.

[47] Lionel Pigou, Mieke Van Herreweghe, and Joni Dambre. Sign classification in sign language corpora with deep neural networks. In International Conference on Language Resources and Evaluation (LREC) Workshop, pages 175–178, 2016.

[48] Jie Qin, Li Liu, Ling Shao, Fumin Shen, Bingbing Ni, Jiaxin Chen, and Yunhong Wang. Zero-shot action recognition with error-correcting output codes. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 2833–2842, 2017.

[49] Marcus Rohrbach, Michael Stark, and Bernt Schiele. Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 1641–1648. IEEE, 2011.

[50] Bernardino Romera-Paredes and Philip Torr. An embarrassingly simple approach to zero-shot learning. In Proc. Int. Conf. Mach. Learn., pages 2152–2161, 2015.

[51] Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. Zero-shot learning through cross-modal transfer. In Proc. Adv. Neural Inf. Process. Syst., pages 935–943, 2013.

[52] William C Stokoe Jr. Sign language structure: An outline of the visual communication systems of the american deaf. Journal of deaf studies and deaf education, 10(1):3–37, 2005.

[53] Stephanie Stoll, Necati Cihan Camgoz, Simon Hadfield, and Richard Bowden. Sign language production using neural machine translation and generative adversarial networks. In British Machine Vision Conference. British Machine Vision Association, 2018.

[54] Gencer Sumbul, Ramazan Gokberk Cinbis, and Selim Aksoy. Fine-grained object recognition and zero-shot learning in remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing, 56(2):770–779, 2018.

[55] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 1–9, 2015.

[56] Shinichi Tamura and Shingo Kawasaki. Recognition of sign language motion images. Pattern recognition, 21(4):343–353, 1988.

[57] Wil Thomason and Ross A Knepper. Recognizing unfamiliar gestures for human-robot interaction through zero-shot learning. In International Symposium on Experimental Robotics, pages 841–852. Springer, 2016.

[58] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proc. Adv. Neural Inf. Process. Syst., pages 5998–6008, 2017.

[59] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011.

[60] Hanjie Wang, Xiujuan Chai, Xiaopeng Hong, Guoying Zhao, and Xilin Chen. Isolated sign language recognition with grassmann covariance matrices. ACM Transactions on Accessible Computing (TACCESS), 8(4):14, 2016.

[61] Qian Wang and Ke Chen. Alternative semantic representations for zero-shot human action recognition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 87–102. Springer, 2017.

[62] Jason Weston, Samy Bengio, and Nicolas Usunier. Large scale image annotation: learning to rank with joint word-image embeddings. Machine learning, 81(1):21–35, 2010.

[63] Ying Wu and Thomas S Huang. Vision-based gesture recognition: A review. In International Gesture Workshop, pages 103–115. Springer, 1999.

[64] Yongqin Xian, Bernt Schiele, and Zeynep Akata. Zero-shot learning-the good, the bad and the ugly. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 4582–4591, 2017.

[65] Xun Xu, Timothy M. Hospedales, and Shaogang Gong. Semantic embedding space for zero-shot action recognition. 2015 IEEE International Conference on Image Processing (ICIP), pages 63–67, 2015.

[66] Xun Xu, Timothy Hospedales, and Shaogang Gong. Transductive zero-shot action recognition by word-vector embedding. International Journal of Computer Vision, 123(3):309–333, 2017.

[67] Yi Zhu, Yang Long, Yu Guan, Shawn Newsam, and Ling Shao. Towards universal representation for unseen action recognition. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 9436–9445, 2018.