arxiv: 1710.03255 · v2 · pith:IENBVFIPnew · submitted 2017-10-09 · 💻 cs.CL · cs.CV

Multitask training with unlabeled data for end-to-end sign language fingerspelling recognition

classification 💻 cs.CL cs.CV
keywords fingerspelling, model, recognition, data, frame-level, labels, training, feature

We address the problem of automatic American Sign Language fingerspelling recognition from video. Prior work has largely relied on frame-level labels, hand-crafted features, or other constraints, and has been hampered by the scarcity of data for this task. We introduce a model for fingerspelling recognition that addresses these issues. The model consists of an auto-encoder-based feature extractor and an attention-based neural encoder-decoder, which are trained jointly. The model receives a sequence of image frames and outputs the fingerspelled word, without relying on any frame-level training labels or hand-crafted features. In addition, the auto-encoder subcomponent makes it possible to leverage unlabeled data to improve feature learning. The model improves absolute letter accuracy by 11.6% in signer-independent and 4.4% in signer-adapted fingerspelling recognition over previous approaches that required frame-level training labels.
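
The joint training described in the abstract can be illustrated with a minimal sketch of a multitask objective. This is not the paper's implementation: the function names, the mean-squared-error reconstruction term, and the `recon_weight` parameter are assumptions made for illustration. The only idea taken from the abstract is that the auto-encoder and the recognizer are trained together, and that videos without word labels can still contribute through the auto-encoder.

```python
from typing import Optional, Sequence


def reconstruction_loss(frame: Sequence[float], reconstruction: Sequence[float]) -> float:
    """Mean squared error between a frame and its auto-encoder reconstruction.

    MSE is an assumed choice; the abstract does not specify the loss.
    """
    return sum((x - y) ** 2 for x, y in zip(frame, reconstruction)) / len(frame)


def joint_loss(
    frames: Sequence[Sequence[float]],
    reconstructions: Sequence[Sequence[float]],
    letter_nll: Optional[float],
    recon_weight: float = 1.0,  # hypothetical trade-off weight
) -> float:
    """Combine recognition and reconstruction objectives for one video.

    letter_nll is the decoder's negative log-likelihood of the target
    letter sequence, or None when the video has no word label.
    """
    # Average the per-frame reconstruction error over the video.
    recon = sum(
        reconstruction_loss(f, r) for f, r in zip(frames, reconstructions)
    ) / len(frames)
    if letter_nll is None:
        # Unlabeled video: only the auto-encoder term contributes.
        return recon_weight * recon
    # Labeled video: both tasks contribute to the same objective.
    return letter_nll + recon_weight * recon
```

With this shape, labeled and unlabeled videos can share one training loop: a labeled video passes its sequence-level loss as `letter_nll`, while an unlabeled one passes `None` and updates only the feature extractor.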
