
arxiv: 1904.02147 · v1 · pith:TVC5IAIRnew · submitted 2019-03-31 · eess.AS · cs.LG · cs.SD

Learning Shared Encoding Representation for End-to-End Speech Recognition Models

classification eess.AS · cs.LG · cs.SD
keywords encoding · models · multi-task · representation · training · attention-based · encoder · end-to-end

In this work, we learn a shared encoding representation for a multi-task neural network model optimized with both connectionist temporal classification (CTC) and conventional framewise cross-entropy training criteria. Our experiments show that multi-task training not only addresses the difficulty of optimizing CTC models such as acoustic-to-word models but also yields a significant improvement over plain-task training with an optimal setup. Furthermore, we propose using the encoding representation learned by the multi-task network to initialize the encoder of attention-based models. In this way, we train a deep attention-based end-to-end model with a 10-layer long short-term memory (LSTM) encoder, which achieves word error rates of 12.2% and 22.6% on the Switchboard and CallHome subsets of the Hub5 2000 evaluation.
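As a rough illustration of the multi-task criterion described above, the sketch below combines a CTC loss and a framewise cross-entropy loss for one utterance in plain Python. The abstract does not say how the two criteria are weighted, so the linear interpolation and the `ctc_weight` parameter are assumptions, as are all function names. In the paper's setting both output heads would sit on top of the shared encoder; here the per-frame log-probabilities of each head are simply taken as given.

```python
import math

NEG_INF = float("-inf")

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))

def ctc_loss(log_probs, labels, blank=0):
    """CTC negative log-likelihood via the forward algorithm.

    log_probs[t][k] is the log-probability of symbol k at frame t.
    """
    # Extended label sequence: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for lab in labels:
        ext += [lab, blank]
    size = len(ext)
    alpha = [NEG_INF] * size
    alpha[0] = log_probs[0][blank]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, len(log_probs)):
        new = [NEG_INF] * size
        for s in range(size):
            cands = [alpha[s]]
            if s >= 1:
                cands.append(alpha[s - 1])
            # Skipping a blank is allowed only between different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])
            total = logsumexp(cands)
            if total != NEG_INF:
                new[s] = total + log_probs[t][ext[s]]
        alpha = new
    # Valid paths end in the last label or the trailing blank.
    return -logsumexp(alpha[-2:])

def framewise_cross_entropy(log_probs, alignment):
    """Mean negative log-probability of the aligned target at each frame."""
    picked = [log_probs[t][target] for t, target in enumerate(alignment)]
    return -sum(picked) / len(picked)

def multitask_loss(ctc_log_probs, ce_log_probs, labels, alignment,
                   ctc_weight=0.5):
    """Assumed combination: linear interpolation of the two criteria."""
    ctc = ctc_loss(ctc_log_probs, labels)
    ce = framewise_cross_entropy(ce_log_probs, alignment)
    return ctc_weight * ctc + (1.0 - ctc_weight) * ce

# Toy utterance: 2 frames, 2 symbols (0 = blank), uniform predictions.
half = math.log(0.5)
ctc_head = [[half, half], [half, half]]
ce_head = [[half, half], [half, half]]
loss = multitask_loss(ctc_head, ce_head, labels=[1], alignment=[0, 1])
```

In a real system the gradients of this combined loss would flow back through both heads into the same encoder, which is what makes the learned representation shared.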
