Towards better decoding and language model integration in sequence to sequence models
The recently proposed Sequence-to-Sequence (seq2seq) framework advocates replacing complex data processing pipelines, such as an entire automatic speech recognition system, with a single neural network trained in an end-to-end fashion. In this contribution, we analyse an attention-based seq2seq speech recognition system that directly transcribes recordings into characters. We observe two shortcomings: overconfidence in its predictions and a tendency to produce incomplete transcriptions when language models are used. We propose practical solutions to both problems, achieving competitive speaker-independent word error rates (WER) on the Wall Street Journal dataset: without separate language models we reach 10.6% WER, while together with a trigram language model we reach 6.7% WER.
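The results above are reported as word error rate. For readers unfamiliar with the metric, the following is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, and insertions) between a hypothesis and a reference transcript, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not the scoring tool used by the authors.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization is a plain whitespace split, with no
    text normalization of the kind real scoring pipelines usually apply.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Under this definition, a truncated transcription is penalized through deletions, one per missing reference word, which is how the incomplete transcriptions mentioned in the abstract would show up in the score.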
Forward citations
Cited by 4 Pith papers
- Response Time Enhances Alignment with Heterogeneous Preferences
  Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
- Non-Intrusive Automatic Speech Recognition Refinement: A Survey
  A survey that classifies non-intrusive ASR refinement methods into five categories, reviews domain adaptation and evaluation datasets, proposes standardized metrics, and identifies future research directions.
- Listen, Attend, Spell and Adapt: Speaker Adapted Sequence-to-Sequence ASR
  KLD-based speaker adaptation of seq2seq ASR achieves 25% relative WER reduction, outperforming the 18.7% gain from conventional acoustic model adaptation.
- Integration of TensorFlow based Acoustic Model with Kaldi WFST Decoder
  TensorFlow acoustic models are integrated with Kaldi WFST decoder and achieve equivalent word error rates to native Kaldi models on RM, WSJ, and LibriSpeech.