Direct Acoustics-to-Word Models for English Conversational Speech Recognition

2017 · cs.CL · arXiv 1703.07754

2 Pith papers cite this work. Polarity classification is still indexing.


abstract

Recent work on end-to-end automatic speech recognition (ASR) has shown that the connectionist temporal classification (CTC) loss can be used to convert acoustics to phone or character sequences. Such systems are used with a dictionary and a separately trained language model (LM) to produce word sequences. However, they are not truly end-to-end in the sense of mapping acoustics directly to words without an intermediate phone representation. In this paper, we present the first results employing direct acoustics-to-word CTC models on two well-known public benchmark tasks: Switchboard and CallHome. These models do not require an LM or even a decoder at run-time and hence recognize speech with minimal complexity. However, due to the large number of word output units, CTC word models require orders of magnitude more data to train reliably compared to traditional systems. We present some techniques to mitigate this issue. Our CTC word model achieves a word error rate of 13.0%/18.8% on the Hub5-2000 Switchboard/CallHome test sets without any LM or decoder, compared with 9.6%/16.0% for phone-based CTC with a 4-gram LM. We also present rescoring results on CTC word model lattices to quantify the performance benefits of an LM, and contrast the performance of word and phone CTC models.
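
To illustrate what recognition "without any LM or decoder" can look like for a word-level CTC model, here is a minimal sketch of greedy (best-path) CTC decoding: take the most likely output unit per frame, merge consecutive repeats, and drop the blank symbol. This is a generic illustration, not the authors' implementation. The function name `ctc_greedy_decode`, the blank index, the three-entry vocabulary, and the posterior values are all assumptions made up for the example.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical choice)


def ctc_greedy_decode(frame_posteriors, vocab):
    """Best-path CTC decoding over per-frame posteriors.

    frame_posteriors: list of per-frame score lists, one score per output unit.
    vocab: list mapping output-unit index to a word (index BLANK is the blank).
    """
    # 1. Pick the highest-scoring output unit in every frame.
    best_path = [max(range(len(frame)), key=frame.__getitem__)
                 for frame in frame_posteriors]
    # 2. Merge consecutive repeats of the same unit.
    collapsed = [idx for idx, _ in groupby(best_path)]
    # 3. Remove blanks and map the remaining indices to words.
    return [vocab[idx] for idx in collapsed if idx != BLANK]


# Toy vocabulary and posteriors, invented for illustration only.
vocab = ["<blank>", "hello", "world"]
posteriors = [
    [0.10, 0.80, 0.10],  # "hello"
    [0.20, 0.70, 0.10],  # "hello" again (merged with the previous frame)
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.80],  # "world"
    [0.60, 0.20, 0.20],  # blank
]
words = ctc_greedy_decode(posteriors, vocab)
print(words)
```

Because the output units are whole words, the collapsed sequence is already the transcript; a phone- or character-based CTC system would still need a dictionary and LM to turn its units into words, as the abstract notes.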

representative citing papers

Gated Embeddings in End-to-End Speech Recognition for Conversational-Context Fusion

cs.CL · 2019-06-27 · unverdicted · novelty 6.0

Gated fusion of fastText and BERT embeddings into an end-to-end ASR model captures multi-sentence conversational context and lowers word error rate on the Switchboard corpus.

LipReading with 3D-2D-CNN BLSTM-HMM and word-CTC models

cs.CV · 2019-06-25 · unverdicted · novelty 4.0

3D-2D-CNN-BLSTM with word-CTC reaches 1.3% WER on GRID seen-speaker lipreading (55% relative gain over LCANet) and 8.6% on unseen speakers (24.5% gain over LipNet).

citing papers explorer

Showing 2 of 2 citing papers.

Gated Embeddings in End-to-End Speech Recognition for Conversational-Context Fusion

cs.CL · 2019-06-27 · unverdicted · none · ref 4 · internal anchor

LipReading with 3D-2D-CNN BLSTM-HMM and word-CTC models

cs.CV · 2019-06-25 · unverdicted · none · ref 18 · internal anchor
