
arxiv: 1705.09724 · v1 · pith:FFNFWROJnew · submitted 2017-05-26 · 💻 cs.CL

Semi-Supervised Model Training for Unbounded Conversational Speech Recognition

classification 💻 cs.CL
keywords speech, conversational, training, model, recognition, utterances, acoustic, audio

For conversational large-vocabulary continuous speech recognition (LVCSR) tasks, up to about two thousand hours of audio is commonly used to train state-of-the-art models. Collecting labeled conversational audio, however, is prohibitively expensive, laborious, and error-prone. Furthermore, academic corpora like Fisher English (2004) or Switchboard (1992) are inadequate for training models with sufficient accuracy in the unbounded space of conversational speech. These corpora are also timeworn, owing to dated acoustic telephony features and the rapid evolution of colloquial vocabulary and idiomatic speech over the last decades. Using the colossal scale of our unlabeled telephony dataset, we propose a technique to construct a modern, high-quality conversational speech training corpus on the order of hundreds of millions of utterances (or tens of thousands of hours) for both acoustic and language model training. We describe the data collection, selection, and training, and evaluate our updated speech recognition system on a test corpus of 7K manually transcribed utterances. We show relative word error rate (WER) reductions of 35% and 19% on agent and caller utterances, respectively, over our seed model, and a 5% absolute WER improvement over IBM Watson STT on this conversational speech task.
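The abstract quotes both relative and absolute WER changes, which are easy to conflate. The sketch below is a minimal illustration, not the authors' evaluation code: it computes WER as word-level edit distance divided by reference length, then shows how the two kinds of reduction differ. The function names and the sample sentences and rates are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline error that was removed."""
    return (baseline_wer - new_wer) / baseline_wer


def absolute_reduction(baseline_wer: float, new_wer: float) -> float:
    """Difference in WER, expressed in the same units as the inputs."""
    return baseline_wer - new_wer


if __name__ == "__main__":
    # Hypothetical transcripts, for illustration only.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {wer:.3f}")

    # Hypothetical rates: the same change reads very differently
    # depending on whether it is reported as relative or absolute.
    baseline, improved = 0.20, 0.13
    print(f"relative: {relative_reduction(baseline, improved):.2%}")
    print(f"absolute: {absolute_reduction(baseline, improved):.2%}")
```

With these made-up rates, a drop from 20% to 13% WER is a 7-point absolute reduction but a 35% relative reduction, which is why the two figures in the abstract are not directly comparable.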


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Lattice-Based Unsupervised Test-Time Adaptation of Neural Network Acoustic Models

    cs.CL 2019-06 unverdicted novelty 6.0

    Lattice-based discriminative adaptation integrated into LF-MMI enables robust unsupervised test-time adaptation of neural acoustic models on mismatched speech data.