End-to-End Speech Recognition with High-Frame-Rate Features Extraction

2019 · eess.AS · arXiv 1907.01957

abstract

State-of-the-art end-to-end automatic speech recognition (ASR) extracts acoustic features from the input speech signal every 10 ms, which corresponds to a frame rate of 100 frames/second. In this report, we investigate the use of high-frame-rate feature extraction in end-to-end ASR. High frame rates of 200 and 400 frames/second are used in feature extraction and provide additional information for end-to-end ASR. The effectiveness of high-frame-rate feature extraction is evaluated independently and in combination with speed-perturbation-based data augmentation. Experiments performed on two speech corpora, Wall Street Journal (WSJ) and CHiME-5, show that high-frame-rate feature extraction improves the performance of end-to-end ASR, both independently and in combination with speed perturbation. On the WSJ corpus, the relative reductions in word error rate (WER) yielded by high-frame-rate feature extraction independently and in combination with speed perturbation are up to 21.3% and 24.1%, respectively. On the CHiME-5 corpus, the corresponding relative WER reductions are up to 2.8% and 7.9%, respectively, on the test data recorded by microphone arrays and up to 11.8% and 21.2%, respectively, on the test data recorded by binaural microphones.
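The arithmetic behind the frame rates and the relative WER reductions quoted in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the 25 ms analysis window is an assumed common default that the abstract does not state, and the WER values in the usage note are made-up numbers rather than results from the paper.

```python
def frame_shift_ms(frame_rate_fps: float) -> float:
    """Frame shift (hop size) in milliseconds for a given frame rate."""
    return 1000.0 / frame_rate_fps


def num_frames(duration_s: float, frame_rate_fps: float,
               window_ms: float = 25.0) -> int:
    """Count the full analysis windows that fit in an utterance (no padding).

    window_ms = 25.0 is an assumption; the abstract gives only the frame rate.
    """
    duration_ms = duration_s * 1000.0
    if duration_ms < window_ms:
        return 0
    shift = frame_shift_ms(frame_rate_fps)
    return 1 + int((duration_ms - window_ms) // shift)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent, measured against the baseline WER."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer
```

Under these assumptions, frame rates of 100, 200, and 400 frames/second correspond to frame shifts of 10 ms, 5 ms, and 2.5 ms, so a 1-second utterance yields 98, 196, and 391 frames, respectively. As a purely hypothetical example, dropping from a baseline WER of 10.0% to 8.0% is a relative reduction of 20%.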

representative citing papers

End-to-End Speech Recognition with High-Frame-Rate Features Extraction

eess.AS · 2019-07-03 · unverdicted · novelty 4.0

High-frame-rate feature extraction at 200 and 400 frames/second improves end-to-end ASR word error rates on WSJ and CHiME-5, with relative reductions of up to 24.1% on WSJ when combined with speed perturbation.
