End-to-End Speech Recognition with High-Frame-Rate Features Extraction

Cong-Thanh Do

arxiv: 1907.01957 · v2 · pith:UIMVMKQXnew · submitted 2019-07-03 · eess.AS · cs.CL · cs.SD

Pith reviewed 2026-05-25 09:28 UTC · model grok-4.3

keywords: end-to-end ASR · high frame rate features · acoustic features · word error rate reduction · speech recognition · speed perturbation · WSJ · CHiME-5

The pith

High frame rates of 200 and 400 per second in feature extraction improve end-to-end speech recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates the effect of extracting acoustic features at higher-than-standard frame rates for end-to-end automatic speech recognition (ASR). Conventional systems use a frame rate of 100 frames per second; the author tests 200 and 400 frames per second to supply additional information to the model. The approach is tested both by itself and together with speed perturbation data augmentation on the WSJ and CHiME-5 speech corpora. The reported relative reductions in word error rate (WER) reach 21.3 percent on WSJ with high-frame-rate features alone, and 21.2 percent on CHiME-5 binaural microphone data when the features are combined with speed perturbation. A sympathetic reader would care whether this simple adjustment to the input representation consistently lowers error rates across different datasets.

Core claim

High frame rates of 200 and 400 frames per second in the features extraction provide additional information for end-to-end ASR and yield improved performance both independently and in combination with speed perturbation, with relative WER reductions of up to 21.3% and 24.1% respectively on the WSJ corpus and up to 11.8% and 21.2% respectively on the CHiME-5 binaural test data.
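The percentages in the core claim are relative reductions, so they only become interpretable next to absolute WERs. A minimal sketch of the arithmetic follows; the function name `relative_wer_reduction` and the example WER values are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical absolute WERs, chosen only to illustrate the arithmetic.
print(relative_wer_reduction(10.0, 8.0))  # 20.0
```

The same relative figure can correspond to very different absolute gains: a drop from 10.0 to 8.0 and a drop from 50.0 to 40.0 are both 20 percent relative.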

What carries the argument

High-frame-rate features extraction at rates of 200 and 400 frames per second, which supplies additional temporal information to the end-to-end model.

If this is right

High-frame-rate features extraction improves end-to-end ASR performance independently.
Combining high-frame-rate features with speed perturbation yields further word error rate reductions.
The improvements hold on both the WSJ corpus and the CHiME-5 corpus.
Relative WER reductions reach up to 24.1% on WSJ when both techniques are used.
Larger gains appear on binaural microphone data than on microphone array data in CHiME-5.
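Speed perturbation, the augmentation that the high-frame-rate features are combined with, changes the playback speed of the training audio by resampling it. The sketch below is only an illustration: the factors 0.9, 1.0 and 1.1 are a common choice in the augmentation literature and are an assumption here, linear interpolation stands in for a proper resampler, and the name `speed_perturb` is hypothetical.

```python
import numpy as np


def speed_perturb(signal: np.ndarray, factor: float) -> np.ndarray:
    """Resample a 1-D signal so that it plays `factor` times faster.

    Linear interpolation is used for brevity; it is an assumption of this
    sketch, not the resampler used in the paper.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_out = int(round(len(signal) / factor))
    # Read the original signal at positions spaced `factor` samples apart.
    positions = np.arange(n_out) * factor
    return np.interp(positions, np.arange(len(signal)), signal)


# One second of a synthetic 16 kHz signal (hypothetical input).
x = np.sin(np.linspace(0.0, 100.0, 16000))
for f in (0.9, 1.0, 1.1):  # assumed perturbation factors
    print(f, len(speed_perturb(x, f)))
```

A factor below 1 lengthens the utterance and a factor above 1 shortens it, so the augmented copies differ in both duration and spectral content.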

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The extra frames may allow the model to capture finer acoustic details without needing changes to the neural network architecture.
This preprocessing change could be applied to other end-to-end ASR systems to achieve similar gains.
Further work might explore even higher frame rates or adaptive frame rates based on the input signal.
Such an approach might help in low-resource settings by maximizing information from limited data.

Load-bearing premise

The frames extracted at higher rates supply useful non-redundant information that the end-to-end model can use effectively without adding too much noise or requiring major model adjustments.

What would settle it

Training and evaluating an end-to-end ASR model with features at 100 frames per second versus 200 or 400 frames per second on the WSJ test set and finding no reduction or an increase in word error rate would falsify the claim.

Figures

Figures reproduced from arXiv: 1907.01957 by Cong-Thanh Do.

**Figure 1.** Unnormalized feature matrices extracted from a speech utterance (a) in the WSJ corpus at frame rates of 100 (b), 200 (c), and 400 (d) frames/second, respectively. Utterance-level mean normalization is applied on the features prior to training and test.

**Figure 2.** Hybrid CTC/attention architecture [6, 2] of the end-to-end ASR systems used in this report. The shared encoder could include either the pBLSTM or the VGG net + pBLSTM. We use a 6-layer CNN architecture which consists of two consecutive 2D convolutional layers followed by one 2D max-pooling layer, then another two 2D convolutional layers followed by one 2D max-pooling layer. The 2D filters used in …
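The utterance-level mean normalization mentioned in the Figure 1 caption can be sketched as follows. This is a minimal illustration assuming features stored as a `(num_frames, feat_dim)` array; the function name and the array sizes are hypothetical.

```python
import numpy as np


def utterance_mean_normalize(feats: np.ndarray) -> np.ndarray:
    """Subtract the per-dimension mean computed over one utterance.

    feats has shape (num_frames, feat_dim); the mean is taken over frames.
    """
    return feats - feats.mean(axis=0, keepdims=True)


# Hypothetical feature matrix: 196 frames, 40 feature dimensions.
rng = np.random.default_rng(0)
feats = rng.normal(loc=3.0, scale=1.0, size=(196, 40))
normalized = utterance_mean_normalize(feats)
print(normalized.shape)
```

Because the mean is computed per utterance, the operation is unchanged by the frame rate: more frames per second only means more rows contribute to the same per-dimension average.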

Original abstract

State-of-the-art end-to-end automatic speech recognition (ASR) extracts acoustic features from input speech signal every 10 ms which corresponds to a frame rate of 100 frames/second. In this report, we investigate the use of high-frame-rate features extraction in end-to-end ASR. High frame rates of 200 and 400 frames/second are used in the features extraction and provide additional information for end-to-end ASR. The effectiveness of high-frame-rate features extraction is evaluated independently and in combination with speed perturbation based data augmentation. Experiments performed on two speech corpora, Wall Street Journal (WSJ) and CHiME-5, show that using high-frame-rate features extraction yields improved performance for end-to-end ASR, both independently and in combination with speed perturbation. On WSJ corpus, the relative reduction of word error rate (WER) yielded by high-frame-rate features extraction independently and in combination with speed perturbation are up to 21.3% and 24.1%, respectively. On CHiME-5 corpus, the corresponding relative WER reductions are up to 2.8% and 7.9%, respectively, on the test data recorded by microphone arrays and up to 11.8% and 21.2%, respectively, on the test data recorded by binaural microphones.
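The relation between frame rate and hop size stated in the abstract (100 frames/second corresponds to one feature vector every 10 ms) can be made concrete with a short sketch. The 16 kHz sampling rate and 25 ms analysis window are assumptions for illustration and are not taken from the text above; `hop_ms` and `num_frames` are hypothetical helper names.

```python
def hop_ms(frame_rate: int) -> float:
    """Hop size in milliseconds for a given frame rate in frames/second."""
    return 1000.0 / frame_rate


def num_frames(num_samples: int, sample_rate: int, frame_rate: int,
               window_ms: float = 25.0) -> int:
    """Number of full analysis windows that fit into the signal.

    window_ms = 25.0 is an assumed, commonly used window length.
    """
    hop = int(round(sample_rate / frame_rate))
    win = int(round(sample_rate * window_ms / 1000.0))
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // hop


# One second of audio at an assumed 16 kHz sampling rate.
for rate in (100, 200, 400):
    print(rate, hop_ms(rate), num_frames(16000, 16000, rate))
# 100 10.0 98
# 200 5.0 196
# 400 2.5 391
```

Doubling the frame rate roughly doubles the input sequence length seen by the encoder, which is one of the correlated changes the referee report asks the experiments to control for.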

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

The abstract reports sizable WER drops from 200/400 fps features in E2E ASR, but supplies no controls or details to show the gains come from the extra temporal resolution rather than side effects.

The paper's central result is that raising the frame rate from the usual 100 fps to 200 or 400 fps during feature extraction improves end-to-end ASR word error rates on both WSJ and CHiME-5. On WSJ the relative gain reaches 21.3 % for the frame-rate change alone and 24.1 % when it is combined with speed perturbation. The work is an empirical check of an idea related to the variable frame rate analysis explored in earlier HMM-GMM systems, now tried inside an E2E pipeline on two standard corpora. That combination with augmentation is the only clear incremental step beyond prior practice. The experiments are run independently and jointly, which at least lets a reader see whether the two changes interact. The numbers are given separately for different microphone conditions on CHiME-5, which adds a small amount of granularity.

Beyond that, the abstract contains almost no usable information. It states the frame rates and the WER reductions but gives no model architecture, no description of how the higher-rate features are actually computed or normalized, no mention of how sequence lengths or padding are handled, and no error bars or statistical tests. The stress-test concern is therefore on point: nothing in the reported setup isolates whether the model is actually using the finer time resolution or simply benefiting from altered feature statistics or training dynamics. Without an ablation that keeps everything else fixed, the 21 % relative gain on WSJ remains hard to interpret.

This is the sort of short note that might interest people already running E2E ASR experiments and looking for quick feature tweaks. A reader who wants reproducible ideas will find the current text too thin to act on. The work is coherent on its own terms and reports concrete numbers rather than fitted claims, so it clears the bar for serious refereeing if the full manuscript supplies the missing methods and controls.

Referee Report

3 major / 1 minor

Summary. The paper claims that extracting acoustic features at high frame rates of 200 and 400 frames per second (instead of the conventional 100 fps) supplies additional useful information to end-to-end ASR models. Experiments on WSJ and CHiME-5 show relative WER reductions of up to 21.3% (independent) and 24.1% (with speed perturbation) on WSJ, and up to 11.8% and 21.2% on CHiME-5 binaural data.

Significance. If the central empirical claim holds after proper controls and documentation, the result would be significant: it would indicate that the long-standing 10 ms frame-rate convention in ASR discards exploitable temporal detail and that high-frame-rate features can be used directly in E2E systems without architectural overhaul, yielding large gains on a clean corpus such as WSJ.

major comments (3)

[Abstract] Abstract: the reported relative WER reductions (21.3 %, 24.1 %, etc.) are presented without absolute baseline and proposed WER values, without error bars, and without any statistical test, so the magnitude and reliability of the claimed improvements cannot be assessed.
[Experiments] Experiments section (or wherever results are described): no ablation isolates frame rate from correlated changes in sequence length, feature statistics, or training dynamics; therefore the central assumption that the E2E model exploits the extra temporal resolution rather than incidental effects remains untested.
[Abstract] Abstract and methods: the manuscript supplies no description of the E2E model architecture, the precise feature type (MFCC, filterbank, etc.), the training procedure, or how variable-length high-frame-rate sequences are handled, rendering the experimental outcomes unverifiable.

minor comments (1)

[Abstract] The phrase 'features extraction' appears repeatedly; standard terminology is 'feature extraction'.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly to improve clarity and verifiability.

Referee: [Abstract] Abstract: the reported relative WER reductions (21.3 %, 24.1 %, etc.) are presented without absolute baseline and proposed WER values, without error bars, and without any statistical test, so the magnitude and reliability of the claimed improvements cannot be assessed.

Authors: We agree that absolute WER values provide essential context. In the revised manuscript we will report the absolute baseline and improved WER numbers in both the abstract and the results tables. Our runs used fixed seeds and the relative gains were consistent across frame rates; we will add a brief note on this limitation rather than new statistical tests, as recomputing with multiple seeds is outside the scope of a revision. revision: yes
Referee: [Experiments] Experiments section (or wherever results are described): no ablation isolates frame rate from correlated changes in sequence length, feature statistics, or training dynamics; therefore the central assumption that the E2E model exploits the extra temporal resolution rather than incidental effects remains untested.

Authors: All other experimental factors (model architecture, optimizer, data augmentation schedule, loss) were held constant; the sole change was the frame rate used during feature extraction. We acknowledge that an explicit control (e.g., down-sampling 400 fps features back to 100 fps) would further isolate the contribution of temporal resolution. We will add a paragraph discussing this point and note that such an ablation is a natural follow-up experiment. revision: partial
Referee: [Abstract] Abstract and methods: the manuscript supplies no description of the E2E model architecture, the precise feature type (MFCC, filterbank, etc.), the training procedure, or how variable-length high-frame-rate sequences are handled, rendering the experimental outcomes unverifiable.

Authors: The full manuscript already contains these details in its methods and experiments sections (FBANK+pitch features, hybrid CTC/attention architecture with a pBLSTM or VGG net + pBLSTM encoder, Adadelta training, padding/masking for variable-length inputs). To satisfy the referee we will move a concise summary of the architecture and feature pipeline into the abstract and ensure the methods section is self-contained. revision: yes
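The control mentioned in the second response, down-sampling 400 fps features back to 100 fps, could be sketched as plain frame decimation. This is an illustrative assumption about how such an ablation might be set up, not a procedure described in the paper; `decimate_frames` is a hypothetical name.

```python
import numpy as np


def decimate_frames(feats: np.ndarray, factor: int) -> np.ndarray:
    """Keep every `factor`-th frame of a (num_frames, feat_dim) matrix."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return feats[::factor]


# Hypothetical 400 fps feature matrix: 391 frames, 2 dimensions.
feats_400 = np.arange(391 * 2, dtype=float).reshape(391, 2)
feats_100 = decimate_frames(feats_400, 4)
print(feats_100.shape)  # (98, 2)
```

Assuming an identical window length and frame alignment, every fourth 400 fps frame coincides with a 100 fps frame, so training on the decimated features would separate the effect of temporal resolution from any change in the extraction pipeline itself.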

Circularity Check

0 steps flagged

No circularity: purely empirical experimental results

The paper reports measured WER reductions from ASR experiments on WSJ and CHiME-5 using high-frame-rate features (200/400 fps) vs. baseline. No equations, derivations, fitted parameters, or predictions appear in the provided text. Claims rest on direct performance numbers rather than any self-definitional, fitted-input, or self-citation reduction. The skeptic concern about missing ablations is a methodological issue, not circularity per the rules.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the validity of the reported experiments and the premise that higher frame rates supply additional usable information; no free parameters, axioms, or invented entities are introduced in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem.

State-of-the-art end-to-end ASR extracts acoustic features ... every 10 ms which corresponds to a frame rate of 100 frames/second. ... High frame rates of 200 and 400 frames/second ...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 2 internal anchors

[1] Introduction. End-to-end automatic speech recognition (ASR) uses a single neural network architecture within a deep learning framework to perform speech-to-text task [1]. There are two major approaches for end-to-end ASR; attention-based approach uses an attention mechanism to create required alignments between acoustic frames and output symbols wh...

[2] Related works. Variable frame rate analysis was investigated in hidden Markov model (HMM)-Gaussian mixture model (GMM) based ASR [13, 14, 15]. In this analysis, frame rates higher than 100 frames/second are used for the rapidly-changing speech segments with relatively high energy while frame rates lower than 100 frames/second are used for steady-state sp...

[3] High-frame-rate features extraction. State-of-the-art end-to-end ASR systems typically extract feature vectors every 10 ms which corresponds to a frame rate of 100 frames/second. When high-frame-rate features extraction of 200 and 400 frames/second are used, feature vectors are extracted every 5 and 2.5 ms, respectively. When the hop size is reduced, mo...

[4] Speech corpora. We carry out experiments on two speech corpora, the Wall Street Journal (WSJ) corpus [11] and the CHiME-5 corpus which was used for the CHiME 2018 speech separation and recognition challenge [12]. These two different ASR tasks, one consisting of clean speech recorded by single microphone (WSJ task) and another consisting of conversational s...

[5] recipes for this corpus. 4.2. CHiME-5 corpus. 4.2.1. Recording scenario. CHiME-5 is the first large-scale corpus of real multi-speaker conversational speech recorded via commercially available multi-microphone hardware in multiple homes [12]. Natural conversational speech from a dinner party of 4 participants was recorded for transcription. Each party was re...

[6] Experiments. 5.1. Speech recognition systems. 5.1.1. Front-end processing. Acoustic features are extracted from the training, development, and evaluation sets for training and testing of ASR systems, on the WSJ and CHiME-5 corpora. Utterance-level mean normalization is applied on the features. For WSJ, the FBANK+pitch features are extracted from the whole ...

[7] Conclusion. This report investigated the use of high-frame-rate features extraction in end-to-end speech recognition. Experimental results on the WSJ and CHiME-5 corpora showed that improved ASR performance was achieved when using features extraction at a frame rate higher than 100 frames/second. These results showed that end-to-end ASR using pBLSTM and ...

[8] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in Proc. of the 31st International Conference on Machine Learning, Beijing, China, June 2014, pp. 1764–1772

[9] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, “Hybrid CTC/attention architecture for end-to-end speech recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, pp. 1240–1253, December 2017

[10] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Proc. Advances in Neural Information Processing Systems (NIPS), 2015, pp. 577–585

[11] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: a neural network for large vocabulary conversational speech recognition,” in Proc. IEEE ICASSP, Shanghai, China, March 2016, pp. 4960–4964

[12] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, pp. 1735–1780, November 1997

[13] T. Hori, S. Watanabe, Y. Zhang, and W. Chan, “Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM,” in Proc. INTERSPEECH, Stockholm, Sweden, August 2017, pp. 949–953

[14] Y. LeCun and Y. Bengio, “Convolutional networks for images, speech, and time series,” in The Handbook of Brain Theory and Neural Networks. MIT Press, 1995

[15] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. International Conference on Learning Representations, 2015

[16] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, “ESPnet: end-to-end speech processing toolkit,” in Proc. INTERSPEECH, Hyderabad, India, September 2018, pp. 2207–2211

[17] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentation for speech recognition,” in Proc. INTERSPEECH, Dresden, Germany, September 2015, pp. 3586–3589

[18] D. B. Paul and J. M. Barker, “The design for the Wall Street Journal-based CSR corpus,” in HLT ’91 Proceedings of the workshop on Speech and Natural Language, New York, USA, February 1992, pp. 357–362

[19] J. Barker, S. Watanabe, E. Vincent, and J. Trmal, “The fifth ‘CHiME’ speech separation and recognition challenge: dataset, task and baselines,” in Proc. INTERSPEECH, Hyderabad, India, September 2018, pp. 1561–1565

[20] Q. Zhu and A. Alwan, “On the use of variable frame rate analysis in speech recognition,” in Proc. IEEE ICASSP, Istanbul, Turkey, June 2000, pp. 1783–1786

[21] J. Macias-Guarasa, J. Ordonez, J. M. Montero, J. Ferreiros, R. Cordoba, and L. F. D. Haro, “Revisiting scenarios and methods for variable frame rate analysis in automatic speech recognition,” in Proc. INTERSPEECH, Geneva, Switzerland, September 2003, pp. 1809–1812

[22] Z.-H. Tan and B. Lindberg, “Low-complexity variable frame rate analysis for speech recognition and voice activity detection,” IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 5, pp. 798–807, 2010

[23] S. B. Davis and P. Mermelstein, “Comparison of parametric representation for monosyllabic word recognition in continuously spoken sentences,” IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 28, no. 4, pp. 357–366, 1980

[24] A.-R. Mohamed, G. Hinton, and G. Penn, “Understanding how deep belief networks perform acoustic modelling,” in Proc. IEEE ICASSP, Kyoto, Japan, March 2012, pp. 4273–4276

[25] P. Ghahremani, B. BabaAli, D. Povey, K. Riedhammer, J. Trmal, and S. Khudanpur, “A pitch extraction algorithm tuned for automatic speech recognition,” in Proc. IEEE ICASSP, Florence, Italy, May 2014, pp. 2513–2517

[26] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi speech recognition toolkit,” in Proc. IEEE ASRU 2011, Hawaii, USA, December 2011

[27] C.-T. Do, “Subband temporal envelope features and data augmentation for end-to-end recognition of distant conversational speech,” in Proc. IEEE ICASSP, Brighton, UK, May 2019

[28] X. Anguera, C. Wooters, and J. Hernando, “Acoustic beamforming for speaker diarization of meetings,” IEEE Trans. on Audio, Speech and Language Processing, vol. 15, no. 7, pp. 2011–2023, 2007

[29] S. Tokui, K. Oono, S. Hido, and J. Clayton, “Chainer: a next-generation open source framework for deep learning,” in Proc. of NIPS Workshop on Machine Learning Systems (LearningSys), 2015

[30] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012

[31] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” in Proc. International Conference on Machine Learning, Atlanta, USA, June 2013, pp. 1310–1318
