Integration of TensorFlow based Acoustic Model with Kaldi WFST Decoder

Ji-Hwan Kim; Minkyu Lim

arXiv: 1906.11018 · v1 · pith:5JI3AQE7new · submitted 2019-06-21 · 📡 eess.AS · cs.LG · cs.SD

Pith reviewed 2026-05-25 18:41 UTC · model grok-4.3

classification 📡 eess.AS · cs.LG · cs.SD

keywords TensorFlow · Kaldi · acoustic model · WFST decoder · speech recognition · DNN · integration · posterior probabilities

The pith

TensorFlow acoustic models reach the same performance level as Kaldi models when integrated with the WFST decoder.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that Kaldi features and alignments can be converted into a form usable by TensorFlow, allowing a DNN acoustic model to be trained there and then queried at runtime inside the Kaldi WFST decoder to supply posterior probabilities for lattice generation. This produces a one-pass decoder that supports arbitrary neural network designs from TensorFlow while retaining the WFST decoder's beam search and online decoding capability. The authors report the same level of performance as a native Kaldi model on the RM, WSJ, and LibriSpeech datasets. The approach matters because it removes the need to reimplement flexible network architectures inside the Kaldi framework itself.

Core claim

By converting Kaldi features and alignments for TensorFlow training, training the DNN acoustic model, and querying the trained model during Kaldi decoding to obtain posterior probabilities for lattice generation, the TensorFlow based acoustic models trained on the RM, WSJ, and LibriSpeech datasets show the same level of performance as the model trained using the Kaldi framework.
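
The conversion step in this claim can be pictured with a minimal sketch. It assumes the Kaldi features and frame-level alignments have already been loaded into NumPy arrays (reading the ark files is not shown), and the function names, the edge-padding rule, and the default context widths are all hypothetical rather than taken from the paper.

```python
import numpy as np


def splice_frames(feats: np.ndarray, left: int, right: int) -> np.ndarray:
    """Stack each frame with `left` past and `right` future frames.

    Edge frames are padded by repeating the first/last frame; this padding
    rule is an assumption, not something the paper specifies.
    """
    num_frames = feats.shape[0]
    padded = np.concatenate(
        [
            np.repeat(feats[:1], left, axis=0),
            feats,
            np.repeat(feats[-1:], right, axis=0),
        ],
        axis=0,
    )
    window = left + 1 + right
    # Row t holds frames t-left .. t+right of the original, flattened.
    return np.stack([padded[t:t + window].reshape(-1) for t in range(num_frames)])


def make_training_examples(feats, alignment, left=5, right=5):
    """Pair spliced feature frames with per-frame target labels."""
    alignment = np.asarray(alignment, dtype=np.int64)
    if feats.shape[0] != alignment.shape[0]:
        raise ValueError("features and alignment must have the same number of frames")
    return splice_frames(feats, left, right).astype(np.float32), alignment


# Toy utterance: 3 frames of 2-dimensional features, one label per frame.
toy_feats = np.arange(6, dtype=np.float32).reshape(3, 2)
inputs, targets = make_training_examples(toy_feats, [0, 1, 2], left=1, right=1)
```

The resulting `inputs` and `targets` arrays are in a shape that any general-purpose framework can consume, which is the point of the conversion.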

What carries the argument

The one-pass decoder that queries the TensorFlow model for posterior probabilities during beam search inside the Kaldi WFST framework.
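
As a rough illustration of what the decoder receives from such a query, the sketch below turns raw network outputs into per-frame log scores. The source states only that posterior probabilities are obtained from the TensorFlow model; the optional prior subtraction and the `acoustic_scale` parameter follow the common hybrid DNN-HMM convention and are assumptions here, as are the function names.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis of a 2-D array."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def acoustic_scores(logits, log_priors=None, acoustic_scale=1.0):
    """Convert network outputs to per-frame, per-state log scores.

    With `log_priors` given, log posteriors are divided by state priors
    (subtracted in the log domain), yielding scaled pseudo-likelihoods.
    Whether the paper applies this step is not stated in the source.
    """
    scores = log_softmax(np.asarray(logits, dtype=np.float64))
    if log_priors is not None:
        scores = scores - np.asarray(log_priors, dtype=np.float64)
    return acoustic_scale * scores


# Two frames, two output states.
toy_logits = np.array([[0.0, 0.0], [1.0, 3.0]])
log_post = acoustic_scores(toy_logits)
pseudo_ll = acoustic_scores(toy_logits, log_priors=np.log([0.5, 0.5]))
```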

If this is right

Arbitrary neural network architectures built in TensorFlow can be applied directly to WFST-based speech recognition.
WFST-based online decoding becomes possible using a TensorFlow acoustic model.
The integrated approach maintains performance parity across the RM, WSJ, and LibriSpeech datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could shorten the time needed to test new network designs inside mature speech pipelines.
Similar conversion layers might allow other deep learning frameworks to interface with the same WFST decoder.
Measuring the added latency from model queries would clarify suitability for low-latency online applications.
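
The latency measurement suggested in the last point could start from a small timing harness like the one below. It is a sketch under stated assumptions: `query_fn` is a hypothetical stand-in for whatever call reaches the trained model, no TensorFlow API is assumed, and the helper name and returned keys are illustrative.

```python
import statistics
import time


def measure_query_latency(query_fn, batches, warmup=1):
    """Time `query_fn` on each batch and summarise the wall-clock cost.

    The first `warmup` batches are run once untimed so that one-off
    start-up costs do not distort the statistics.
    """
    for batch in batches[:warmup]:
        query_fn(batch)

    timings_ms = []
    for batch in batches:
        start = time.perf_counter()
        query_fn(batch)
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "mean_ms": statistics.mean(timings_ms),
        "max_ms": max(timings_ms),
        "num_queries": len(timings_ms),
    }


# Stand-in "model": summing a list takes the place of a real model query.
latency = measure_query_latency(lambda batch: sum(batch), [[1, 2], [3, 4], [5]])
```

Reporting the maximum alongside the mean matters for online decoding, where a single slow query can stall the beam search for that chunk of audio.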

Load-bearing premise

Converting Kaldi features and alignments into a format usable by TensorFlow preserves all information required for the resulting acoustic model to reach the same level of performance when plugged back into the Kaldi decoder.

What would settle it

A word error rate comparison on the LibriSpeech test set that shows the TensorFlow-integrated model exceeding the error rate of the Kaldi baseline would falsify the performance equivalence claim.
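
For reference, word error rate is the word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below is a plain implementation of that definition, not the scoring tool used in the paper; the function names are illustrative and text normalisation is left out.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1,                         # deletion
                    curr[j - 1] + 1,                     # insertion
                    prev[j - 1] + (ref_tok != hyp_tok),  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def word_error_rate(references, hypotheses):
    """Corpus-level WER in percent over paired reference/hypothesis strings."""
    errors = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = ref.split()
        errors += edit_distance(ref_words, hyp.split())
        words += len(ref_words)
    if words == 0:
        raise ValueError("references contain no words")
    return 100.0 * errors / words


wer_same = word_error_rate(["the cat sat"], ["the cat sat"])
wer_diff = word_error_rate(["the cat sat on the mat"], ["the cat sit on mat"])
```

Running both systems through the same scorer on the same test set is what makes the two error rates directly comparable.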

Figures

Figures reproduced from arXiv: 1906.11018 by Ji-Hwan Kim, Minkyu Lim.

Original abstract

While the Kaldi framework provides state-of-the-art components for speech recognition like feature extraction, deep neural network (DNN)-based acoustic models, and a weighted finite state transducer (WFST)-based decoder, it is difficult to implement a new flexible DNN model. By contrast, a general-purpose deep learning framework, such as TensorFlow, can easily build various types of neural network architectures using a tensor-based computation method, but it is difficult to apply them to WFST-based speech recognition. In this study, a TensorFlow-based acoustic model is integrated with a WFST-based Kaldi decoder to combine the two frameworks. The features and alignments used in Kaldi are converted so they can be trained by the TensorFlow model, and the DNN-based acoustic model is then trained. In the integrated Kaldi decoder, the posterior probabilities are calculated by querying the trained TensorFlow model, and a beam search is performed to generate the lattice. The advantages of the proposed one-pass decoder include the application of various types of neural networks to WFST-based speech recognition and WFST-based online decoding using a TensorFlow-based acoustic model. The TensorFlow based acoustic models trained using the RM, WSJ, and LibriSpeech datasets show the same level of performance as the model trained using the Kaldi framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper gives a working recipe to run TensorFlow acoustic models inside Kaldi's WFST decoder via feature and alignment conversion, but the abstract supplies no WER numbers or training details to support the equivalence claim.

The paper shows how to convert Kaldi features and alignments into a format TensorFlow can train on, then query the resulting model for posteriors during Kaldi decoding. The main point is that this lets people use arbitrary network architectures while keeping the WFST search and online decoding path intact. That is the concrete contribution: a documented bridge between the two toolkits on RM, WSJ, and LibriSpeech data.

Referee Report

2 major / 2 minor

Summary. The manuscript describes an integration between TensorFlow-based acoustic models and the Kaldi WFST decoder. Kaldi features and alignments are converted for training in TensorFlow; the resulting model is then queried for posterior probabilities inside the Kaldi decoder to perform beam search and lattice generation. The central empirical claim is that TensorFlow models trained on the RM, WSJ, and LibriSpeech corpora achieve the same performance level as native Kaldi acoustic models.

Significance. If the reported performance equivalence is substantiated, the work would enable arbitrary neural-network architectures developed in general-purpose frameworks such as TensorFlow to be used inside an established WFST-based decoder, supporting both one-pass online decoding and more flexible acoustic-model experimentation while retaining Kaldi’s search infrastructure.

major comments (2)

[Abstract] Abstract: the claim that 'The TensorFlow based acoustic models trained using the RM, WSJ, and LibriSpeech datasets show the same level of performance as the model trained using the Kaldi framework' is presented without any WER values, tables, error bars, training hyper-parameters, or statistical tests. Because this equivalence is the load-bearing empirical result, the absence of quantitative evidence prevents verification of the central contribution.
[Method (conversion step)] Conversion procedure (described in the abstract and method outline): no details are supplied on feature normalization, frame-alignment precision, or any checks that the converted data preserve the exact information used by the original Kaldi DNN. The performance-equivalence claim rests on this conversion being information-preserving; without explicit verification (e.g., posterior comparison or ablation on the conversion step), the result cannot be assessed.
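
The posterior comparison requested in the second comment could take roughly the following form. This is a sketch only: it assumes two posterior matrices of shape (frames, states) computed on the same utterance by the two systems, and the function name, the metrics chosen, and the `eps` smoothing value are assumptions rather than anything described in the paper.

```python
import numpy as np


def compare_posteriors(p_ref, p_test, eps=1e-10):
    """Summarise how closely two (frames x states) posterior matrices agree."""
    p_ref = np.asarray(p_ref, dtype=np.float64)
    p_test = np.asarray(p_test, dtype=np.float64)
    if p_ref.shape != p_test.shape:
        raise ValueError("posterior matrices must have the same shape")

    # Per-frame KL divergence KL(p_ref || p_test), smoothed to avoid log(0).
    kl = np.sum(p_ref * (np.log(p_ref + eps) - np.log(p_test + eps)), axis=1)
    return {
        "max_abs_diff": float(np.abs(p_ref - p_test).max()),
        "mean_kl": float(kl.mean()),
        "argmax_agreement": float(
            np.mean(p_ref.argmax(axis=1) == p_test.argmax(axis=1))
        ),
    }


baseline = np.array([[0.7, 0.3], [0.2, 0.8]])
shifted = np.array([[0.4, 0.6], [0.2, 0.8]])
identical = compare_posteriors(baseline, baseline)
different = compare_posteriors(baseline, shifted)
```

Near-zero divergence and full argmax agreement on held-out utterances would support the information-preserving assumption; large gaps would point to the conversion or to a topology mismatch.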

minor comments (2)

The manuscript would benefit from a dedicated Experiments section containing side-by-side WER tables for the three corpora.
Clarify whether the TensorFlow network architecture exactly replicates the Kaldi DNN topology or introduces any differences that could affect the comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight the need for stronger quantitative support and methodological transparency. We will revise the manuscript to address both points.

Referee: [Abstract] Abstract: the claim that 'The TensorFlow based acoustic models trained using the RM, WSJ, and LibriSpeech datasets show the same level of performance as the model trained using the Kaldi framework' is presented without any WER values, tables, error bars, training hyper-parameters, or statistical tests. Because this equivalence is the load-bearing empirical result, the absence of quantitative evidence prevents verification of the central contribution.

Authors: We agree that the abstract and manuscript would benefit from explicit quantitative results. In the revision we will add the WER numbers for both TensorFlow and native Kaldi models on RM, WSJ and LibriSpeech, include a results table with training hyperparameters, and note any available error-bar or significance information. This directly substantiates the equivalence claim. revision: yes
Referee: [Method (conversion step)] Conversion procedure (described in the abstract and method outline): no details are supplied on feature normalization, frame-alignment precision, or any checks that the converted data preserve the exact information used by the original Kaldi DNN. The performance-equivalence claim rests on this conversion being information-preserving; without explicit verification (e.g., posterior comparison or ablation on the conversion step), the result cannot be assessed.

Authors: We accept that the conversion step requires more explicit description and validation. The revised Methods section will detail the feature normalization and alignment precision, and will add verification (a posterior comparison between the original Kaldi DNN and the TensorFlow model trained on the converted inputs, plus a brief ablation on the conversion). These additions are intended to confirm that the conversion is information-preserving. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical integration paper with no derivations or fitted predictions

The paper is an engineering description of converting Kaldi features/alignments to TensorFlow format, training a DNN acoustic model in TF, and plugging the resulting posteriors into the Kaldi WFST decoder. No equations, no parameter fitting presented as prediction, and no self-citation chains are used to justify any result. The performance claim is an empirical observation on RM/WSJ/LibriSpeech, not a reduction to inputs by construction. This matches the default case of a self-contained implementation report.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper contains no mathematical model, free parameters, axioms, or invented entities; it is a software integration report.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 6 internal anchors

[1] Introduction. Automatic speech recognition (ASR) has been significantly improved in recent years [1, 2, 3]. Many researchers and leading companies have actively engaged in studying speech recognition. One of the reasons for this active research is that various open-source ASR frameworks have been introduced. In the early stages, there was a lot of resea...

[2] TensorFlow based Acoustic Modeling. The overall procedure to build a Kaldi ASR system is shown in Figure 1. The system is divided into feature extraction, GMM training, DNN training, and WFST decoding processes. For the training step, TensorFlow is used for DNN model training and decoding Kaldi is used to obtain the feature and label information for the T...

[3] Integration with Kaldi Decoder. The Kaldi decoder consists of the feature reader, nnetcomputer, and lattice generator as shown in Figure 3. The feature reader reads a feature vector in ark format, reconstructs the feature vector according to the left and right context length, and transmits it to nnetcomputer. In nnetcomputer, after the posterior probabili...

[4] Experiments. From among the learning materials that are widely used for speech recognition evaluation, three data sets were chosen for our experiments ranging from a small corpus to a large corpus. The corpora that were used in the experiments are RM, WSJ, and LibriSpeech, whose total learning lengths are 9 hours, 80 hours, and 960 hours, respectively. In ...

[5] Conclusion and Future Work. We proposed a method in which the gap between Kaldi and TensorFlow is eliminated, an acoustic model is trained in TensorFlow, and the model is integrated into the Kaldi-based decoder. The acoustic model training is conducted in TensorFlow, and the features, labels, and other WFST-based decoders are obtained through Kaldi. To ...

[6] Acknowledgements. This work was supported by Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No.2017-0-01772, Development of QA systems for Video Story Understanding to pass the Video Turing Test)
[7] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436-444, 2015
[8] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85-117, 2015
[9] D. Yu and J. Li, “Recent progresses in deep learning based acoustic models,” IEEE/CAA Journal of Automatica Sinica, vol. 4, no. 3, pp. 399-412, 2019
[10] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book (for version 3.4), Cambridge University Engineering Department, 2009
[11] P. Lamere, P. Kwok, E. Gouvea, B. Raj, R. Singh, W. Walker, M. Warmuth, and P. Wolf, “The CMU SPHINX-4 speech recognition system,” in ICASSP 2003 - 2003 IEEE International Conference on Acoustics, Speech and Signal Processing, April 6-10, Hong Kong, China, Proceedings, 2003, pp. 2-5
[12] F. Seide and A. Agarwal, “CNTK: Microsoft's open-source deep-learning toolkit,” in SIGKDD 2016 - 22nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining, August 13-17, San Francisco, U.S.A., Proceedings, 2016, pp. 2135-2135
[13] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi Speech Recognition Toolkit,” 2011 IEEE Workshop on Automatic Speech Recognition and Understanding, IEEE Signal Processing Society, 2011
[14] M. Mohri, F. Pereira, and M. Riley, “Weighted finite-state transducers in speech recognition,” Computer Speech and Language, vol. 16, no. 1, pp. 69-88, 2002
[15] X. Zhang, J. Trmal, D. Povey, and S. Khudanpur, “Improving deep neural network acoustic models using generalized maxout networks,” in ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing, May 4-9, Florence, Italy, Proceedings, 2014, pp. 215-219
[16] G. Cheng, V. Peddinti, D. Povey, V. Manohar, S. Khudanpur, and Y. Yan, “An Exploration of Dropout with LSTMs,” in INTERSPEECH 2017 - 18th Annual Conference of the International Speech Communication Association, August 20-24, Stockholm, Sweden, Proceedings, 2017, pp. 1586-1590
[17] V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” in INTERSPEECH 2015 - 16th Annual Conference of the International Speech Communication Association, September 6-10, Dresden, Germany, Proceedings, 2015, pp. 3214-3218
[18] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, and M. Kudlur, “Tensorflow: A system for large-scale machine learning,” in OSDI 2016 - 12th USENIX Symposium on Operating Systems Design and Implementation, November 2-4, Savannah, U.S.A., 2016, pp. 265-283
[19] R. Collobert, S. Bengio, and J. Mariéthoz, “Torch: a modular machine learning software library,” Technical Report IDIAP-RR 02-46, 2002
[20] N. Ketkar, “Introduction to pytorch,” in Deep learning with python, Apress, 2017
[21] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Farley, and Y. Bengio, “Theano: a CPU and GPU math expression compiler,” in SciPy 2010 - Proceedings of the Python for Scientific Computing Conference, June 28-July 3, Austin, U.S.A., Proceedings, 2010, pp. 3-10
[22] J. Chorowski and N. Jaitly, “Towards better decoding and language model integration in sequence to sequence models,” arXiv preprint arXiv:1612.02695, 2016
[23] Y. Wu, M. Schuster, Z. Chen, V. Le, M. Norouzi, W. Macherey, and J. Klingner, “Google's neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016
[24] M. Ravanelli, T. Parcollet, and Y. Bengio, “The PyTorch-Kaldi Speech Recognition Toolkit,” arXiv preprint arXiv:1811.07453, 2018
[25] M. Gutmann and A. Hyvärinen, “Noise-contrastive estimation: A new estimation principle for unnormalized statistical models,” in AISTATS 2010 - 13th International Conference on Artificial Intelligence and Statistics, May 13-15, Sardinia, Italy, Proceedings, 2010, pp. 297-304
[26] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929-1958, 2014
[27] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015
[28] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv preprint arXiv:1609.04747, 2016
[29] C. Olston, N. Fiedel, K. Gorovoy, J. Harmsen, L. Lao, F. Li, and J. Soyke, “Tensorflow-serving: Flexible, high-performance ml serving,” arXiv preprint arXiv:1712.06139, 2017
[30] D. Povey, D. Kanevsky, B. Kingsbury, B. Ramabhadran, G. Saon, and K. Visweswariah, “Boosted MMI for model and feature-space discriminative training,” in ICASSP 2008 - 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, March 30-April 4, Las Vegas, U.S.A., Proceedings, 2008, pp. 4057-4060
[31] D. Paul and J. Baker, “The design for the Wall Street Journal-based CSR corpus,” in Proceedings of the workshop on Speech and Natural Language, February 23-26, Harriman, U.S.A., Proceedings, 1992
[32] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in ICASSP 2015 - 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, April 19-24, Brisbane, Australia, Proceedings, 2015, pp. 5206-5210
