
arxiv: 1906.11018 · v1 · pith:5JI3AQE7new · submitted 2019-06-21 · 📡 eess.AS · cs.LG · cs.SD

Integration of TensorFlow based Acoustic Model with Kaldi WFST Decoder

Pith reviewed 2026-05-25 18:41 UTC · model grok-4.3

classification 📡 eess.AS · cs.LG · cs.SD
keywords TensorFlow, Kaldi, acoustic model, WFST decoder, speech recognition, DNN, integration, posterior probabilities

The pith

TensorFlow acoustic models reach the same performance level as Kaldi models when integrated with the WFST decoder.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that Kaldi features and alignments can be converted into a form usable by TensorFlow, allowing a DNN acoustic model to be trained there and then queried at runtime inside the Kaldi WFST decoder to supply posterior probabilities for lattice generation. This produces a one-pass decoder that supports arbitrary neural network designs from TensorFlow while retaining the WFST decoder's beam search and online decoding capability. Experiments on the RM, WSJ, and LibriSpeech datasets report the same level of performance as a native Kaldi model. A sympathetic reader would care because the approach removes the need to reimplement flexible network architectures inside the Kaldi framework itself.

Core claim

By converting Kaldi features and alignments for TensorFlow training, training the DNN acoustic model, and querying the trained model during Kaldi decoding to obtain posterior probabilities for lattice generation, the TensorFlow based acoustic models trained on the RM, WSJ, and LibriSpeech datasets show the same level of performance as the model trained using the Kaldi framework.

What carries the argument

The one-pass decoder that queries the TensorFlow model for posterior probabilities during beam search inside the Kaldi WFST framework.
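As a rough illustration of the query step, the sketch below turns one frame of network outputs into the per-state log scores a decoder could consume. It is a minimal sketch under stated assumptions, not the paper's code: `frame_scores`, the `priors` argument, and the floor value are hypothetical names chosen here, and subtracting log priors is a common hybrid DNN-HMM convention that the abstract does not confirm.

```python
import math

_FLOOR = 1e-30  # hypothetical floor to avoid log(0) on underflow


def softmax(logits):
    """Numerically stable softmax over one frame of network outputs."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def frame_scores(logits, priors=None):
    """Return one log score per state for a single frame.

    priors=None  -> log posteriors.
    priors given -> log posterior minus log prior (assumed convention,
                    not confirmed by the source).
    """
    post = softmax(logits)
    if priors is None:
        return [math.log(max(p, _FLOOR)) for p in post]
    if len(priors) != len(post):
        raise ValueError("priors and logits must have the same length")
    return [math.log(max(p, _FLOOR)) - math.log(max(q, _FLOOR))
            for p, q in zip(post, priors)]
```

In a real integration the logits would come from the trained TensorFlow model; here they are plain lists so the arithmetic can be checked in isolation.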

If this is right

  • Arbitrary neural network architectures built in TensorFlow can be applied directly to WFST-based speech recognition.
  • WFST-based online decoding becomes possible using a TensorFlow acoustic model.
  • The integrated approach maintains performance parity across the RM, WSJ, and LibriSpeech datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could shorten the time needed to test new network designs inside mature speech pipelines.
  • Similar conversion layers might allow other deep learning frameworks to interface with the same WFST decoder.
  • Measuring the added latency from model queries would clarify suitability for low-latency online applications.

Load-bearing premise

Converting Kaldi features and alignments into a format usable by TensorFlow preserves all information required for the resulting acoustic model to reach identical performance when plugged back into the Kaldi decoder.
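One concrete place where training-time and decoding-time inputs must agree is context splicing: the paper's decoder description says the feature reader reconstructs each feature vector according to the left and right context length. The sketch below shows one way such splicing can work. The function name `splice` and the edge handling (repeating the first and last frame) are assumptions made for illustration, not details taken from the paper.

```python
def splice(frames, left, right):
    """Concatenate each frame with its `left` previous and `right` next frames.

    `frames` is a list of equal-length feature vectors (lists of floats).
    At utterance edges the first or last frame is repeated (assumed policy).
    """
    n = len(frames)
    spliced = []
    for t in range(n):
        row = []
        for k in range(t - left, t + right + 1):
            # Clamp the index so out-of-range context reuses the edge frame.
            row.extend(frames[min(max(k, 0), n - 1)])
        spliced.append(row)
    return spliced
```

If training and decoding used different context lengths or edge policies, the model would see inputs at decode time that it never saw in training, which is one way the premise above could fail.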

What would settle it

A word error rate comparison on the LibriSpeech test set that shows the TensorFlow-integrated model exceeding the error rate of the Kaldi baseline would falsify the performance equivalence claim.
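For readers who want to run such a comparison themselves, word error rate is the word-level edit distance between reference and hypothesis, summed over utterances and divided by the total number of reference words. The following is a minimal sketch; the function names are chosen here, and whitespace tokenization is an assumption (established scoring tools apply their own text normalization).

```python
def edit_distance(ref_words, hyp_words):
    """Levenshtein distance over word lists (substitution, insertion, deletion)."""
    prev = list(range(len(hyp_words) + 1))
    for i, rw in enumerate(ref_words, 1):
        cur = [i]
        for j, hw in enumerate(hyp_words, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (rw != hw)))  # substitution or match
        prev = cur
    return prev[-1]


def corpus_wer(pairs):
    """WER over (reference, hypothesis) string pairs, as a fraction."""
    errors = 0
    ref_len = 0
    for ref, hyp in pairs:
        r, h = ref.split(), hyp.split()
        errors += edit_distance(r, h)
        ref_len += len(r)
    if ref_len == 0:
        raise ValueError("reference transcripts contain no words")
    return errors / ref_len
```

Scoring both systems with the same function on the same test set keeps the comparison free of scoring differences.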

Figures

Figures reproduced from arXiv: 1906.11018 by Ji-Hwan Kim, Minkyu Lim.

Figure 3. Structure of the Kaldi decoder.
Original abstract

While the Kaldi framework provides state-of-the-art components for speech recognition like feature extraction, deep neural network (DNN)-based acoustic models, and a weighted finite state transducer (WFST)-based decoder, it is difficult to implement a new flexible DNN model. By contrast, a general-purpose deep learning framework, such as TensorFlow, can easily build various types of neural network architectures using a tensor-based computation method, but it is difficult to apply them to WFST-based speech recognition. In this study, a TensorFlow-based acoustic model is integrated with a WFST-based Kaldi decoder to combine the two frameworks. The features and alignments used in Kaldi are converted so they can be trained by the TensorFlow model, and the DNN-based acoustic model is then trained. In the integrated Kaldi decoder, the posterior probabilities are calculated by querying the trained TensorFlow model, and a beam search is performed to generate the lattice. The advantages of the proposed one-pass decoder include the application of various types of neural networks to WFST-based speech recognition and WFST-based online decoding using a TensorFlow-based acoustic model. The TensorFlow based acoustic models trained using the RM, WSJ, and LibriSpeech datasets show the same level of performance as the model trained using the Kaldi framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript describes an integration between TensorFlow-based acoustic models and the Kaldi WFST decoder. Kaldi features and alignments are converted for training in TensorFlow; the resulting model is then queried for posterior probabilities inside the Kaldi decoder to perform beam search and lattice generation. The central empirical claim is that TensorFlow models trained on the RM, WSJ, and LibriSpeech corpora achieve the same performance level as native Kaldi acoustic models.

Significance. If the reported performance equivalence is substantiated, the work would enable arbitrary neural-network architectures developed in general-purpose frameworks such as TensorFlow to be used inside an established WFST-based decoder, supporting both one-pass online decoding and more flexible acoustic-model experimentation while retaining Kaldi’s search infrastructure.

major comments (2)
  1. [Abstract] Abstract: the claim that 'The TensorFlow based acoustic models trained using the RM, WSJ, and LibriSpeech datasets show the same level of performance as the model trained using the Kaldi framework' is presented without any WER values, tables, error bars, training hyper-parameters, or statistical tests. Because this equivalence is the load-bearing empirical result, the absence of quantitative evidence prevents verification of the central contribution.
  2. [Method (conversion step)] Conversion procedure (described in the abstract and method outline): no details are supplied on feature normalization, frame-alignment precision, or any checks that the converted data preserve the exact information used by the original Kaldi DNN. The performance-equivalence claim rests on this conversion being information-preserving; without explicit verification (e.g., posterior comparison or ablation on the conversion step), the result cannot be assessed.
minor comments (2)
  1. The manuscript would benefit from a dedicated Experiments section containing side-by-side WER tables for the three corpora.
  2. Clarify whether the TensorFlow network architecture exactly replicates the Kaldi DNN topology or introduces any differences that could affect the comparison.
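The posterior comparison suggested in major comment 2 could be sketched as follows: given per-frame posteriors from two models on the same utterance, report the largest absolute difference and the mean per-frame KL divergence. This is an illustrative check, not a procedure described in the manuscript; `compare_posteriors` and the `eps` floor are hypothetical.

```python
import math


def compare_posteriors(p_rows, q_rows, eps=1e-12):
    """Compare two posterior matrices of equal shape, frame by frame.

    Returns (max_abs_diff, mean_kl), where mean_kl averages
    KL(p || q) over frames. `eps` floors q to avoid division by zero.
    """
    if len(p_rows) != len(q_rows) or not p_rows:
        raise ValueError("need the same non-zero number of frames")
    max_diff = 0.0
    kl_total = 0.0
    for p, q in zip(p_rows, q_rows):
        if len(p) != len(q):
            raise ValueError("frames must have the same number of states")
        for a, b in zip(p, q):
            max_diff = max(max_diff, abs(a - b))
            if a > 0.0:  # terms with p == 0 contribute nothing to KL
                kl_total += a * math.log(a / max(b, eps))
    return max_diff, kl_total / len(p_rows)
```

Values near zero on both measures would indicate that the two models receive equivalent inputs and produce matching outputs; larger values would point to a mismatch somewhere in conversion or model topology.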

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight the need for stronger quantitative support and methodological transparency. We will revise the manuscript to address both points.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that 'The TensorFlow based acoustic models trained using the RM, WSJ, and LibriSpeech datasets show the same level of performance as the model trained using the Kaldi framework' is presented without any WER values, tables, error bars, training hyper-parameters, or statistical tests. Because this equivalence is the load-bearing empirical result, the absence of quantitative evidence prevents verification of the central contribution.

    Authors: We agree that the abstract and manuscript would benefit from explicit quantitative results. In the revision we will add the WER numbers for both TensorFlow and native Kaldi models on RM, WSJ and LibriSpeech, include a results table with training hyperparameters, and note any available error-bar or significance information. This directly substantiates the equivalence claim. revision: yes

  2. Referee: [Method (conversion step)] Conversion procedure (described in the abstract and method outline): no details are supplied on feature normalization, frame-alignment precision, or any checks that the converted data preserve the exact information used by the original Kaldi DNN. The performance-equivalence claim rests on this conversion being information-preserving; without explicit verification (e.g., posterior comparison or ablation on the conversion step), the result cannot be assessed.

    Authors: We accept that the conversion step requires more explicit description and validation. The revised Methods section will detail the feature normalization and alignment precision, and will add verification (a posterior comparison between the original Kaldi DNN and the TensorFlow model trained on the converted inputs, plus a brief ablation on the conversion). These additions are intended to show that the conversion is information-preserving. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical integration paper with no derivations or fitted predictions

Full rationale

The paper is an engineering description of converting Kaldi features/alignments to TensorFlow format, training a DNN acoustic model in TF, and plugging the resulting posteriors into the Kaldi WFST decoder. No equations, no parameter fitting presented as prediction, and no self-citation chains are used to justify any result. The performance claim is an empirical observation on RM/WSJ/LibriSpeech, not a reduction to inputs by construction. This matches the default case of a self-contained implementation report.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper contains no mathematical model, free parameters, axioms, or invented entities; it is a software integration report.



Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 6 internal anchors

  1. [1]

    Many researchers and leading companies have actively engaged in studying speech recognition

    Introduction Automatic speech recognition (ASR) has been significantly improved in recent years [1, 2, 3]. Many researchers and leading companies have actively engaged in studying speech recognition. One of the reasons for this active research is that various open-source ASR frameworks have been introduced. In the early stages, there was a lot of resea...

  2. [2]

    The system is divided into feature extraction, GMM training, DNN training, and WFST decoding processes

    TensorFlow based Acoustic Modeling The overall procedure to build a Kaldi ASR system is shown in Figure 1. The system is divided into feature extraction, GMM training, DNN training, and WFST decoding processes. For the training step, TensorFlow is used for DNN model training and decoding Kaldi is used to obtain the feature and label information for the T...

  3. [3]

    The feature reader reads a feature vector in ark format, reconstructs the feature vector according to the left and right context length, and transmits it to nnetcomputer

    Integration with Kaldi Decoder The Kaldi decoder consists of the feature reader, nnetcomputer, and lattice generator as shown in Figure 3. The feature reader reads a feature vector in ark format, reconstructs the feature vector according to the left and right context length, and transmits it to nnetcomputer. In nnetcomputer, after the posterior probabili...

  4. [4]

    The corpora that were used in the experiments are RM, WSJ, and LibriSpeech, whose total learning lengths are 9 hours, 80 hours, and 960 hours, respectively

    Experiments From among the learning materials that are widely used for speech recognition evaluation, three data sets were chosen for our experiments ranging from a small corpus to a large corpus. The corpora that were used in the experiments are RM, WSJ, and LibriSpeech, whose total learning lengths are 9 hours, 80 hours, and 960 hours, respectively. In ...

  5. [5]

    The acoustic model training is conducted in TensorFlow, and the features, labels, and other WFST-based decoders are obtained through Kaldi

    Conclusion and Future Work We proposed a method in which the gap between Kaldi and TensorFlow is eliminated, an acoustic model is trained in TensorFlow, and the model is integrated into the Kaldi-based decoder. The acoustic model training is conducted in TensorFlow, and the features, labels, and other WFST-based decoders are obtained through Kaldi. To ...

  6. [6]

    Acknowledgements This work was supported by Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2017-0-01772, Development of QA systems for Video Story Understanding to pass the Video Turing Test)

  7. [7]

    Deep learning,

    Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436-444, 2015

  8. [8]

    Deep learning in neural networks: An overview,

    J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85-117, 2015

  9. [9]

    Recent progresses in deep learning based acoustic models,

    D. Yu and J. Li, “Recent progresses in deep learning based acoustic models,” IEEE/CAA Journal of Automatica Sinica, vol. 4, no. 3, pp. 399-412, 2019

  10. [10]

    The HTK Book (for version 3.4)

    S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book (for version 3.4), Cambridge University Engineering Department, 2009

  11. [11]

    The CMU SPHINX-4 speech recognition system,

    P. Lamere, P. Kwok, E. Gouvea, B. Raj, R. Singh, W. Walker, M. Warmuth, and P. Wolf, “The CMU SPHINX-4 speech recognition system,” in ICASSP 2003 – 2003 IEEE International Conference on Acoustics, Speech and Signal Processing, April 6-10, Hong Kong, China, Proceedings, 2003, pp. 2-5

  12. [12]

    CNTK: Microsoft's open-source deep-learning toolkit,

    F. Seide and A. Agarwal, “CNTK: Microsoft's open-source deep-learning toolkit,” in SIGKDD 2016 – 22nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining, August 13-17, San Francisco, U.S.A., Proceedings, 2016, pp. 2135-2135

  13. [13]

    The Kaldi Speech Recognition Toolkit,

    D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi Speech Recognition Toolkit,” 2011 IEEE Workshop on Automatic Speech Recognition and Understanding, IEEE Signal Processing Society, 2011

  14. [14]

    Weighted finite-state transducers in speech recognition,

    M. Mohri, F. Pereira, and M. Riley, “Weighted finite-state transducers in speech recognition,” Computer Speech and Language, vol. 16, no. 1, pp. 69-88, 2002

  15. [15]

    Improving deep neural network acoustic models using generalized maxout networks,

    X. Zhang, J. Trmal, D. Povey, and S. Khudanpur, “Improving deep neural network acoustic models using generalized maxout networks,” in ICASSP 2014 – 2014 IEEE International Conference on Acoustics, Speech and Signal Processing, May 4-9, Florence, Italy, Proceedings, 2014, pp. 215-219

  16. [16]

    An Exploration of Dropout with LSTMs,

    G. Cheng, V. Peddinti, D. Povey, V. Manohar, S. Khudanpur, and Y. Yan, “An Exploration of Dropout with LSTMs,” in INTERSPEECH 2017 – 18th Annual Conference of the International Speech Communication Association, August 20-24, Stockholm, Sweden, Proceedings, 2017, pp. 1586-1590

  17. [17]

    A time delay neural network architecture for efficient modeling of long temporal contexts,

    V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” in INTERSPEECH 2015 – 16th Annual Conference of the International Speech Communication Association, September 6-10, Dresden, Germany, Proceedings, 2015, pp. 3214-3218

  18. [18]

    TensorFlow: A system for large-scale machine learning,

    M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, and M. Kudlur, “TensorFlow: A system for large-scale machine learning,” in OSDI 2016 – 12th USENIX Symposium on Operating Systems Design and Implementation, November 2-4, Savannah, U.S.A., 2016, pp. 265-283

  19. [19]

    Torch: a modular machine learning software library,

    R. Collobert, S. Bengio, and J. Mariéthoz, “Torch: a modular machine learning software library,” Technical Report IDIAP-RR 02-46, 2002

  20. [20]

    Introduction to pytorch

    N. Ketkar, “Introduction to pytorch,” in Deep learning with python, Apress, 2017

  21. [21]

    Theano: a CPU and GPU math expression compiler,

    J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Farley, and Y. Bengio, “Theano: a CPU and GPU math expression compiler,” in SciPy 2010 – Proceedings of the Python for Scientific Computing Conference, June 28-July 3, Austin, U.S.A., Proceedings, 2010, pp. 3-10

  22. [22]

    Towards better decoding and language model integration in sequence to sequence models

    J. Chorowski and N. Jaitly, “Towards better decoding and language model integration in sequence to sequence models,” arXiv preprint arXiv:1612.02695, 2016

  23. [23]

    Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

    Y. Wu, M. Schuster, Z. Chen, V. Le, M. Norouzi, W. Macherey, and J. Klingner, “Google's neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016

  24. [24]

    The PyTorch-Kaldi Speech Recognition Toolkit

    M. Ravanelli, T. Parcollet, and Y. Bengio, “The PyTorch-Kaldi Speech Recognition Toolkit,” arXiv preprint arXiv:1811.07453, 2018

  25. [25]

    Noise-contrastive estimation: A new estimation principle for unnormalized statistical models,

    M. Gutmann and A. Hyvärinen, “Noise-contrastive estimation: A new estimation principle for unnormalized statistical models,” in AISTATS 2010 – 13th International Conference on Artificial Intelligence and Statistics, May 13-15, Sardinia, Italy, Proceedings, 2010, pp. 297-304

  26. [26]

    Dropout: a simple way to prevent neural networks from overfitting,

    N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929-1958, 2014

  27. [27]

    Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

    S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015

  28. [28]

    An overview of gradient descent optimization algorithms

    S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv preprint arXiv:1609.04747, 2016

  29. [29]

    TensorFlow-Serving: Flexible, High-Performance ML Serving

    C. Olston, N. Fiedel, K. Gorovoy, J. Harmsen, L. Lao, F. Li, and J. Soyke, “Tensorflow-serving: Flexible, high-performance ml serving,” arXiv preprint arXiv:1712.06139, 2017

  30. [30]

    Boosted MMI for model and feature-space discriminative training,

    D. Povey, D. Kanevsky, B. Kingsbury, B. Ramabhadran, G. Saon, and K. Visweswariah, “Boosted MMI for model and feature-space discriminative training,” in ICASSP 2008 – 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, March 30-April 4, Las Vegas, U.S.A., Proceedings, 2008, pp. 4057-4060

  31. [31]

    The design for the Wall Street Journal-based CSR corpus,

    D. Paul and J. Baker, “The design for the Wall Street Journal-based CSR corpus,” in Proceedings of the workshop on Speech and Natural Language, February 23-26, Harriman, U.S.A., Proceedings, 1992

  32. [32]

    Librispeech: an ASR corpus based on public domain audio books,

    V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in ICASSP 2015 – 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, April 19-24, Brisbane, Australia, Proceedings, 2015, pp. 5206-5210