
arxiv: 1907.01677 · v1 · pith:WCE57WVMnew · submitted 2019-07-02 · 💻 cs.CL · cs.LG

Scalable Multi Corpora Neural Language Models for ASR

Pith reviewed 2026-05-25 10:39 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords neural language models · automatic speech recognition · n-best rescoring · word error rate · heterogeneous corpora · latency control · personalized bias

The pith

Neural language models trained on multiple corpora cut ASR word error rate by a relative 6.2 percent in second-pass n-best rescoring, with only a minimal increase in latency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how neural language models can be made practical for large-scale automatic speech recognition by solving three engineering problems at once. It explains methods for training on text from many different sources, keeping the extra computation time small during decoding, and adding user-specific language adjustments. These changes let the neural model replace or augment conventional n-gram models inside an existing second-pass rescoring stage. A reader would care because the result is higher transcription accuracy without breaking the speed limits of real production systems.

Core claim

Solutions for training neural language models from heterogeneous corpora, limiting latency impact, and handling personalized bias can be combined in a second-pass n-best rescoring framework to produce a 6.2% relative word error rate reduction with only a minimal increase in latency.
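To make "relative" concrete, here is a minimal sketch of the arithmetic. The 10.0% baseline WER is a hypothetical figure chosen for illustration; the abstract reports only the 6.2% relative reduction, not absolute error rates.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fractional WER reduction measured against the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical baseline, not taken from the paper.
baseline = 0.100
# A 6.2% relative reduction scales the baseline by (1 - 0.062).
rescored = baseline * (1 - 0.062)

print(f"absolute WER after rescoring: {rescored:.4f}")
print(f"relative reduction: {relative_wer_reduction(baseline, rescored):.3f}")
```

Under this assumed baseline, the absolute WER moves by well under one percentage point, which is why the relative and absolute figures should not be confused.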

What carries the argument

The integrated pipeline of heterogeneous-corpus training, latency control techniques, and personalized bias handling applied to neural language model rescoring of n-best lists.
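The rescoring step itself can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the log-linear combination, the `nlm_weight` parameter, the `Hypothesis` type, and the toy scorer `toy_nlm_logprob` are all hypothetical stand-ins, since the abstract does not say how first-pass and NLM scores are combined.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Hypothesis:
    words: List[str]
    first_pass_score: float  # log score from first-pass decoding (assumed)


def rescore_nbest(
    nbest: Sequence[Hypothesis],
    nlm_logprob: Callable[[List[str]], float],
    nlm_weight: float = 0.5,
) -> Hypothesis:
    """Return the hypothesis with the best combined first-pass and NLM score."""
    def total(h: Hypothesis) -> float:
        # Assumed log-linear interpolation of the two scores.
        return h.first_pass_score + nlm_weight * nlm_logprob(h.words)
    return max(nbest, key=total)


def toy_nlm_logprob(words: List[str]) -> float:
    # Stand-in for a neural LM: words in a tiny "in-domain" set cost less.
    in_domain = {"recognize", "speech"}
    return sum(-1.0 if w in in_domain else -4.0 for w in words)


nbest = [
    Hypothesis(["wreck", "a", "nice", "beach"], first_pass_score=-10.0),
    Hypothesis(["recognize", "speech"], first_pass_score=-11.0),
]
best = rescore_nbest(nbest, toy_nlm_logprob, nlm_weight=0.5)
print(" ".join(best.words))
```

In this toy list the first pass prefers the first hypothesis, but the language model score shifts the choice to the second. Because only the n-best list is rescored, the neural model runs on a small, fixed number of candidates per utterance.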

If this is right

  • Neural language models become usable at scale in production ASR by training across varied text sources.
  • The added computation from neural rescoring stays small enough to preserve real-time performance.
  • User-specific adaptations can be included without erasing the accuracy gains from the neural model.
  • Second-pass n-best rescoring with these neural models outperforms n-gram baselines by the stated margin.
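As an illustration of training across varied text sources, the sketch below draws each training batch from several corpora according to mixing weights. It is generic weighted sampling, not the paper's data mixing strategy, whose details are not given on this page; the function `mixed_batches`, the corpus names, and the weights are hypothetical.

```python
import random
from typing import Dict, Iterator, List


def mixed_batches(
    corpora: Dict[str, List[str]],
    weights: Dict[str, float],
    batch_size: int,
    seed: int = 0,
) -> Iterator[List[str]]:
    """Yield batches whose sentences come from corpora in proportion to weights."""
    rng = random.Random(seed)
    names = list(corpora)
    probs = [weights[name] for name in names]
    while True:
        # Pick a source corpus for each slot, then a sentence from that corpus.
        sources = rng.choices(names, weights=probs, k=batch_size)
        yield [rng.choice(corpora[source]) for source in sources]


# Hypothetical corpora and weights, for illustration only.
corpora = {
    "written": ["the meeting is at noon", "please find the report attached"],
    "transcribed": ["call me back later", "uh send a message to sam"],
}
weights = {"written": 0.3, "transcribed": 0.7}
batch = next(mixed_batches(corpora, weights, batch_size=4))
print(batch)
```

Sampling per slot rather than per batch keeps every batch a blend of sources, so a large out-of-domain corpus need not dominate simply because of its size.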

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same training and control techniques could be tested in first-pass decoding or in end-to-end ASR models to check for larger gains.
  • If the multi-corpus approach scales linearly, even larger and more diverse text collections might produce further error reductions.
  • The latency controls described may transfer to other sequence generation tasks that require fast neural model use.

Load-bearing premise

The engineering solutions for heterogeneous-corpus training, latency control, and personalized bias can be combined without introducing new error sources that cancel the reported WER gain.

What would settle it

An independent replication on another large ASR dataset that measures either no WER improvement or a latency increase beyond the minimal threshold when the same training and rescoring methods are used.
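A replication of this kind needs a WER measurement on both systems. The sketch below computes corpus-level WER with word-level edit distance; the function names and the example transcripts are hypothetical, and text normalization (casing, punctuation), which real evaluations usually apply first, is omitted.

```python
from typing import Iterable, List, Tuple


def word_errors(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Total word errors divided by total reference words."""
    pairs = list(pairs)
    errors = sum(word_errors(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    if words == 0:
        raise ValueError("reference transcripts contain no words")
    return errors / words


# Hypothetical (reference, hypothesis) pairs for two systems.
baseline_out = [("send a message to sam", "send a massage to sam")]
rescored_out = [("send a message to sam", "send a message to sam")]
print(corpus_wer(baseline_out), corpus_wer(rescored_out))
```

Running both systems over the same reference transcripts and comparing the two WER values, alongside a latency measurement, is the kind of check the replication would report.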

Original abstract

Neural language models (NLM) have been shown to outperform conventional n-gram language models by a substantial margin in Automatic Speech Recognition (ASR) and other tasks. There are, however, a number of challenges that need to be addressed for an NLM to be used in a practical large-scale ASR system. In this paper, we present solutions to some of the challenges, including training NLM from heterogenous corpora, limiting latency impact and handling personalized bias in the second-pass rescorer. Overall, we show that we can achieve a 6.2% relative WER reduction using neural LM in a second-pass n-best rescoring framework with a minimal increase in latency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript presents engineering solutions for training neural language models on heterogeneous corpora, limiting latency impact during second-pass n-best rescoring, and incorporating personalized bias. It reports an aggregate result of 6.2% relative WER reduction in an ASR system with only minimal latency increase.

Significance. If the reported WER reduction is shown to be robust under joint evaluation of the components, the work would address practical barriers to deploying NLMs at scale in production ASR, providing concrete guidance on multi-corpus training and real-time constraints.

major comments (2)
  1. [Abstract] Abstract, final sentence: the claim that the three solutions (heterogeneous-corpus NLM training, latency limiting, and personalized bias) can be combined to yield a net 6.2% relative WER reduction is presented without any experimental details, baselines, dataset sizes, ablation studies, or evidence that the components were validated jointly rather than in isolation.
  2. [Abstract] Abstract: no description is given of how latency pruning or bias adaptation affects n-best list statistics or the multi-corpus distribution, leaving open the possibility that interactions among the three solutions introduce offsetting errors that reduce or eliminate the reported aggregate gain.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly note that the abstract is brief and does not include experimental details or discussion of component interactions. We address both points below and will revise the abstract accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract, final sentence: the claim that the three solutions (heterogeneous-corpus NLM training, latency limiting, and personalized bias) can be combined to yield a net 6.2% relative WER reduction is presented without any experimental details, baselines, dataset sizes, ablation studies, or evidence that the components were validated jointly rather than in isolation.

    Authors: The abstract is intended as a concise summary. The manuscript provides the requested details in the Experiments section (dataset sizes from multiple heterogeneous corpora, n-gram and single-corpus baselines) and Results section (ablations for each component plus the joint evaluation of all three solutions together, which produces the reported 6.2% relative WER reduction with minimal latency increase). We will revise the abstract to briefly reference the joint system-level evaluation. revision: yes

  2. Referee: [Abstract] Abstract: no description is given of how latency pruning or bias adaptation affects n-best list statistics or the multi-corpus distribution, leaving open the possibility that interactions among the three solutions introduce offsetting errors that reduce or eliminate the reported aggregate gain.

    Authors: The body of the paper (Sections 4 and 5) explains the design choices for latency pruning (preserving n-best quality) and bias adaptation (integrated into multi-corpus training). The Results section reports the net gain from the fully combined system, indicating that offsetting interactions did not occur in our evaluation. We will add a short clarifying phrase to the abstract noting the joint validation. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical WER claim with no derivation or fitted equations

full rationale

The paper reports an empirical 6.2% relative WER reduction from combining heterogeneous-corpus NLM training, latency control, and personalized bias in second-pass rescoring. No equations, predictions, or first-principles derivations are presented that could reduce to inputs by construction. The central result is a measured performance number on ASR tasks, not a derived quantity. No self-citations, ansatzes, or uniqueness theorems are invoked as load-bearing steps. The derivation chain is empty; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, no fitted constants, and no postulated entities, so the ledger is empty.

pith-pipeline@v0.9.0 · 5649 in / 1054 out tokens · 32936 ms · 2026-05-25T10:39:41.623290+00:00 · methodology



Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 4 internal anchors

  1. [1]

    The most common approach to building LMs for ASR systems is to learn back-off n-gram models on large text corpora

    Introduction. Language Models (LM) are a key component in building Automatic Speech Recognition (ASR) systems. The most common approach to building LMs for ASR systems is to learn back-off n-gram models on large text corpora. Recurrent Neural Language Models (NLM) have been shown to consistently outperform traditional n-gram language models from lang...

  2. [2]

    Methods and Challenges Addressed. 2.1. Domain adaptation. In a practical ASR system, the LM is often trained on multiple heterogenous corpora, comprising a mix of written corpora and manually transcribed spoken text corpora from various domains. These corpora may differ in terms of their vocabulary, content, style, argots, etc. [7]. We require a solut...

  3. [3]

    The ASR system comprises first-pass LM trained on a variety of in- and out-of-domain corpora, including written text data and transcribed speech data

    Experimental Setup. In all of the experiments in this paper, we build an ASR system that targets the message dictation task. The ASR system comprises first-pass LM trained on a variety of in- and out-of-domain corpora, including written text data and transcribed speech data. The transcribed speech data is from real user-agent interactions, and is bucke...

  4. [4]

    The models are quantized to 16-bit fixed-point representation as described in Section 2.2.2

    In addition, there are residual connections [33] between the layers. The models are quantized to 16-bit fixed-point representation as described in Section 2.2.2. The NLM is used to rescore 10-best hypotheses generated from first-pass decoding. From in-domain corpus, we extract the vocabulary of 60k most frequent words. All NLM models use this vocabular...

  5. [5]

    Results and Discussion. 4.0.1. Domain adaptation experiments. Table 1 shows perplexity results comparing NLMs trained on a single data source against different domain adaptation methods described in Sections 2.1.1 and 2.1.2: mixing multiple corpora, applying transfer learning (fine-tuning), and combining both methods. First, we confirm that our voicema...

  6. [6]

    Conclusions and Future Work. In this work, we addressed several challenges for an NLM to be used in a practical large-scale ASR system. In particular, training an NLM from multiple heterogenous corpora using a novel data mixing strategy, along with transfer learning based on fine-tuning that provided 16.1% relative improvement in perplexity compared to ...

  7. [7]

    A neural probabilistic language model,

    Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, “A neural probabilistic language model,” Journal of Machine Learning Research, vol. 3, no. 6, pp. 1137–1155, 2003

  8. [8]

    Continuous space language models,

    H. Schwenk, “Continuous space language models,” Computer Speech & Language, vol. 21, no. 3, pp. 492–518, 2007

  9. [9]

    Recurrent neural network based language model,

    T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, “Recurrent neural network based language model,” in INTERSPEECH, 2010, pp. 1045–1048

  10. [10]

    Exploring the Limits of Language Modeling

    R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu, “Exploring the limits of language modeling,” arXiv preprint arXiv:1602.02410, 2016

  11. [11]

    Real-time one-pass decoding with recurrent neural network language model for speech recognition,

    T. Hori, Y. Kubo, and A. Nakamura, “Real-time one-pass decoding with recurrent neural network language model for speech recognition,” in ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 6364–6368

  12. [12]

    Class-based n-gram models of natural language,

    P. F. Brown, P. V. Desouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai, “Class-based n-gram models of natural language,” Computational linguistics, vol. 18, no. 4, pp. 467–479, 1992

  13. [13]

    Statistical language model adaptation: review and perspectives,

    J. R. Bellegarda, “Statistical language model adaptation: review and perspectives,” Speech communication, vol. 42, no. 1, pp. 93–108, 2004

  14. [14]

    Scalable language model adaptation for spoken dialogue systems,

    A. Gandhe, A. Rastrow, and B. Hoffmeister, “Scalable language model adaptation for spoken dialogue systems,” in 2018 IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 907–912

  15. [15]

    Contextual language model adaptation for conversational agents

    A. Raju, B. Hedayatnia, L. Liu, A. Gandhe, C. Khatri, A. Metallinou, A. Venkatesh, and A. Rastrow, “Contextual language model adaptation for conversational agents,” in Interspeech 2018, 2018, pp. 3333–3337

  16. [16]

    Imagenet classification with deep convolutional neural networks,

    A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105

  17. [17]

    Visualizing and understanding convolutional networks,

    M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European conference on computer vision. Springer, 2014, pp. 818–833

  18. [18]

    How Transferable are Neural Networks in NLP Applications?

    L. Mou, Z. Meng, R. Yan, G. Li, Y. Xu, L. Zhang, and Z. Jin, “How transferable are neural networks in nlp applications?” arXiv preprint arXiv:1603.06111, 2016

  19. [19]

    Catastrophic interference in connectionist networks: The sequential learning problem,

    M. McCloskey and N. J. Cohen, “Catastrophic interference in connectionist networks: The sequential learning problem,” Psychology of Learning and Motivation, vol. 24, pp. 109–165, 1989

  20. [20]

    Connectionist models of recognition memory: constraints imposed by learning and forgetting functions,

    R. Ratcliff, “Connectionist models of recognition memory: constraints imposed by learning and forgetting functions,” Psychological Review, vol. 97, no. 2, pp. 285–308, 1990

  21. [21]

    Extensions of recurrent neural network language model,

    T. Mikolov, S. Kombrink, L. Burget, J. Cernocky, and S. Khudanpur, “Extensions of recurrent neural network language model,” in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 5528–5531

  22. [22]

    Hierarchical probabilistic neural network language model

    F. Morin and Y. Bengio, “Hierarchical probabilistic neural network language model,” in AISTATS, 2005

  23. [23]

    Efficient training and evaluation of recurrent neural network language models for automatic speech recognition,

    X. Chen, X. Liu, Y. Wang, M. J. F. Gales, and P. C. Woodland, “Efficient training and evaluation of recurrent neural network language models for automatic speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 2146–2157, 2016

  24. [24]

    Variance regularization of rnnlm for speech recognition,

    Y. Shi, W.-Q. Zhang, M. Cai, and J. Liu, “Variance regularization of rnnlm for speech recognition,” in ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 4893–4897

  25. [25]

    Fast and robust neural network joint models for statistical machine translation,

    J. Devlin, R. Zbib, Z. Huang, T. Lamar, R. M. Schwartz, and J. Makhoul, “Fast and robust neural network joint models for statistical machine translation,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2014, pp. 1370–1380

  26. [26]

    Strategies for training large vocabulary neural language models,

    W. Chen, D. Grangier, and M. Auli, “Strategies for training large vocabulary neural language models,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2016, pp. 1975–1985

  27. [27]

    When and why are log-linear models self-normalizing?

    J. Andreas and D. Klein, “When and why are log-linear models self-normalizing?” in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015, pp. 244–249

  28. [28]

    A fast and simple algorithm for training neural probabilistic language models,

    A. Mnih and Y. W. Teh, “A fast and simple algorithm for training neural probabilistic language models,” international conference on machine learning, pp. 419–426, 2012

  29. [29]

    Decoding with large-scale neural language models improves translation,

    A. Vaswani, Y. Zhao, V. Fossum, and D. Chiang, “Decoding with large-scale neural language models improves translation,” in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013, pp. 1387–1392

  30. [30]

    Recurrent neural network language model training with noise contrastive estimation for speech recognition,

    X. Chen, X. Liu, M. J. F. Gales, and P. C. Woodland, “Recurrent neural network language model training with noise contrastive estimation for speech recognition,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5411–5415

  31. [31]

    Simple, fast noise-contrastive estimation for large rnn vocabularies

    B. Zoph, A. Vaswani, J. May, and K. Knight, “Simple, fast noise-contrastive estimation for large rnn vocabularies,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1217–1222

  32. [32]

    Self-normalization properties of language modeling,

    J. Goldberger and O. Melamud, “Self-normalization properties of language modeling,” international conference on computational linguistics, pp. 764–773, 2018

  33. [33]

    Quantizing deep convolutional networks for efficient inference: A whitepaper

    R. Krishnamoorthi, “Quantizing deep convolutional networks for efficient inference: A whitepaper,” arXiv preprint arXiv:1806.08342, 2018

  34. [34]

    Variational approximation of long-span language models for lvcsr,

    A. Deoras, T. Mikolov, S. Kombrink, M. Karafiát, and S. Khudanpur, “Variational approximation of long-span language models for lvcsr,” in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 5532–5535

  35. [35]

    Improved recognition of contact names in voice commands,

    P. Aleksic, C. Allauzen, D. Elson, A. Kracun, D. M. Casado, and P. J. Moreno, “Improved recognition of contact names in voice commands,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5172–5175

  36. [36]

    Improved backing-off for m-gram language modeling,

    R. Kneser and H. Ney, “Improved backing-off for m-gram language modeling,” in 1995 International Conference on Acoustics, Speech, and Signal Processing, vol. 1, 1995, pp. 181–184

  37. [37]

    Long short-term memory,

    S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997

  38. [38]

    Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition

    H. Sak, A. W. Senior, and F. Beaufays, “Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition,” arXiv preprint arXiv:1402.1128, 2014

  39. [39]

    Deep residual learning for image recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778