Scalable Multi Corpora Neural Language Models for ASR

Anirudh Raju; Ariya Rastrow; Denis Filimonov; Gautam Tiwari; Guitang Lan

arxiv: 1907.01677 · v1 · pith:WCE57WVMnew · submitted 2019-07-02 · 💻 cs.CL · cs.LG


Pith reviewed 2026-05-25 10:39 UTC · model grok-4.3


keywords neural language models, automatic speech recognition, n-best rescoring, word error rate, heterogeneous corpora, latency control, personalized bias


The pith

Neural language models trained on multiple corpora cut ASR word error rate by 6.2% relative in second-pass n-best rescoring, with only a minimal increase in latency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how neural language models can be made practical for large-scale automatic speech recognition by solving three engineering problems at once. It explains methods for training on text from many different sources, keeping the extra computation time small during decoding, and adding user-specific language adjustments. These changes let the neural model replace or augment conventional n-gram models inside an existing second-pass rescoring stage. A reader would care because the result is higher transcription accuracy without breaking the speed limits of real production systems.

Core claim

Solutions for training neural language models from heterogeneous corpora, limiting latency impact, and handling personalized bias can be combined in a second-pass n-best rescoring framework to produce a 6.2% relative word error rate reduction with only a minimal increase in latency.
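As a rough illustration of what second-pass n-best rescoring means here, the sketch below re-ranks a short list of first-pass hypotheses by adding a weighted language-model log-probability to each first-pass score. The `Hypothesis` type, the `nlm_weight` parameter, the log-linear combination, and the toy scoring table are all assumptions made for illustration; the page does not say how the paper combines scores.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    text: str
    first_pass_score: float  # log-domain score from the first pass (higher is better)


def rescore_nbest(
    nbest: List[Hypothesis],
    nlm_logprob: Callable[[str], float],
    nlm_weight: float = 0.5,  # hypothetical interpolation weight
) -> List[Hypothesis]:
    """Return the hypotheses sorted by combined score, best first."""
    def combined(h: Hypothesis) -> float:
        # Assumed log-linear combination of first-pass and NLM scores.
        return h.first_pass_score + nlm_weight * nlm_logprob(h.text)

    return sorted(nbest, key=combined, reverse=True)


# Toy stand-in for a neural LM: made-up sentence log-probabilities.
TOY_NLM = {
    "send a message to ann": -4.0,
    "send a massage to ann": -9.0,
}


def toy_nlm_logprob(text: str) -> float:
    return TOY_NLM.get(text, -20.0)


nbest = [
    Hypothesis("send a massage to ann", -10.0),
    Hypothesis("send a message to ann", -10.5),
]
best = rescore_nbest(nbest, toy_nlm_logprob, nlm_weight=0.5)[0]
print(best.text)
```

In this toy case the first pass slightly prefers the wrong hypothesis, and the added language-model term is enough to flip the ranking. Because only the n-best list is rescored, the extra cost grows with the list size rather than with the full search space.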

What carries the argument

The integrated pipeline of heterogeneous-corpus training, latency control techniques, and personalized bias handling applied to neural language model rescoring of n-best lists.

If this is right

Neural language models become usable at scale in production ASR by training across varied text sources.
The added computation from neural rescoring stays small enough to preserve real-time performance.
User-specific adaptations can be included without erasing the accuracy gains from the neural model.
Second-pass n-best rescoring with these neural models outperforms n-gram baselines by the stated margin.
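The latency point above depends on keeping neural rescoring cheap. The paper excerpt in the reference graph on this page mentions that the models are quantized to a 16-bit fixed-point representation, without reproducing the scheme. The sketch below shows one common way to do this, assuming symmetric quantization with a single scale per weight list; that choice and the helper names `quantize_int16` and `dequantize` are assumptions, not the paper's method.

```python
from typing import List, Tuple

INT16_MAX = 32767


def quantize_int16(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to 16-bit integer codes that share one scale (assumed scheme)."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / INT16_MAX if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the symmetric int16 range.
    codes = [max(-INT16_MAX, min(INT16_MAX, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


weights = [0.5, -0.25, 0.125, 0.0]  # made-up example values
codes, scale = quantize_int16(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, max_error)
```

With round-to-nearest, each restored weight differs from the original by about half a quantization step at most, which is the trade the approach makes for smaller, faster integer arithmetic.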

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same training and control techniques could be tested in first-pass decoding or in end-to-end ASR models to check for larger gains.
If the multi-corpus approach scales linearly, even larger and more diverse text collections might produce further error reductions.
The latency controls described may transfer to other sequence generation tasks that require fast neural model use.

Load-bearing premise

The engineering solutions for heterogeneous-corpus training, latency control, and personalized bias can be combined without introducing new error sources that cancel the reported WER gain.

What would settle it

An independent replication on another large ASR dataset that measures either no WER improvement or a latency increase beyond the minimal threshold when the same training and rescoring methods are used.

Original abstract

Neural language models (NLM) have been shown to outperform conventional n-gram language models by a substantial margin in Automatic Speech Recognition (ASR) and other tasks. There are, however, a number of challenges that need to be addressed for an NLM to be used in a practical large-scale ASR system. In this paper, we present solutions to some of the challenges, including training NLM from heterogenous corpora, limiting latency impact and handling personalized bias in the second-pass rescorer. Overall, we show that we can achieve a 6.2% relative WER reduction using neural LM in a second-pass n-best rescoring framework with a minimal increase in latency.
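The headline figure is a relative reduction, which is easy to confuse with an absolute one. The short sketch below spells out the arithmetic. The baseline and rescored WER values are hypothetical, chosen only so the result matches 6.2%; the page does not report the paper's absolute error rates.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which WER drops relative to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical values, NOT taken from the paper.
baseline = 10.0   # baseline WER in percent
rescored = 9.38   # WER after rescoring, in percent
reduction = relative_wer_reduction(baseline, rescored)
print(f"{reduction:.1%}")
```

In this made-up case the absolute drop is 0.62 percentage points, while the relative drop is 6.2%.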

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript presents engineering solutions for training neural language models on heterogeneous corpora, limiting latency impact during second-pass n-best rescoring, and incorporating personalized bias. It reports an aggregate result of 6.2% relative WER reduction in an ASR system with only minimal latency increase.

Significance. If the reported WER reduction is shown to be robust under joint evaluation of the components, the work would address practical barriers to deploying NLMs at scale in production ASR, providing concrete guidance on multi-corpus training and real-time constraints.

major comments (2)

[Abstract] Abstract, final sentence: the claim that the three solutions (heterogeneous-corpus NLM training, latency limiting, and personalized bias) can be combined to yield a net 6.2% relative WER reduction is presented without any experimental details, baselines, dataset sizes, ablation studies, or evidence that the components were validated jointly rather than in isolation.
[Abstract] Abstract: no description is given of how latency pruning or bias adaptation affects n-best list statistics or the multi-corpus distribution, leaving open the possibility that interactions among the three solutions introduce offsetting errors that reduce or eliminate the reported aggregate gain.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly note that the abstract is brief and does not include experimental details or discussion of component interactions. We address both points below and will revise the abstract accordingly.

Point-by-point responses

Referee: [Abstract] Abstract, final sentence: the claim that the three solutions (heterogeneous-corpus NLM training, latency limiting, and personalized bias) can be combined to yield a net 6.2% relative WER reduction is presented without any experimental details, baselines, dataset sizes, ablation studies, or evidence that the components were validated jointly rather than in isolation.

Authors: The abstract is intended as a concise summary. The manuscript provides the requested details in the Experiments section (dataset sizes from multiple heterogeneous corpora, n-gram and single-corpus baselines) and Results section (ablations for each component plus the joint evaluation of all three solutions together, which produces the reported 6.2% relative WER reduction with minimal latency increase). We will revise the abstract to briefly reference the joint system-level evaluation.
Revision: yes
Referee: [Abstract] Abstract: no description is given of how latency pruning or bias adaptation affects n-best list statistics or the multi-corpus distribution, leaving open the possibility that interactions among the three solutions introduce offsetting errors that reduce or eliminate the reported aggregate gain.

Authors: The body of the paper (Sections 4 and 5) explains the design choices for latency pruning (preserving n-best quality) and bias adaptation (integrated into multi-corpus training). The Results section reports the net gain from the fully combined system, indicating that offsetting interactions did not occur in our evaluation. We will add a short clarifying phrase to the abstract noting the joint validation.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical WER claim with no derivation or fitted equations

Full rationale

The paper reports an empirical 6.2% relative WER reduction from combining heterogeneous-corpus NLM training, latency control, and personalized bias in second-pass rescoring. No equations, predictions, or first-principles derivations are presented that could reduce to inputs by construction. The central result is a measured performance number on ASR tasks, not a derived quantity. No self-citations, ansatzes, or uniqueness theorems are invoked as load-bearing steps. The derivation chain is empty; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, no fitted constants, and no postulated entities, so the ledger is empty.

pith-pipeline@v0.9.0 · 5649 in / 1054 out tokens · 32936 ms · 2026-05-25T10:39:41.623290+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: we construct minibatches stochastically, by drawing samples from each corpus with probability according to its relevance weight... NCE based training... class content with tags <class> and </class>

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Overall, we show that we can achieve a 6.2% relative WER reduction using neural LM in a second-pass n-best rescoring framework
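The first quoted passage above describes building minibatches stochastically, drawing from each corpus with probability given by its relevance weight. A minimal sketch of that sampling step follows. The corpus names, the weight values, and the `sample_minibatch` helper are hypothetical, and the sketch does not cover how the paper sets the weights or the NCE training the passage also mentions.

```python
import random
from typing import Dict, List


def sample_minibatch(
    corpora: Dict[str, List[str]],
    relevance: Dict[str, float],
    batch_size: int,
    rng: random.Random,
) -> List[str]:
    """Draw sentences, picking the source corpus in proportion to its weight."""
    names = list(corpora)
    weights = [relevance[name] for name in names]
    batch = []
    for _ in range(batch_size):
        # Choose a corpus by relevance weight, then a sentence uniformly from it.
        corpus = rng.choices(names, weights=weights, k=1)[0]
        batch.append(rng.choice(corpora[corpus]))
    return batch


# Made-up corpora and weights, for illustration only.
corpora = {
    "in_domain": ["call mom", "text dad i am late"],
    "out_of_domain": ["the committee approved the budget"],
}
relevance = {"in_domain": 0.8, "out_of_domain": 0.2}
rng = random.Random(0)
batch = sample_minibatch(corpora, relevance, batch_size=8, rng=rng)
print(batch)
```

Sampling per example, rather than concatenating the corpora, means a small but highly relevant corpus can still dominate each minibatch regardless of its raw size.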

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 4 internal anchors

[1]

The most common approach to building LMs for ASR systems is to learn back-off n-gram models on large text corpora

Introduction Language Models (LM) are a key component in building Automatic Speech Recognition (ASR) systems. The most common approach to building LMs for ASR systems is to learn back-off n-gram models on large text corpora. Recurrent Neural Language Models (NLM) have been shown to consistently outperform traditional n-gram language models from lang...

work page
[2]

Methods and Challenges Addressed 2.1. Domain adaptation In a practical ASR system, the LM is often trained on multiple heterogenous corpora, comprising a mix of written corpora and manually transcribed spoken text corpora from various domains. These corpora may differ in terms of their vocabulary, content, style, argots, etc [7]. We require a solut...

work page
[3]

The ASR system comprises first-pass LM trained on a variety of in- and out-of-domain corpora, including written text data and transcribed speech data

Experimental Setup In all of the experiments in this paper, we build an ASR system that targets the message dictation task. The ASR system comprises first-pass LM trained on a variety of in- and out-of-domain corpora, including written text data and transcribed speech data. The transcribed speech data is from real user-agent interactions, and is bucke...

work page
[4]

The models are quantized to 16-bit fixed-point representation as described in Section 2.2.2

In addition, there are residual connections [33] between the layers. The models are quantized to 16-bit fixed-point representation as described in Section 2.2.2. The NLM is used to rescore 10-best hypotheses generated from first-pass decoding. From in-domain corpus, we extract the vocabulary of 60k most frequent words. All NLM models use this vocabular...

work page
[5]

Results and Discussion 4.0.1. Domain adaptation experiments Table 1 shows perplexity results comparing NLMs trained on a single data source against different domain adaptation methods described in Sections 2.1.1 and 2.1.2: mixing multiple corpora, applying transfer learning (fine-tuning), and combining both methods. First, we confirm that our voicema...

work page
[6]

Conclusions and Future Work In this work, we addressed several challenges for an NLM to be used in a practical large-scale ASR system. In particular, training an NLM from multiple heterogenous corpora using a novel data mixing strategy, along with transfer learning based on fine-tuning that provided 16.1% relative improvement in perplexity compared to ...

work page
[7]

A neural probabilistic language model,

Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, “A neural probabilistic language model,” Journal of Machine Learning Research, vol. 3, no. 6, pp. 1137–1155, 2003

work page 2003
[8]

Continuous space language models,

H. Schwenk, “Continuous space language models,” Computer Speech & Language, vol. 21, no. 3, pp. 492–518, 2007

work page 2007
[9]

Recurrent neural network based language model,

T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, “Recurrent neural network based language model,” in INTERSPEECH, 2010, pp. 1045–1048

work page 2010
[10]

Exploring the Limits of Language Modeling

R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu, “Exploring the limits of language modeling,” arXiv preprint arXiv:1602.02410, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[11]

Real-time one-pass decoding with recurrent neural network language model for speech recognition,

T. Hori, Y. Kubo, and A. Nakamura, “Real-time one-pass decoding with recurrent neural network language model for speech recognition,” in ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 6364–6368

work page 2014
[12]

Class-based n-gram models of natural language,

P. F. Brown, P. V. Desouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai, “Class-based n-gram models of natural language,” Computational linguistics, vol. 18, no. 4, pp. 467–479, 1992

work page 1992
[13]

Statistical language model adaptation: review and perspectives,

J. R. Bellegarda, “Statistical language model adaptation: review and perspectives,” Speech communication, vol. 42, no. 1, pp. 93–108, 2004

work page 2004
[14]

Scalable language model adaptation for spoken dialogue systems,

A. Gandhe, A. Rastrow, and B. Hoffmeister, “Scalable language model adaptation for spoken dialogue systems,” in 2018 IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 907–912

work page 2018
[15]

Contextual language model adaptation for conversational agents

A. Raju, B. Hedayatnia, L. Liu, A. Gandhe, C. Khatri, A. Metallinou, A. Venkatesh, and A. Rastrow, “Contextual language model adaptation for conversational agents.” in Interspeech 2018, 2018, pp. 3333–3337

work page 2018
[16]

Imagenet classification with deep convolutional neural networks,

A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105

work page 2012
[17]

Visualizing and understanding convolutional networks,

M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European conference on computer vision. Springer, 2014, pp. 818–833

work page 2014
[18]

How Transferable are Neural Networks in NLP Applications?

L. Mou, Z. Meng, R. Yan, G. Li, Y. Xu, L. Zhang, and Z. Jin, “How transferable are neural networks in nlp applications?” arXiv preprint arXiv:1603.06111, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[19]

Catastrophic interference in connectionist networks: The sequential learning problem,

M. McCloskey and N. J. Cohen, “Catastrophic interference in connectionist networks: The sequential learning problem,” Psychology of Learning and Motivation, vol. 24, pp. 109–165, 1989

work page 1989
[20]

Connectionist models of recognition memory: constraints imposed by learning and forgetting functions,

R. Ratcliff, “Connectionist models of recognition memory: constraints imposed by learning and forgetting functions,” Psychological Review, vol. 97, no. 2, pp. 285–308, 1990

work page 1990
[21]

Extensions of recurrent neural network language model,

T. Mikolov, S. Kombrink, L. Burget, J. Cernocky, and S. Khudanpur, “Extensions of recurrent neural network language model,” in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 5528–5531

work page 2011
[22]

Hierarchical probabilistic neural network language model

F. Morin and Y. Bengio, “Hierarchical probabilistic neural network language model.” in AISTATS, 2005

work page 2005
[23]

Efficient training and evaluation of recurrent neural network language models for automatic speech recognition,

X. Chen, X. Liu, Y. Wang, M. J. F. Gales, and P. C. Woodland, “Efficient training and evaluation of recurrent neural network language models for automatic speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 2146–2157, 2016

work page 2016
[24]

Variance regularization of rnnlm for speech recognition,

Y. Shi, W.-Q. Zhang, M. Cai, and J. Liu, “Variance regularization of rnnlm for speech recognition,” in ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 4893–4897

work page 2014
[25]

Fast and robust neural network joint models for statistical machine translation,

J. Devlin, R. Zbib, Z. Huang, T. Lamar, R. M. Schwartz, and J. Makhoul, “Fast and robust neural network joint models for statistical machine translation,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2014, pp. 1370–1380

work page 2014
[26]

Strategies for training large vocabulary neural language models,

W. Chen, D. Grangier, and M. Auli, “Strategies for training large vocabulary neural language models,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2016, pp. 1975–1985

work page 2016
[27]

When and why are log-linear models self-normalizing?

J. Andreas and D. Klein, “When and why are log-linear models self-normalizing?” in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015, pp. 244–249

work page 2015
[28]

A fast and simple algorithm for training neural probabilistic language models,

A. Mnih and Y. W. Teh, “A fast and simple algorithm for training neural probabilistic language models,” international conference on machine learning, pp. 419–426, 2012

work page 2012
[29]

Decoding with large-scale neural language models improves translation,

A. Vaswani, Y. Zhao, V. Fossum, and D. Chiang, “Decoding with large-scale neural language models improves translation,” in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013, pp. 1387–1392

work page 2013
[30]

Recurrent neural network language model training with noise contrastive estimation for speech recognition,

X. Chen, X. Liu, M. J. F. Gales, and P. C. Woodland, “Recurrent neural network language model training with noise contrastive estimation for speech recognition,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5411–5415

work page 2015
[31]

Simple, fast noise-contrastive estimation for large rnn vocabularies

B. Zoph, A. Vaswani, J. May, and K. Knight, “Simple, fast noise-contrastive estimation for large rnn vocabularies.” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1217–1222

work page 2016
[32]

Self-normalization properties of language modeling,

J. Goldberger and O. Melamud, “Self-normalization properties of language modeling,” international conference on computational linguistics, pp. 764–773, 2018

work page 2018
[33]

Quantizing deep convolutional networks for efficient inference: A whitepaper

R. Krishnamoorthi, “Quantizing deep convolutional networks for efficient inference: A whitepaper.” arXiv preprint arXiv:1806.08342, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[34]

Variational approximation of long-span language models for lvcsr,

A. Deoras, T. Mikolov, S. Kombrink, M. Karafiát, and S. Khudanpur, “Variational approximation of long-span language models for lvcsr,” in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 5532–5535

work page 2011
[35]

Improved recognition of contact names in voice commands,

P. Aleksic, C. Allauzen, D. Elson, A. Kracun, D. M. Casado, and P. J. Moreno, “Improved recognition of contact names in voice commands,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5172–5175

work page 2015
[36]

Improved backing-off for m-gram language modeling,

R. Kneser and H. Ney, “Improved backing-off for m-gram language modeling,” in 1995 International Conference on Acoustics, Speech, and Signal Processing, vol. 1, 1995, pp. 181–184

work page 1995
[37]

Long short-term memory,

S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997

work page 1997
[38]

Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition

H. Sak, A. W. Senior, and F. Beaufays, “Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition,” arXiv preprint arXiv:1402.1128, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[39]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778

work page 2016
