Lattice-Based Unsupervised Test-Time Adaptation of Neural Network Acoustic Models

Joachim Fainberg; Ondrej Klejch; Peter Bell; Steve Renals

arXiv: 1906.11521 · v1 · pith:BJ6EQI5Lnew · submitted 2019-06-27 · 💻 cs.CL · cs.SD · eess.AS

Pith reviewed 2026-05-25 15:05 UTC · model grok-4.3

classification 💻 cs.CL · cs.SD · eess.AS

keywords unsupervised adaptation · test-time adaptation · acoustic model adaptation · lattice-based adaptation · LF-MMI · neural network acoustic models · speech recognition · discriminative adaptation

The pith

Using lattices from first-pass decoding for discriminative adaptation of neural acoustic models allows adapting more parameters without overfitting even when initial transcriptions have over 50% word error rate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that unsupervised test-time adaptation of neural network acoustic models can be done effectively by using lattices rather than one-best transcriptions. This approach integrates into the lattice-free maximum mutual information framework and avoids the need for heavy regularization against errors in initial transcripts. A reader would care because it addresses the mismatch between training and test conditions in speech recognition across varied tasks including TED talks, multi-genre broadcasts, and low-resource Somali. The method succeeds where one-best methods fail due to overfitting.

Core claim

Discriminative adaptation using lattices obtained from a first pass decoding can be integrated into the LF-MMI framework, enabling adaptation of many more parameters without observing overfitting, and remaining successful even when the initial transcription has a WER in excess of 50% on tasks such as TED talks, MGB, and Somali.

What carries the argument

Lattice-based discriminative adaptation within the LF-MMI framework, where lattices from unadapted model decoding provide the supervision for adaptation transforms.
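
As a rough sketch in our own notation (the symbols below are assumptions for illustration, not taken from the paper), the MMI-style objective that LF-MMI optimises can be written as a log-ratio of numerator and denominator graph likelihoods. In supervised training the numerator graph encodes the reference transcript; in the lattice-based setting described here it would instead be built from the first-pass lattice, so the sum over paths covers alternative hypotheses rather than a single one-best sequence.

```latex
\mathcal{F}_{\mathrm{MMI}}(\theta)
  = \sum_{u} \log
    \frac{p_{\theta}\!\left(\mathbf{X}_u \mid \mathcal{G}^{\mathrm{num}}_u\right)}
         {p_{\theta}\!\left(\mathbf{X}_u \mid \mathcal{G}^{\mathrm{den}}\right)},
\qquad
p_{\theta}(\mathbf{X}_u \mid \mathcal{G})
  = \sum_{\pi \in \mathcal{G}} P(\pi) \prod_{t} p_{\theta}\!\left(\mathbf{x}_{u,t} \mid \pi_t\right)
```

Here $\mathcal{G}^{\mathrm{num}}_u$ is the per-utterance numerator graph, $\mathcal{G}^{\mathrm{den}}$ the shared denominator graph, and $\pi$ a state path with prior weight $P(\pi)$. With a one-best transcript the numerator sum collapses to paths consistent with a single word sequence, which is where transcription errors would be reinforced.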

If this is right

Adaptation of many more parameters becomes possible without overfitting.
The method works on tasks of varying difficulty, including those with high-error initial transcripts.
It integrates readily into the existing LF-MMI training framework.
Reduces mismatch between training and testing conditions in acoustic models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach could reduce the need for supervised adaptation data in new domains.
Extending lattice use might improve robustness in other unsupervised learning settings for sequence models.
The method could be tested on additional low-resource languages to confirm generalization.
It could potentially be combined with other adaptation techniques such as speaker adaptation.

Load-bearing premise

Lattices generated by an unadapted model on test data contain enough correct discriminative information to guide adaptation without the model overfitting to transcription errors, even at word error rates above 50%.
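
To make this premise concrete, here is a minimal sketch that uses a scored n-best list as a stand-in for a lattice. The sentences, scores, and function names are all made up for illustration and are not from the paper. It shows how the top-scoring path can have a WER above 50% while the set of alternatives still contains a much better path.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate of hyp against ref (both whitespace-tokenised strings)."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def one_best_and_oracle_wer(ref, scored_paths):
    """scored_paths: list of (score, hypothesis); a higher score is more likely."""
    best_hyp = max(scored_paths, key=lambda p: p[0])[1]
    oracle = min(wer(ref, hyp) for _, hyp in scored_paths)
    return wer(ref, best_hyp), oracle


# Hypothetical toy data: an n-best list approximating a first-pass lattice.
reference = "the cat sat on the mat"
paths = [
    (0.40, "a cap sat in a mat"),
    (0.35, "the cat sat on a mat"),
    (0.25, "the cap sat on the map"),
]
one_best, oracle = one_best_and_oracle_wer(reference, paths)
print(f"one-best WER: {one_best:.2f}, oracle WER: {oracle:.2f}")
```

A real lattice encodes far more paths than an n-best list, and the adaptation objective weights them rather than picking the oracle, so this only illustrates why the alternatives can carry useful signal.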

What would settle it

Running the adaptation on a dataset where initial WER exceeds 50% and observing that the adapted model shows higher WER than the unadapted baseline or the one-best adaptation method.
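
The check described above can be sketched as a small comparison. The function names and the WER values are hypothetical placeholders, not results from the paper.

```python
def relative_wer_change(baseline_wer, adapted_wer):
    """Relative WER change in percent; negative means the adapted model is better."""
    return 100.0 * (adapted_wer - baseline_wer) / baseline_wer


def claim_is_contradicted(unadapted_wer, one_best_adapted_wer, lattice_adapted_wer):
    """True if lattice adaptation ends up worse than either comparison system."""
    return (lattice_adapted_wer > unadapted_wer
            or lattice_adapted_wer > one_best_adapted_wer)


# Illustrative numbers only (WER in percent), chosen so the initial WER exceeds 50%.
unadapted, one_best_adapted, lattice_adapted = 55.0, 56.0, 52.0
print(claim_is_contradicted(unadapted, one_best_adapted, lattice_adapted))
print(round(relative_wer_change(unadapted, lattice_adapted), 1))
```

A single dataset would not settle the question on its own; the comparison would need to hold across test sets and be checked for statistical significance.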

Original abstract

Acoustic model adaptation to unseen test recordings aims to reduce the mismatch between training and testing conditions. Most adaptation schemes for neural network models require the use of an initial one-best transcription for the test data, generated by an unadapted model, in order to estimate the adaptation transform. It has been found that adaptation methods using discriminative objective functions - such as cross-entropy loss - often require careful regularisation to avoid over-fitting to errors in the one-best transcriptions. In this paper we solve this problem by performing discriminative adaptation using lattices obtained from a first pass decoding, an approach that can be readily integrated into the lattice-free maximum mutual information (LF-MMI) framework. We investigate this approach on three transcription tasks of varying difficulty: TED talks, multi-genre broadcast (MGB) and a low-resource language (Somali). We find that our proposed approach enables many more parameters to be adapted without over-fitting being observed, and is successful even when the initial transcription has a WER in excess of 50%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Lattice-based LF-MMI adaptation lets you update far more parameters at test time without the usual overfitting, even above 50% initial WER.

The main point is that this paper replaces one-best transcripts with lattices from a first-pass decode when doing unsupervised discriminative adaptation inside the LF-MMI framework. That change is what lets them adapt many more parameters without the overfitting that normally appears when the initial transcription is noisy. They test the idea on TED talks, MGB, and Somali, three tasks that differ in difficulty and resource level, and report that the lattice version stays stable even when the starting WER is over 50%.

The work is a straightforward extension of existing LF-MMI practice rather than a completely new framework. What it does well is show that the lattice signal is rich enough to support heavier adaptation at test time without extra regularization tricks. The three-task setup is useful because it checks whether the benefit holds when the initial model is already weak.

The soft spots are mostly about missing detail in the summary. The abstract gives no numbers, no direct one-best versus lattice ablations, and no lattice-density stats, so the size of the improvement and the exact conditions under which it appears are not visible yet. If the full tables show only small gains or if the lattices are unusually dense, the practical advantage shrinks. Computation cost of handling the lattices at test time is also not addressed in the summary.

This is for people already working with LF-MMI or lattice-based training in speech recognition who need test-time adaptation on mismatched or low-resource data. A reader who wants a simple, integrable fix for the overfitting problem will find something usable here.

I would send it to peer review. The core idea is testable and the experimental scope is reasonable; the referee can check whether the reported gains survive proper controls.

Referee Report

0 major / 2 minor

Summary. The paper proposes lattice-based unsupervised test-time adaptation of neural network acoustic models within the lattice-free maximum mutual information (LF-MMI) framework. Instead of relying on one-best transcriptions from an unadapted model, it uses lattices from first-pass decoding to perform discriminative adaptation. The central claim is that this enables adaptation of many more parameters without overfitting and remains effective even when initial WER exceeds 50%, as demonstrated on three tasks of varying difficulty: TED talks, multi-genre broadcast (MGB), and Somali.

Significance. If the results hold, the approach offers a practical way to increase the number of adaptable parameters in test-time adaptation while avoiding the overfitting issues common with one-best transcriptions and discriminative objectives. The integration with the established LF-MMI framework is a strength, as is the evaluation across tasks spanning high-resource to low-resource settings. The empirical demonstration that the method works above 50% initial WER directly addresses a key practical limitation in unsupervised adaptation.

minor comments (2)

[Abstract] The abstract and introduction would benefit from a brief statement of the exact number of parameters adapted in the lattice-based vs. one-best conditions to make the 'many more parameters' claim immediately quantifiable.
[Experimental results] Figure or table captions should explicitly note the initial WER of the unadapted model for each task so readers can directly verify the >50% WER regime without cross-referencing the text.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report lists no specific major comments.

Circularity Check

0 steps flagged

No significant circularity detected

The paper's central approach integrates lattices from first-pass decoding into the established LF-MMI framework for unsupervised adaptation. No derivation step reduces by construction to fitted parameters or self-referential definitions; the method is presented as an extension of prior LF-MMI work with external lattice inputs, and empirical results on TED, MGB, and Somali are reported as independent validation rather than tautological outcomes. No self-citation chains, ansatzes smuggled via citation, or uniqueness theorems imported from the authors appear in the load-bearing claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review provides insufficient detail for exhaustive ledger; main domain assumption is the utility of first-pass lattices for adaptation.

axioms (1)

domain assumption: Lattices from unadapted first-pass decoding contain useful information for discriminative adaptation even with WER exceeding 50%.
This is central to enabling the method without overfitting to transcription errors.

pith-pipeline@v0.9.0 · 5716 in / 1141 out tokens · 25401 ms · 2026-05-25T15:05:14.959648+00:00 · methodology

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 1 internal anchor

[1]

In feature-space adaptation, transformations of acoustic features are estimated to maximise the log-likelihood of the adaptation data [1, 2]

Introduction. Acoustic model adaptation aims to improve automatic speech recognition (ASR) accuracy by reducing the mismatch between training and test conditions. In feature-space adaptation, transformations of acoustic features are estimated to maximise the log-likelihood of the adaptation data [1, 2]. A subset of the weights of a neural network acou...

work page
[2]

Methods. 2.1. Lattice supervision and LF-MMI. Discriminative training using criteria such as maximum mutual information (MMI) [26] has been shown to be sensitive to the accuracy of the transcripts [12, 27]. In lieu of better transcripts, a range of transcript filtering approaches have previously been explored [12, 13, 14]. In unsupervised or semi-supervis...

work page
[3]

surprise language

Experiments. We conducted test-time model adaptation experiments on three datasets: the TED-LIUM corpus of TED talks [23, 24], multi-genre TV broadcasts from the MGB 1 Challenge [25] and a corpus of Somali from the IARPA MATERIAL programme. All models were trained and adapted using the Kaldi toolkit [36]. We describe the respective baseline models in s...

work page 2012
[4]

Results. We conducted the first set of experiments on the TED-LIUM dataset. Adaptation of the model without i-vectors using lattices achieves 10–15% relative improvement when adapting LHUC parameters and 9–14% relative improvement when adapting all parameters, whereas improvements when adapting using best path and all adaptation data were much small...

work page
[5]

Conclusions. In this paper we compared unsupervised model adaptation using a lattice with the best path obtained from the first pass decoding as supervision. Our experiments show that using the lattice as supervision achieves better results than using the best path, even when confidence-based data selection is used to remove transcripts with many po...

work page
[6]

Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models

C. J. Leggetter and P. C. Woodland, “Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models,” Computer Speech & Language, vol. 9, no. 2, pp. 171–185, 1995

work page 1995
[7]

Maximum likelihood linear transformations for HMM-based speech recognition

M. Gales, “Maximum likelihood linear transformations for HMM-based speech recognition,” Computer Speech & Language, vol. 12, no. 2, pp. 75–98, 1998

work page 1998
[8]

Learning hidden unit contributions for unsupervised acoustic model adaptation

P. Swietojanski, J. Li, and S. Renals, “Learning hidden unit contributions for unsupervised acoustic model adaptation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, pp. 1450–1463, 2016

work page 2016
[9]

Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network

J. Xue, J. Li, D. Yu, M. Seltzer, and Y. Gong, “Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network,” in ICASSP, 2014

work page 2014
[10]

Speaker adaptation of context dependent deep neural networks

H. Liao, “Speaker adaptation of context dependent deep neural networks,” in ICASSP, 2013

work page 2013
[11]

KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition

D. Yu, K. Yao, H. Su, G. Li, and F. Seide, “KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition,” in ICASSP, 2013

work page 2013
[12]

Front-end factor analysis for speaker verification

N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011

work page 2011
[13]

Speaker adaptation of neural network acoustic models using i-vectors

G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, “Speaker adaptation of neural network acoustic models using i-vectors,” in ASRU, 2013

work page 2013
[14]

Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code

O. Abdel-Hamid and H. Jiang, “Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code,” in ICASSP, 2013

work page 2013
[15]

On combining i-vectors and discriminative adaptation methods for unsupervised speaker normalization in DNN acoustic models

L. Samarakoon and K. C. Sim, “On combining i-vectors and discriminative adaptation methods for unsupervised speaker normalization in DNN acoustic models,” in IEEE ICASSP, 2016

work page 2016
[16]

Speaker adaptation for continuous density HMMs: A review

P. C. Woodland, “Speaker adaptation for continuous density HMMs: A review,” in ISCA Workshop on Adaptation Methods for Speech Recognition, 2001

work page 2001
[17]

Discriminative training of acoustic models applied to domains with unreliable transcripts [speech recognition applications]

L. Mathias, G. Yegnanarayanan, and J. Fritsch, “Discriminative training of acoustic models applied to domains with unreliable transcripts [speech recognition applications],” in ICASSP, 2005

work page 2005
[18]

Investigating data selection for minimum phone error training of acoustic models

S.-H. Liu, F.-H. Chu, S.-H. Lin, and B. Chen, “Investigating data selection for minimum phone error training of acoustic models,” in Multimedia and Expo, 2007 IEEE International Conference on. IEEE, 2007

work page 2007
[19]

Semi-Supervised Model Training for Unbounded Conversational Speech Recognition

S. Walker, M. Pedersen, I. Orife, and J. Flaks, “Semi-supervised model training for unbounded conversational speech recognition,” arXiv preprint arXiv:1705.09724, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[20]

Semi-supervised DNN training with word selection for ASR

K. Veselý, L. Burget, and J. Černocký, “Semi-supervised DNN training with word selection for ASR,” in Interspeech, 2017

work page 2017
[21]

DNN adaptation by automatic quality estimation of ASR hypotheses

D. Falavigna, M. Matassoni, S. Jalalvand, M. Negri, and M. Turchi, “DNN adaptation by automatic quality estimation of ASR hypotheses,” Computer Speech & Language, vol. 46, pp. 585–604, 2017

work page 2017
[22]

Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models

P. Swietojanski and S. Renals, “Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models,” in SLT, 2014

work page 2014
[23]

Subspace LHUC for fast adaptation of deep neural network acoustic models

L. Samarakoon and K. C. Sim, “Subspace LHUC for fast adaptation of deep neural network acoustic models,” in INTERSPEECH, 2016

work page 2016
[24]

Extended low-rank plus diagonal adaptation for deep and recurrent neural networks

Y. Zhao, J. Li, K. Kumar, and Y. Gong, “Extended low-rank plus diagonal adaptation for deep and recurrent neural networks,” in ICASSP, 2017

work page 2017
[25]

Regularized adaptation of discriminative classifiers

X. Li and J. Bilmes, “Regularized adaptation of discriminative classifiers,” in ICASSP, 2006

work page 2006
[26]

Lattice-based unsupervised acoustic model training

T. Fraga-Silva, J.-L. Gauvain, and L. Lamel, “Lattice-based unsupervised acoustic model training,” in ICASSP, 2011

work page 2011
[27]

Semi-supervised training of acoustic models using lattice-free MMI

V. Manohar, H. Hadian, D. Povey, and S. Khudanpur, “Semi-supervised training of acoustic models using lattice-free MMI,” in ICASSP, 2018

work page 2018
[28]

TED-LIUM: an automatic speech recognition dedicated corpus

A. Rousseau, P. Deléglise, and Y. Estève, “TED-LIUM: an automatic speech recognition dedicated corpus,” in LREC, 2012

work page 2012
[29]

Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks

A. Rousseau, P. Deléglise, and Y. Estève, “Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks,” in LREC, 2014

work page 2014
[30]

The MGB challenge: Evaluating multi-genre broadcast media recognition

P. Bell, M. Gales, T. Hain, J. Kilgour, P. Lanchantin, X. Liu, A. McParland, S. Renals, O. Saz, M. Wester, and P. C. Woodland, “The MGB challenge: Evaluating multi-genre broadcast media recognition,” in ASRU, 2015

work page 2015
[31]

Maximum mutual information estimation of hidden Markov model parameters for speech recognition

L. Bahl, P. Brown, P. De Souza, and R. Mercer, “Maximum mutual information estimation of hidden Markov model parameters for speech recognition,” in Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP’86, vol. 11. IEEE, 1986, pp. 49–52

work page 1986
[32]

Unsupervised training and directed manual transcription for LVCSR

K. Yu, M. Gales, L. Wang, and P. C. Woodland, “Unsupervised training and directed manual transcription for LVCSR,” Speech Communication, vol. 52, no. 7-8, pp. 652–663, 2010

work page 2010
[33]

Lattice-based unsupervised MLLR for speaker adaptation

M. Padmanabhan, G. Saon, and G. Zweig, “Lattice-based unsupervised MLLR for speaker adaptation,” in ASR2000-automatic speech recognition: challenges for the New Millenium ISCA Tutorial and Research Workshop (ITRW), 2000

work page 2000
[34]

Discriminative training for large vocabulary speech recognition

D. Povey, “Discriminative training for large vocabulary speech recognition,” Ph.D. dissertation, University of Cambridge, 2005

work page 2005
[35]

Purely sequence-trained neural networks for ASR based on lattice-free MMI

D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequence-trained neural networks for ASR based on lattice-free MMI,” Interspeech, 2016

work page 2016
[36]

Sequence-discriminative training of deep neural networks

K. Veselý, A. Ghoshal, L. Burget, and D. Povey, “Sequence-discriminative training of deep neural networks,” in Interspeech, 2013

work page 2013
[37]

A novel loss function for the overall risk criterion based discriminative training of HMM models

J. Kaiser, B. Horvat, and Z. Kacic, “A novel loss function for the overall risk criterion based discriminative training of HMM models,” in Sixth International Conference on Spoken Language Processing, 2000

work page 2000
[38]

End-to-end speech recognition using lattice-free MMI

H. Hadian, H. Sameti, D. Povey, and S. Khudanpur, “End-to-end speech recognition using lattice-free MMI,” Interspeech, 2018

work page 2018
[39]

A compact model for speaker-adaptive training

T. Anastasakos, J. McDonough, R. Schwartz, and J. Makhoul, “A compact model for speaker-adaptive training,” in ICSLP, 1996, pp. 1137–1140

work page 1996
[40]

SAT-LHUC: Speaker adaptive training for learning hidden unit contributions

P. Swietojanski and S. Renals, “SAT-LHUC: Speaker adaptive training for learning hidden unit contributions,” in ICASSP, 2016

work page 2016
[41]

The Kaldi speech recognition toolkit

D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlíček, Y. Qian, P. Schwarz, J. Silovský, G. Stemmer, and K. Veselý, “The Kaldi speech recognition toolkit,” in ASRU, 2011

work page 2011
[42]

A time delay neural network architecture for efficient modeling of long temporal contexts

V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” in Interspeech, 2015

work page 2015
[43]

Overview of the IWSLT 2012 evaluation campaign

M. Federico, M. Cettolo, L. Bentivogli, M. Paul, and S. Stüker, “Overview of the IWSLT 2012 evaluation campaign,” in IWSLT, 2012

work page 2012
[44]

Semi-orthogonal low-rank matrix factorization for deep neural networks

D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohamadi, and S. Khudanpur, “Semi-orthogonal low-rank matrix factorization for deep neural networks,” Interspeech, 2018

work page 2018
[45]

A pitch extraction algorithm tuned for automatic speech recognition

P. Ghahremani, B. BabaAli, D. Povey, K. Riedhammer, J. Trmal, and S. Khudanpur, “A pitch extraction algorithm tuned for automatic speech recognition,” in ICASSP, 2014

work page 2014
[46]

Multilingual representations for low resource speech recognition and keyword search

J. Cui, B. Kingsbury, B. Ramabhadran, A. Sethy, K. Audkhasi, X. Cui, E. Kislal, L. Mangu, M. Nussbaum-Thom, M. Picheny, Z. Tüske, P. Golik, R. Schlüter, H. Ney, M. J. F. Gales, K. M. Knill, A. Ragni, H. Wang, and P. C. Woodland, “Multilingual representations for low resource speech recognition and keyword search,” in IEEE ICASSP, 2016

work page 2016
[47]

Low-resource speech recognition and keyword-spotting

M. J. F. Gales, K. M. Knill, and A. Ragni, “Low-resource speech recognition and keyword-spotting,” in SPECOM, 2017

work page 2017

[1] [1]

In feature-space adaptation , trans- formations of acoustic features are estimated to maximise t he log-likelihood of the adaptation data [1, 2]

Introduction Acoustic model adaptation aims to improve automatic speech recognition (ASR) accuracy by reducing the mismatch betwee n training and test conditions. In feature-space adaptation , trans- formations of acoustic features are estimated to maximise t he log-likelihood of the adaptation data [1, 2]. A subset of the weights of a neural network acou...

work page

[2] [2]

Methods 2.1. Lattice supervision and LF-MMI Discriminative training using criteria such as maximum mut ual information (MMI) [26] has been shown to be sensitive to the accuracy of the transcripts [12, 27]. In lieu of better trans cripts, a range of transcript ﬁltering approaches have previously b een explored [12, 13, 14]. In unsupervised or semi-supervis...

work page

[3] [3]

surprise language

Experiments We conducted test-time model adaptation experiments on thr ee datasets: the TED-LIUM corpus of TED talks [23, 24], multi- genre TV broadcasts from the MGB 1 Challenge [25] and a corpus of Somali from the IARPA MA TERIAL programme. All models were trained and adapted using the Kaldi toolkit [36] . We describe the respective baseline models in s...

work page 2012

[4] [4]

Results We conducted the ﬁrst set of experiments on the TED-LIUM dataset. Adaptation of the model without i-vectors using la t- tices achieves 10 − 15% relative improvement when adapting LHUC parameters and 9 − 14% relative improvement when adapting all parameters, whereas improvements when adapti ng using best path and all adaptation data were much small...

work page

[5] [5]

Conclusions In this paper we compared unsupervised model adaptation us- ing a lattice with the best path obtained from the ﬁrst pass de - coding as supervision. Our experiments show that using the lattice as supervision achieves better results than using t he best path, even when conﬁdence-based data selection is used to re - move transcripts with many po...

work page

[6] [6]

Maximum likelihood l in- ear regression for speaker adaptation of continuous density hidden markov models,

C. J. Leggetter and P . C. Woodland, “Maximum likelihood l in- ear regression for speaker adaptation of continuous density hidden markov models,” Computer speech & language , vol. 9, no. 2, pp. 171–185, 1995

work page 1995

[7] [7]

Maximum likelihood linear transformations f or HMM-based speech recognition,

M. Gales, “Maximum likelihood linear transformations f or HMM-based speech recognition,” Computer speech & language , vol. 12, no. 2, pp. 75–98, 1998

work page 1998

[8] [8]

Learning hidden u nit contri- butions for unsupervised acoustic model adaptation,

P . Swietojanski, J. Li, and S. Renals, “Learning hidden u nit contri- butions for unsupervised acoustic model adaptation,” IEEE Trans- actions on Audio, Speech, and Language Processing , vol. 14, pp. 1450–1463, 2016

work page 2016

[9] [9]

Singular val ue de- composition based low-footprint speaker adaptation and pe rson- alization for deep neural network,

J. Xue, J. Li, D. Y u, M. Seltzer, and Y . Gong, “Singular val ue de- composition based low-footprint speaker adaptation and pe rson- alization for deep neural network,” in ICASSP, 2014

work page 2014

[10] [10]

Speaker adaptation of context dependent deep n eural networks,

H. Liao, “Speaker adaptation of context dependent deep n eural networks,” in ICASSP, 2013

work page 2013

[11] [11]

KL-divergence re gu- larized deep neural network adaptation for improved large v ocab- ulary speech recognition,

D. Y u, K. Y ao, H. Su, G. Li, and F. Seide, “KL-divergence re gu- larized deep neural network adaptation for improved large v ocab- ulary speech recognition,” in ICASSP, 2013

work page 2013

[12] [12]

Front-end factor analysis for speaker veriﬁcation,

N. Dehak, P . J. Kenny, R. Dehak, P . Dumouchel, and P . Ouell et, “Front-end factor analysis for speaker veriﬁcation,” IEEE Trans- actions on Audio, Speech, and Language Processing , vol. 19, no. 4, pp. 788–798, 2011

work page 2011

[13] [13]

Speaker adaptation of neural network acoustic models using i-vecto rs,

G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, “Speaker adaptation of neural network acoustic models using i-vecto rs,” in ASRU, 2013

work page 2013

[14] [14]

Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminativ e learning of speaker code,

O. Abdel-Hamid and H. Jiang, “Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminativ e learning of speaker code,” in ICASSP, 2013

work page 2013

[15] [15]

On combining i-vectors and dis- criminative adaptation methods for unsupervised speaker n ormal- ization in DNN acoustic models,

L. Samarakoon and K. C. Sim, “On combining i-vectors and dis- criminative adaptation methods for unsupervised speaker n ormal- ization in DNN acoustic models,” in IEEE ICASSP, 2016

work page 2016

[16] [16]

Speaker adaptation for continuous den sity HMMs: A review,

P . C. Woodland, “Speaker adaptation for continuous den sity HMMs: A review,” in ISCA W orkshop on Adaptation Methods for Speech Recognition, 2001

work page 2001

[17] [17]

Discri minative training of acoustic models applied to domains with unrelia ble transcripts [speech recognition applications],

L. Mathias, G. Y egnanarayanan, and J. Fritsch, “Discri minative training of acoustic models applied to domains with unrelia ble transcripts [speech recognition applications],” in ICASSP, 2005

work page 2005

[18] [18]

Investiga ting data selection for minimum phone error training of acoustic mode ls,

S.-H. Liu, F.-H. Chu, S.-H. Lin, and B. Chen, “Investiga ting data selection for minimum phone error training of acoustic mode ls,” in Multimedia and Expo, 2007 IEEE International Conference on. IEEE, 2007

work page 2007

[19] [19]

Semi-Supervised Model Training for Unbounded Conversational Speech Recognition

S. Walker, M. Pedersen, I. Orife, and J. Flaks, “Semi-su pervised model training for unbounded conversational speech recognition,” arXiv preprint arXiv:1705.09724, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[20] [20]

Semi-supervised DNN training with word selection for ASR,

K. V esel´ y, L. Burget, and J. ˇCernock´ y, “Semi-supervised DNN training with word selection for ASR,” in Interspeech, 2017

work page 2017

[21] [21]

Dnn adaptation by automatic quality estimation of asr hypotheses,

D. Falavigna, M. Matassoni, S. Jalalvand, M. Negri, and M. Turchi, “Dnn adaptation by automatic quality estimation of asr hypotheses,” Computer Speech & Language, vol. 46, pp. 585– 604, 2017

work page 2017

[22] [22]

Learning hidden unit co ntri- butions for unsupervised speaker adaptation of neural netw ork acoustic models,

P . Swietojanski and S. Renals, “Learning hidden unit co ntri- butions for unsupervised speaker adaptation of neural netw ork acoustic models,” in SLT, 2014

work page 2014

[23] [23]

Subspace lhuc for fast adap ta- tion of deep neural network acoustic models

L. Samarakoon and K. C. Sim, “Subspace lhuc for fast adap ta- tion of deep neural network acoustic models.” in INTERSPEECH, 2016

work page 2016

[24] [24]

Extended low-rank plus diagonal adaptation for deep and recurrent neural networks ,

Y . Zhao, J. Li, K. Kumar, and Y . Gong, “Extended low-rank plus diagonal adaptation for deep and recurrent neural networks ,” in ICASSP, 2017

work page 2017

[25] [25]

Regularized adaptation of discrim inative classiﬁers,

X. Li and J. Bilmes, “Regularized adaptation of discrim inative classiﬁers,” in ICASSP, 2006

work page 2006

[26] [26]

Lattice- based unsu- pervised acoustic model training,

T. Fraga-Silva, J.-L. Gauvain, and L. Lamel, “Lattice- based unsu- pervised acoustic model training,” in ICASSP, 2011

work page 2011

[27] [27]

Semi - supervised training of acoustic models using lattice-free MMI,

V . Manohar, H. Hadian, D. Povey, and S. Khudanpur, “Semi - supervised training of acoustic models using lattice-free MMI,” in ICASSP, 2018

work page 2018

[28] [28]

TED-LIUM:an auto- matic speech recognition dedicated corpus,

A. Rousseau, P . Del´ eglise, and Y . Est` eve, “TED-LIUM:an auto- matic speech recognition dedicated corpus,” in LREC, 2012

work page 2012

[29] [29]

Enhancing the TED- LIUM corpus with selected data for language modeling and mor e TED talks,

A. Rousseau, P . Del´ eglise, and Y . Est` eve, “Enhancing the TED- LIUM corpus with selected data for language modeling and mor e TED talks,” in LREC, 2014

work page 2014

[30] [30]

The MGB challenge: Evaluating multi-genre broadcast media recognition,

P. Bell, M. Gales, T. Hain, J. Kilgour, P. Lanchantin, X. Liu, A. McParland, S. Renals, O. Saz, M. Wester, and P. C. Woodland, “The MGB challenge: Evaluating multi-genre broadcast media recognition,” in ASRU, 2015

work page 2015

[31] [31]

Maximum mutual information estimation of hidden Markov model parameters for speech recognition,

L. Bahl, P. Brown, P. De Souza, and R. Mercer, “Maximum mutual information estimation of hidden Markov model parameters for speech recognition,” in Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP’86, vol. 11. IEEE, 1986, pp. 49–52

work page 1986

[32] [32]

Unsupervised training and directed manual transcription for LVCSR,

K. Yu, M. Gales, L. Wang, and P. C. Woodland, “Unsupervised training and directed manual transcription for LVCSR,” Speech Communication, vol. 52, no. 7-8, pp. 652–663, 2010

work page 2010

[33] [33]

Lattice-based unsupervised MLLR for speaker adaptation,

M. Padmanabhan, G. Saon, and G. Zweig, “Lattice-based unsupervised MLLR for speaker adaptation,” in ASR2000-automatic speech recognition: challenges for the New Millenium ISCA Tutorial and Research Workshop (ITRW), 2000

work page 2000

[34] [34]

Discriminative training for large vocabulary speech recognition,

D. Povey, “Discriminative training for large vocabulary speech recognition,” Ph.D. dissertation, University of Cambridge, 2005

work page 2005

[35] [35]

Purely sequence-trained neural networks for ASR based on lattice-free MMI,

D. Povey, V. Peddinti, D. Galvez, P. Ghahrmani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequence-trained neural networks for ASR based on lattice-free MMI,” Interspeech, 2016

work page 2016

[36] [36]

Sequence-discriminative training of deep neural networks,

K. Veselý, A. Ghoshal, L. Burget, and D. Povey, “Sequence-discriminative training of deep neural networks,” in Interspeech, 2013

work page 2013

[37] [37]

A novel loss function for the overall risk criterion based discriminative training of HMM models,

J. Kaiser, B. Horvat, and Z. Kacic, “A novel loss function for the overall risk criterion based discriminative training of HMM models,” in Sixth International Conference on Spoken Language Processing, 2000

work page 2000

[38] [38]

End-to-end speech recognition using lattice-free MMI,

H. Hadian, H. Sameti, D. Povey, and S. Khudanpur, “End-to-end speech recognition using lattice-free MMI,” Interspeech, 2018

work page 2018

[39] [39]

A compact model for speaker-adaptive training,

T. Anastasakos, J. McDonough, R. Schwartz, and J. Makhoul, “A compact model for speaker-adaptive training,” in ICSLP, 1996, pp. 1137–1140

work page 1996

[40] [40]

SAT-LHUC: Speaker adaptive training for learning hidden unit contributions,

P. Swietojanski and S. Renals, “SAT-LHUC: Speaker adaptive training for learning hidden unit contributions,” in ICASSP, 2016

work page 2016

[41] [41]

The Kaldi speech recognition toolkit,

D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlíček, Y. Qian, P. Schwarz, J. Silovský, G. Stemmer, and K. Veselý, “The Kaldi speech recognition toolkit,” in ASRU, 2011

work page 2011

[42] [42]

A time delay neural network architecture for efficient modeling of long temporal contexts

V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” in Interspeech, 2015

work page 2015

[43] [43]

Overview of the IWSLT 2012 evaluation campaign,

M. Federico, M. Cettolo, L. Bentivogli, M. Paul, and S. Stüker, “Overview of the IWSLT 2012 evaluation campaign,” in IWSLT, 2012

work page 2012

[44] [44]

Semi-orthogonal low-rank matrix factorization for deep neural networks,

D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohamadi, and S. Khudanpur, “Semi-orthogonal low-rank matrix factorization for deep neural networks,” Interspeech, 2018

work page 2018

[45] [45]

A pitch extraction algorithm tuned for automatic speech recognition,

P. Ghahremani, B. BabaAli, D. Povey, K. Riedhammer, J. Trmal, and S. Khudanpur, “A pitch extraction algorithm tuned for automatic speech recognition,” in ICASSP, 2014

work page 2014

[46] [46]

Multilingual representations for low resource speech recognition and keyword search,

J. Cui, B. Kingsbury, B. Ramabhadran, A. Sethy, K. Audkhasi, X. Cui, E. Kislal, L. Mangu, M. Nussbaum-Thom, M. Picheny, Z. Tüske, P. Golik, R. Schlüter, H. Ney, M. J. F. Gales, K. M. Knill, A. Ragni, H. Wang, and P. C. Woodland, “Multilingual representations for low resource speech recognition and keyword search,” in IEEE ICASSP, 2016

work page 2016

[47] [47]

Low-resource speech recognition and keyword-spotting,

M. J. F. Gales, K. M. Knill, and A. Ragni, “Low-resource speech recognition and keyword-spotting,” in SPECOM, 2017

work page 2017