Lattice-Based Unsupervised Test-Time Adaptation of Neural Network Acoustic Models
Pith reviewed 2026-05-25 15:05 UTC · model grok-4.3
The pith
Using lattices from first-pass decoding for discriminative adaptation of neural acoustic models allows adapting more parameters without overfitting even when initial transcriptions have over 50% word error rate.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Discriminative adaptation using lattices obtained from first-pass decoding can be integrated into the LF-MMI framework. This enables adaptation of many more parameters without overfitting being observed, and it remains successful even when the initial transcription has a WER in excess of 50%. The approach is evaluated on three tasks: TED talks, MGB, and Somali.
What carries the argument
Lattice-based discriminative adaptation within the LF-MMI framework, where lattices from unadapted model decoding provide the supervision for adaptation transforms.
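For orientation, the objective behind this mechanism can be sketched in standard MMI notation. The symbols below (X_u, G^num_u, G^den, theta) are assumptions chosen for illustration and are not taken from this page; the paper's own notation may differ.

```latex
\mathcal{F}_{\text{LF-MMI}}(\theta)
  = \sum_{u} \log
    \frac{p_\theta\!\left(\mathbf{X}_u \mid \mathcal{G}^{\text{num}}_u\right)}
         {p_\theta\!\left(\mathbf{X}_u \mid \mathcal{G}^{\text{den}}\right)}
```

Here X_u is the acoustic sequence of utterance u, G^den is a denominator graph shared across utterances, and G^num_u is the utterance-specific numerator graph. With one-best supervision, G^num_u encodes a single hypothesised word sequence. With lattice supervision, it would instead encode the alternative hypotheses retained in the first-pass lattice, so the numerator likelihood sums over those paths rather than committing to one.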
If this is right
- Adaptation of many more parameters becomes possible without overfitting.
- Method works on tasks of varying difficulty, including those with high-error initial transcripts.
- Integrates readily into the existing LF-MMI training framework.
- Reduces mismatch between training and testing conditions in acoustic models.
Where Pith is reading between the lines
- This approach could reduce the need for supervised adaptation data in new domains.
- Extending lattice use might improve robustness in other unsupervised learning settings for sequence models.
- Testable on additional low-resource languages to confirm generalization.
- Potential for combining with other adaptation techniques like speaker adaptation.
Load-bearing premise
Lattices generated by an unadapted model on test data contain enough correct discriminative information to guide adaptation without the model overfitting to transcription errors, even at word error rates above 50%.
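The premise can be illustrated with a toy comparison at the level of whole paths. This is only a sketch under stated assumptions: the three hypotheses and their log scores are invented, the function names are hypothetical, and the real method works with frame-level forward-backward computation over a numerator graph rather than an explicit list of paths.

```python
import math
from typing import Dict


def path_posteriors(log_scores: Dict[str, float]) -> Dict[str, float]:
    """Normalise per-path log scores into posteriors using a stable log-sum-exp."""
    peak = max(log_scores.values())
    log_total = peak + math.log(sum(math.exp(s - peak) for s in log_scores.values()))
    return {path: math.exp(s - log_total) for path, s in log_scores.items()}


def one_best_weights(log_scores: Dict[str, float]) -> Dict[str, float]:
    """Put all supervision weight on the single highest-scoring path."""
    best = max(log_scores, key=log_scores.get)
    return {path: 1.0 if path == best else 0.0 for path in log_scores}


# Hypothetical first-pass scores; suppose "the cat sat" is what was actually said,
# so the top-scoring hypothesis is wrong.
toy_lattice = {"the cat sad": -9.5, "the cat sat": -10.0, "a cat sat": -11.0}

lattice_w = path_posteriors(toy_lattice)
one_best_w = one_best_weights(toy_lattice)

for path in toy_lattice:
    print(f"{path!r}: one-best={one_best_w[path]:.2f} lattice={lattice_w[path]:.2f}")
```

In this toy case the one-best supervision gives the correct hypothesis zero weight, while the lattice keeps a share of the supervision on it. Whether that share is large enough in practice is exactly what the premise asserts and the experiments have to show.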
What would settle it
Running the adaptation on a dataset where initial WER exceeds 50% and observing that the adapted model shows higher WER than the unadapted baseline or the one-best adaptation method.
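A minimal sketch of how such a check could be scored, assuming plain whitespace-tokenised transcripts. The function names and the 50% threshold argument are hypothetical helpers for illustration; the paper's experiments use the Kaldi toolkit's own scoring, which this does not reproduce.

```python
from typing import Sequence


def word_errors(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Word error rate in percent over paired reference/hypothesis strings."""
    errors = sum(word_errors(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return 100.0 * errors / words


def relative_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Relative WER improvement in percent (positive means the adapted model is better)."""
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer


def claim_is_contradicted(baseline_wer: float, one_best_wer: float,
                          lattice_wer: float, threshold: float = 50.0) -> bool:
    """True if lattice adaptation is worse than either comparison system."""
    if baseline_wer <= threshold:
        raise ValueError("the test only applies when the initial WER exceeds the threshold")
    return lattice_wer > baseline_wer or lattice_wer > one_best_wer
```

For example, `claim_is_contradicted(60.0, 58.0, 55.0)` is false, whereas a lattice-adapted WER of 61.0 against the same baselines would count as evidence against the claim.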
Original abstract
Acoustic model adaptation to unseen test recordings aims to reduce the mismatch between training and testing conditions. Most adaptation schemes for neural network models require the use of an initial one-best transcription for the test data, generated by an unadapted model, in order to estimate the adaptation transform. It has been found that adaptation methods using discriminative objective functions - such as cross-entropy loss - often require careful regularisation to avoid over-fitting to errors in the one-best transcriptions. In this paper we solve this problem by performing discriminative adaptation using lattices obtained from a first pass decoding, an approach that can be readily integrated into the lattice-free maximum mutual information (LF-MMI) framework. We investigate this approach on three transcription tasks of varying difficulty: TED talks, multi-genre broadcast (MGB) and a low-resource language (Somali). We find that our proposed approach enables many more parameters to be adapted without over-fitting being observed, and is successful even when the initial transcription has a WER in excess of 50%.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes lattice-based unsupervised test-time adaptation of neural network acoustic models within the lattice-free maximum mutual information (LF-MMI) framework. Instead of relying on one-best transcriptions from an unadapted model, it uses lattices from first-pass decoding to perform discriminative adaptation. The central claim is that this enables adaptation of many more parameters without overfitting and remains effective even when initial WER exceeds 50%, as demonstrated on three tasks of varying difficulty: TED talks, multi-genre broadcast (MGB), and Somali.
Significance. If the results hold, the approach offers a practical way to increase the number of adaptable parameters in test-time adaptation while avoiding the overfitting issues common with one-best transcriptions and discriminative objectives. The integration with the established LF-MMI framework is a strength, as is the evaluation across tasks spanning high-resource to low-resource settings. The empirical demonstration that the method works above 50% initial WER directly addresses a key practical limitation in unsupervised adaptation.
minor comments (2)
- [Abstract] The abstract and introduction would benefit from a brief statement of the exact number of parameters adapted in the lattice-based vs. one-best conditions to make the 'many more parameters' claim immediately quantifiable.
- [Experimental results] Figure or table captions should explicitly note the initial WER of the unadapted model for each task so readers can directly verify the >50% WER regime without cross-referencing the text.
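To make the first comment concrete, the gap between adapting only LHUC parameters and adapting all parameters can be counted directly. This is a sketch assuming a plain stack of affine layers with one LHUC amplitude per hidden unit; the layer sizes are hypothetical and are not the paper's architecture, and the function names are illustrative.

```python
from typing import List, Tuple

LayerDims = List[Tuple[int, int]]  # (input_dim, output_dim) per affine layer


def count_all_parameters(layer_dims: LayerDims) -> int:
    """Weights plus biases for every affine layer."""
    return sum(i * o + o for i, o in layer_dims)


def count_lhuc_parameters(layer_dims: LayerDims) -> int:
    """One LHUC amplitude per hidden unit; the final (output) layer is excluded."""
    return sum(o for _, o in layer_dims[:-1])


# Hypothetical network: 40-dim input, three 512-unit hidden layers, 3000 outputs.
layers = [(40, 512), (512, 512), (512, 512), (512, 3000)]

print("all parameters: ", count_all_parameters(layers))
print("LHUC parameters:", count_lhuc_parameters(layers))
```

Even for this small hypothetical network the two settings differ by roughly three orders of magnitude, which is the kind of figure the referee asks the authors to state explicitly.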
Simulated Author's Rebuttal
We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report lists no specific major comments.
Circularity Check
No significant circularity detected
full rationale
The paper's central approach integrates lattices from first-pass decoding into the established LF-MMI framework for unsupervised adaptation. No derivation step reduces by construction to fitted parameters or self-referential definitions; the method is presented as an extension of prior LF-MMI work with external lattice inputs, and empirical results on TED, MGB, and Somali are reported as independent validation rather than tautological outcomes. No self-citation chains, ansatzes smuggled via citation, or uniqueness theorems imported from the authors appear in the load-bearing claims.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Lattices from unadapted first-pass decoding contain useful information for discriminative adaptation, even with WER exceeding 50%.
Reference graph
Works this paper leans on
-
[1]
Introduction Acoustic model adaptation aims to improve automatic speech recognition (ASR) accuracy by reducing the mismatch between training and test conditions. In feature-space adaptation, transformations of acoustic features are estimated to maximise the log-likelihood of the adaptation data [1, 2]. A subset of the weights of a neural network acou...
-
[2]
Methods 2.1. Lattice supervision and LF-MMI Discriminative training using criteria such as maximum mutual information (MMI) [26] has been shown to be sensitive to the accuracy of the transcripts [12, 27]. In lieu of better transcripts, a range of transcript filtering approaches have previously been explored [12, 13, 14]. In unsupervised or semi-supervis...
-
[3]
Experiments We conducted test-time model adaptation experiments on three datasets: the TED-LIUM corpus of TED talks [23, 24], multi-genre TV broadcasts from the MGB 1 Challenge [25] and a corpus of Somali from the IARPA MATERIAL programme. All models were trained and adapted using the Kaldi toolkit [36]. We describe the respective baseline models in s...
-
[4]
Results We conducted the first set of experiments on the TED-LIUM dataset. Adaptation of the model without i-vectors using lattices achieves 10–15% relative improvement when adapting LHUC parameters and 9–14% relative improvement when adapting all parameters, whereas improvements when adapting using best path and all adaptation data were much small...
-
[5]
Conclusions In this paper we compared unsupervised model adaptation using a lattice with the best path obtained from the first pass decoding as supervision. Our experiments show that using the lattice as supervision achieves better results than using the best path, even when confidence-based data selection is used to remove transcripts with many po...
-
[6]
C. J. Leggetter and P. C. Woodland, “Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models,” Computer Speech & Language, vol. 9, no. 2, pp. 171–185, 1995
-
[7]
M. Gales, “Maximum likelihood linear transformations for HMM-based speech recognition,” Computer Speech & Language, vol. 12, no. 2, pp. 75–98, 1998
-
[8]
P. Swietojanski, J. Li, and S. Renals, “Learning hidden unit contributions for unsupervised acoustic model adaptation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, pp. 1450–1463, 2016
-
[9]
J. Xue, J. Li, D. Yu, M. Seltzer, and Y. Gong, “Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network,” in ICASSP, 2014
-
[10]
H. Liao, “Speaker adaptation of context dependent deep neural networks,” in ICASSP, 2013
-
[11]
D. Yu, K. Yao, H. Su, G. Li, and F. Seide, “KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition,” in ICASSP, 2013
-
[12]
N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011
-
[13]
G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, “Speaker adaptation of neural network acoustic models using i-vectors,” in ASRU, 2013
-
[14]
O. Abdel-Hamid and H. Jiang, “Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code,” in ICASSP, 2013
-
[15]
L. Samarakoon and K. C. Sim, “On combining i-vectors and discriminative adaptation methods for unsupervised speaker normalization in DNN acoustic models,” in IEEE ICASSP, 2016
-
[16]
P. C. Woodland, “Speaker adaptation for continuous density HMMs: A review,” in ISCA Workshop on Adaptation Methods for Speech Recognition, 2001
-
[17]
L. Mathias, G. Yegnanarayanan, and J. Fritsch, “Discriminative training of acoustic models applied to domains with unreliable transcripts [speech recognition applications],” in ICASSP, 2005
-
[18]
S.-H. Liu, F.-H. Chu, S.-H. Lin, and B. Chen, “Investigating data selection for minimum phone error training of acoustic models,” in Multimedia and Expo, 2007 IEEE International Conference on. IEEE, 2007
-
[19]
S. Walker, M. Pedersen, I. Orife, and J. Flaks, “Semi-supervised model training for unbounded conversational speech recognition,” arXiv preprint arXiv:1705.09724, 2017
-
[20]
K. Veselý, L. Burget, and J. Černocký, “Semi-supervised DNN training with word selection for ASR,” in Interspeech, 2017
-
[21]
D. Falavigna, M. Matassoni, S. Jalalvand, M. Negri, and M. Turchi, “DNN adaptation by automatic quality estimation of ASR hypotheses,” Computer Speech & Language, vol. 46, pp. 585–604, 2017
-
[22]
P. Swietojanski and S. Renals, “Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models,” in SLT, 2014
-
[23]
L. Samarakoon and K. C. Sim, “Subspace LHUC for fast adaptation of deep neural network acoustic models,” in INTERSPEECH, 2016
-
[24]
Y. Zhao, J. Li, K. Kumar, and Y. Gong, “Extended low-rank plus diagonal adaptation for deep and recurrent neural networks,” in ICASSP, 2017
-
[25]
X. Li and J. Bilmes, “Regularized adaptation of discriminative classifiers,” in ICASSP, 2006
-
[26]
T. Fraga-Silva, J.-L. Gauvain, and L. Lamel, “Lattice-based unsupervised acoustic model training,” in ICASSP, 2011
-
[27]
V. Manohar, H. Hadian, D. Povey, and S. Khudanpur, “Semi-supervised training of acoustic models using lattice-free MMI,” in ICASSP, 2018
-
[28]
A. Rousseau, P. Deléglise, and Y. Estève, “TED-LIUM: an automatic speech recognition dedicated corpus,” in LREC, 2012
-
[29]
A. Rousseau, P. Deléglise, and Y. Estève, “Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks,” in LREC, 2014
-
[30]
P. Bell, M. Gales, T. Hain, J. Kilgour, P. Lanchantin, X. Liu, A. McParland, S. Renals, O. Saz, M. Wester, and P. C. Woodland, “The MGB challenge: Evaluating multi-genre broadcast media recognition,” in ASRU, 2015
-
[31]
L. Bahl, P. Brown, P. De Souza, and R. Mercer, “Maximum mutual information estimation of hidden Markov model parameters for speech recognition,” in Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP’86, vol. 11. IEEE, 1986, pp. 49–52
-
[32]
K. Yu, M. Gales, L. Wang, and P. C. Woodland, “Unsupervised training and directed manual transcription for LVCSR,” Speech Communication, vol. 52, no. 7-8, pp. 652–663, 2010
-
[33]
M. Padmanabhan, G. Saon, and G. Zweig, “Lattice-based unsupervised MLLR for speaker adaptation,” in ASR2000-Automatic Speech Recognition: Challenges for the New Millenium ISCA Tutorial and Research Workshop (ITRW), 2000
-
[34]
D. Povey, “Discriminative training for large vocabulary speech recognition,” Ph.D. dissertation, University of Cambridge, 2005
-
[35]
D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequence-trained neural networks for ASR based on lattice-free MMI,” in Interspeech, 2016
-
[36]
K. Veselý, A. Ghoshal, L. Burget, and D. Povey, “Sequence-discriminative training of deep neural networks,” in Interspeech, 2013
-
[37]
J. Kaiser, B. Horvat, and Z. Kacic, “A novel loss function for the overall risk criterion based discriminative training of HMM models,” in Sixth International Conference on Spoken Language Processing, 2000
-
[38]
H. Hadian, H. Sameti, D. Povey, and S. Khudanpur, “End-to-end speech recognition using lattice-free MMI,” in Interspeech, 2018
-
[39]
T. Anastasakos, J. McDonough, R. Schwartz, and J. Makhoul, “A compact model for speaker-adaptive training,” in ICSLP, 1996, pp. 1137–1140
-
[40]
P. Swietojanski and S. Renals, “SAT-LHUC: Speaker adaptive training for learning hidden unit contributions,” in ICASSP, 2016
-
[41]
D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlíček, Y. Qian, P. Schwarz, J. Silovský, G. Stemmer, and K. Veselý, “The Kaldi speech recognition toolkit,” in ASRU, 2011
-
[42]
V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” in Interspeech, 2015
-
[43]
M. Federico, M. Cettolo, L. Bentivogli, M. Paul, and S. Stüker, “Overview of the IWSLT 2012 evaluation campaign,” in IWSLT, 2012
-
[44]
D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohamadi, and S. Khudanpur, “Semi-orthogonal low-rank matrix factorization for deep neural networks,” in Interspeech, 2018
-
[45]
P. Ghahremani, B. BabaAli, D. Povey, K. Riedhammer, J. Trmal, and S. Khudanpur, “A pitch extraction algorithm tuned for automatic speech recognition,” in ICASSP, 2014
-
[46]
J. Cui, B. Kingsbury, B. Ramabhadran, A. Sethy, K. Audkhasi, X. Cui, E. Kislal, L. Mangu, M. Nussbaum-Thom, M. Picheny, Z. Tüske, P. Golik, R. Schlüter, H. Ney, M. J. F. Gales, K. M. Knill, A. Ragni, H. Wang, and P. C. Woodland, “Multilingual representations for low resource speech recognition and keyword search,” in IEEE ICASSP, 2016
-
[47]
M. J. F. Gales, K. M. Knill, and A. Ragni, “Low-resource speech recognition and keyword-spotting,” in SPECOM, 2017