arxiv: 1906.11521 · v1 · pith:BJ6EQI5Lnew · submitted 2019-06-27 · 💻 cs.CL · cs.SD · eess.AS

Lattice-Based Unsupervised Test-Time Adaptation of Neural Network Acoustic Models

Pith reviewed 2026-05-25 15:05 UTC · model grok-4.3

classification 💻 cs.CL · cs.SD · eess.AS
keywords unsupervised adaptation · test-time adaptation · acoustic model adaptation · lattice-based adaptation · LF-MMI · neural network acoustic models · speech recognition · discriminative adaptation

The pith

Using lattices from first-pass decoding for discriminative adaptation of neural acoustic models allows adapting more parameters without overfitting even when initial transcriptions have over 50% word error rate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that unsupervised test-time adaptation of neural network acoustic models works better when it is supervised by first-pass lattices than by one-best transcriptions. The approach fits into the lattice-free maximum mutual information (LF-MMI) framework and, per the abstract, avoids the careful regularisation that discriminative one-best adaptation needs to cope with errors in the initial transcripts. A reader would care because it addresses the mismatch between training and test conditions in speech recognition across three tasks of varying difficulty: TED talks, multi-genre broadcast (MGB), and low-resource Somali. The abstract reports that many more parameters can be adapted without over-fitting being observed.

Core claim

Discriminative adaptation using lattices obtained from a first pass decoding can be integrated into the LF-MMI framework, enabling adaptation of many more parameters without observing overfitting, and remaining successful even when the initial transcription has a WER in excess of 50% on tasks such as TED talks, MGB, and Somali.

What carries the argument

Lattice-based discriminative adaptation within the LF-MMI framework, where lattices from unadapted model decoding provide the supervision for adaptation transforms.
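
The contrast between one-best and lattice supervision can be illustrated with a toy MMI-style objective. The sketch below is an illustration under stated assumptions, not the paper's or Kaldi's implementation: it treats each hypothesis as a single path with a made-up log score, takes the numerator as the log-sum over the supervision paths and the denominator as the log-sum over all paths, and compares the resulting gradients. The path strings, scores, and function names are all hypothetical.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of log scores."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def mmi_objective(scores, supervision):
    """Toy MMI: log-sum over supervision paths minus log-sum over all paths."""
    numerator = logsumexp([scores[p] for p in supervision])
    denominator = logsumexp(list(scores.values()))
    return numerator - denominator

def mmi_gradient(scores, supervision):
    """Gradient w.r.t. each path score: numerator posterior minus denominator posterior."""
    numerator = logsumexp([scores[p] for p in supervision])
    denominator = logsumexp(list(scores.values()))
    grad = {}
    for path, score in scores.items():
        num_post = math.exp(score - numerator) if path in supervision else 0.0
        den_post = math.exp(score - denominator)
        grad[path] = num_post - den_post
    return grad

# Hypothetical first-pass hypotheses with made-up log scores.
scores = {
    "the cat sat": -1.0,   # top-scoring path, possibly wrong
    "the cat sad": -1.2,   # close competitor, possibly the true transcript
    "a cat sat": -2.5,
    "the hat sat": -4.0,
}
one_best = {"the cat sat"}
lattice = {"the cat sat", "the cat sad", "a cat sat"}  # pruned set of alternatives

for name, supervision in [("one-best", one_best), ("lattice", lattice)]:
    grad = mmi_gradient(scores, supervision)
    print(name, {p: round(g, 3) for p, g in grad.items()})
```

Under one-best supervision every competing path, including the close competitor, receives a negative gradient. Under lattice supervision the paths inside the lattice share the positive mass, so a mistake in the top hypothesis is not reinforced as strongly.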

If this is right

  • Adaptation of many more parameters becomes possible without overfitting.
  • Method works on tasks with varying difficulty including high-error initial transcripts.
  • Integrates readily into existing LF-MMI training framework.
  • Reduces mismatch between training and testing conditions in acoustic models.
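
To make "many more parameters" concrete, the sketch below contrasts the two regimes named in the results snippet extracted further down this page: adapting only LHUC parameters versus adapting all parameters. It assumes the standard LHUC formulation, in which each hidden unit's output is rescaled by an amplitude 2·sigmoid(r) with r learned on the adaptation data. The layer sizes are hypothetical, since this page does not state the paper's architecture or parameter counts.

```python
import math

def lhuc_scale(hidden, r):
    """Rescale each hidden activation by 2*sigmoid(r_i); r_i = 0 leaves it unchanged."""
    return [h * 2.0 / (1.0 + math.exp(-ri)) for h, ri in zip(hidden, r)]

def count_lhuc_params(layer_dims):
    """One LHUC amplitude parameter per hidden unit in each layer."""
    return sum(layer_dims)

def count_all_params(input_dim, layer_dims):
    """Weights plus biases of a plain feed-forward stack (output layer omitted)."""
    total, prev = 0, input_dim
    for dim in layer_dims:
        total += prev * dim + dim
        prev = dim
    return total

# Hypothetical sizes, chosen only for illustration.
input_dim = 40
layer_dims = [512, 512, 512]

hidden = [0.5, -1.0, 2.0]
print(lhuc_scale(hidden, [0.0, 0.0, 0.0]))   # identity at initialisation
print(count_lhuc_params(layer_dims))          # parameters touched by LHUC adaptation
print(count_all_params(input_dim, layer_dims))  # parameters touched by full adaptation
```

In this toy configuration full adaptation touches several hundred times more parameters than LHUC, which is the regime where over-fitting to transcription errors would normally be a concern.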

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach could reduce the need for supervised adaptation data in new domains.
  • Extending lattice use might improve robustness in other unsupervised learning settings for sequence models.
  • Testable on additional low-resource languages to confirm generalization.
  • Potential for combining with other adaptation techniques like speaker adaptation.

Load-bearing premise

Lattices generated by an unadapted model on test data contain enough correct discriminative information to guide adaptation without the model overfitting to transcription errors, even at word error rates above 50%.

What would settle it

Running the adaptation on a dataset where initial WER exceeds 50% and observing that the adapted model shows higher WER than the unadapted baseline or the one-best adaptation method.
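
The check described above reduces to comparing word error rates before and after adaptation. A minimal sketch follows, assuming whitespace-tokenised transcripts and the usual Levenshtein-based WER definition. The function names and example sentences are hypothetical and are not taken from the paper's data.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (sub/ins/del) divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)

def adaptation_helped(reference, unadapted_hyp, adapted_hyp):
    """True if the adapted output has strictly lower WER than the unadapted one."""
    return word_error_rate(reference, adapted_hyp) < word_error_rate(reference, unadapted_hyp)

# Made-up example in the high-error regime (unadapted WER above 50%).
reference = "the cat sat on the mat"
unadapted = "a hat sad on mat"
adapted = "the cat sat on mat"

print(word_error_rate(reference, unadapted))
print(word_error_rate(reference, adapted))
print(adaptation_helped(reference, unadapted, adapted))
```

The claim would be undermined if, on real data with an unadapted WER above 50%, `adaptation_helped` came out false for the lattice-adapted system, or if the one-best system scored lower.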

Original abstract

Acoustic model adaptation to unseen test recordings aims to reduce the mismatch between training and testing conditions. Most adaptation schemes for neural network models require the use of an initial one-best transcription for the test data, generated by an unadapted model, in order to estimate the adaptation transform. It has been found that adaptation methods using discriminative objective functions - such as cross-entropy loss - often require careful regularisation to avoid over-fitting to errors in the one-best transcriptions. In this paper we solve this problem by performing discriminative adaptation using lattices obtained from a first pass decoding, an approach that can be readily integrated into the lattice-free maximum mutual information (LF-MMI) framework. We investigate this approach on three transcription tasks of varying difficulty: TED talks, multi-genre broadcast (MGB) and a low-resource language (Somali). We find that our proposed approach enables many more parameters to be adapted without over-fitting being observed, and is successful even when the initial transcription has a WER in excess of 50%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper proposes lattice-based unsupervised test-time adaptation of neural network acoustic models within the lattice-free maximum mutual information (LF-MMI) framework. Instead of relying on one-best transcriptions from an unadapted model, it uses lattices from first-pass decoding to perform discriminative adaptation. The central claim is that this enables adaptation of many more parameters without overfitting and remains effective even when initial WER exceeds 50%, as demonstrated on three tasks of varying difficulty: TED talks, multi-genre broadcast (MGB), and Somali.

Significance. If the results hold, the approach offers a practical way to increase the number of adaptable parameters in test-time adaptation while avoiding the overfitting issues common with one-best transcriptions and discriminative objectives. The integration with the established LF-MMI framework is a strength, as is the evaluation across tasks spanning high-resource to low-resource settings. The empirical demonstration that the method works above 50% initial WER directly addresses a key practical limitation in unsupervised adaptation.

minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from a brief statement of the exact number of parameters adapted in the lattice-based vs. one-best conditions to make the 'many more parameters' claim immediately quantifiable.
  2. [Experimental results] Figure or table captions should explicitly note the initial WER of the unadapted model for each task so readers can directly verify the >50% WER regime without cross-referencing the text.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report lists no specific major comments.

Circularity Check

0 steps flagged

No significant circularity detected

The paper's central approach integrates lattices from first-pass decoding into the established LF-MMI framework for unsupervised adaptation. No derivation step reduces by construction to fitted parameters or self-referential definitions; the method is presented as an extension of prior LF-MMI work with external lattice inputs, and empirical results on TED, MGB, and Somali are reported as independent validation rather than tautological outcomes. No self-citation chains, ansatzes smuggled via citation, or uniqueness theorems imported from the authors appear in the load-bearing claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review provides insufficient detail for exhaustive ledger; main domain assumption is the utility of first-pass lattices for adaptation.

axioms (1)
  • domain assumption: Lattices from unadapted first-pass decoding contain useful information for discriminative adaptation even with WER exceeding 50%.
    Central to enabling the method without overfitting to transcription errors.

pith-pipeline@v0.9.0 · 5716 in / 1141 out tokens · 25401 ms · 2026-05-25T15:05:14.959648+00:00 · methodology

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 1 internal anchor

  1. [1]

    In feature-space adaptation, transformations of acoustic features are estimated to maximise the log-likelihood of the adaptation data [1, 2]

    Introduction. Acoustic model adaptation aims to improve automatic speech recognition (ASR) accuracy by reducing the mismatch between training and test conditions. In feature-space adaptation, transformations of acoustic features are estimated to maximise the log-likelihood of the adaptation data [1, 2]. A subset of the weights of a neural network acou...

  2. [2]

    Methods. 2.1. Lattice supervision and LF-MMI. Discriminative training using criteria such as maximum mutual information (MMI) [26] has been shown to be sensitive to the accuracy of the transcripts [12, 27]. In lieu of better transcripts, a range of transcript filtering approaches have previously been explored [12, 13, 14]. In unsupervised or semi-supervis...

  3. [3]

    surprise language

    Experiments. We conducted test-time model adaptation experiments on three datasets: the TED-LIUM corpus of TED talks [23, 24], multi-genre TV broadcasts from the MGB 1 Challenge [25] and a corpus of Somali from the IARPA MATERIAL programme. All models were trained and adapted using the Kaldi toolkit [36]. We describe the respective baseline models in s...

  4. [4]

    Results. We conducted the first set of experiments on the TED-LIUM dataset. Adaptation of the model without i-vectors using lattices achieves 10–15% relative improvement when adapting LHUC parameters and 9–14% relative improvement when adapting all parameters, whereas improvements when adapting using best path and all adaptation data were much small...

  5. [5]

    Conclusions. In this paper we compared unsupervised model adaptation using a lattice with the best path obtained from the first pass decoding as supervision. Our experiments show that using the lattice as supervision achieves better results than using the best path, even when confidence-based data selection is used to remove transcripts with many po...

  6. [6]

    Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models,

    C. J. Leggetter and P. C. Woodland, “Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models,” Computer Speech & Language, vol. 9, no. 2, pp. 171–185, 1995

  7. [7]

    Maximum likelihood linear transformations for HMM-based speech recognition,

    M. Gales, “Maximum likelihood linear transformations for HMM-based speech recognition,” Computer Speech & Language, vol. 12, no. 2, pp. 75–98, 1998

  8. [8]

    Learning hidden unit contributions for unsupervised acoustic model adaptation,

    P. Swietojanski, J. Li, and S. Renals, “Learning hidden unit contributions for unsupervised acoustic model adaptation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, pp. 1450–1463, 2016

  9. [9]

    Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network,

    J. Xue, J. Li, D. Yu, M. Seltzer, and Y. Gong, “Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network,” in ICASSP, 2014

  10. [10]

    Speaker adaptation of context dependent deep neural networks,

    H. Liao, “Speaker adaptation of context dependent deep neural networks,” in ICASSP, 2013

  11. [11]

    KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition,

    D. Yu, K. Yao, H. Su, G. Li, and F. Seide, “KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition,” in ICASSP, 2013

  12. [12]

    Front-end factor analysis for speaker verification,

    N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011

  13. [13]

    Speaker adaptation of neural network acoustic models using i-vectors,

    G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, “Speaker adaptation of neural network acoustic models using i-vectors,” in ASRU, 2013

  14. [14]

    Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code,

    O. Abdel-Hamid and H. Jiang, “Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code,” in ICASSP, 2013

  15. [15]

    On combining i-vectors and discriminative adaptation methods for unsupervised speaker normalization in DNN acoustic models,

    L. Samarakoon and K. C. Sim, “On combining i-vectors and discriminative adaptation methods for unsupervised speaker normalization in DNN acoustic models,” in IEEE ICASSP, 2016

  16. [16]

    Speaker adaptation for continuous density HMMs: A review,

    P. C. Woodland, “Speaker adaptation for continuous density HMMs: A review,” in ISCA Workshop on Adaptation Methods for Speech Recognition, 2001

  17. [17]

    Discriminative training of acoustic models applied to domains with unreliable transcripts [speech recognition applications],

    L. Mathias, G. Yegnanarayanan, and J. Fritsch, “Discriminative training of acoustic models applied to domains with unreliable transcripts [speech recognition applications],” in ICASSP, 2005

  18. [18]

    Investigating data selection for minimum phone error training of acoustic models,

    S.-H. Liu, F.-H. Chu, S.-H. Lin, and B. Chen, “Investigating data selection for minimum phone error training of acoustic models,” in Multimedia and Expo, 2007 IEEE International Conference on. IEEE, 2007

  19. [19]

    Semi-Supervised Model Training for Unbounded Conversational Speech Recognition

    S. Walker, M. Pedersen, I. Orife, and J. Flaks, “Semi-supervised model training for unbounded conversational speech recognition,” arXiv preprint arXiv:1705.09724, 2017

  20. [20]

    Semi-supervised DNN training with word selection for ASR,

    K. Veselý, L. Burget, and J. Černocký, “Semi-supervised DNN training with word selection for ASR,” in Interspeech, 2017

  21. [21]

    DNN adaptation by automatic quality estimation of ASR hypotheses,

    D. Falavigna, M. Matassoni, S. Jalalvand, M. Negri, and M. Turchi, “DNN adaptation by automatic quality estimation of ASR hypotheses,” Computer Speech & Language, vol. 46, pp. 585–604, 2017

  22. [22]

    Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models,

    P. Swietojanski and S. Renals, “Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models,” in SLT, 2014

  23. [23]

    Subspace LHUC for fast adaptation of deep neural network acoustic models

    L. Samarakoon and K. C. Sim, “Subspace LHUC for fast adaptation of deep neural network acoustic models,” in Interspeech, 2016

  24. [24]

    Extended low-rank plus diagonal adaptation for deep and recurrent neural networks,

    Y. Zhao, J. Li, K. Kumar, and Y. Gong, “Extended low-rank plus diagonal adaptation for deep and recurrent neural networks,” in ICASSP, 2017

  25. [25]

    Regularized adaptation of discriminative classifiers,

    X. Li and J. Bilmes, “Regularized adaptation of discriminative classifiers,” in ICASSP, 2006

  26. [26]

    Lattice-based unsupervised acoustic model training,

    T. Fraga-Silva, J.-L. Gauvain, and L. Lamel, “Lattice-based unsupervised acoustic model training,” in ICASSP, 2011

  27. [27]

    Semi-supervised training of acoustic models using lattice-free MMI,

    V. Manohar, H. Hadian, D. Povey, and S. Khudanpur, “Semi-supervised training of acoustic models using lattice-free MMI,” in ICASSP, 2018

  28. [28]

    TED-LIUM: an automatic speech recognition dedicated corpus,

    A. Rousseau, P. Deléglise, and Y. Estève, “TED-LIUM: an automatic speech recognition dedicated corpus,” in LREC, 2012

  29. [29]

    Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks,

    A. Rousseau, P. Deléglise, and Y. Estève, “Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks,” in LREC, 2014

  30. [30]

    The MGB challenge: Evaluating multi-genre broadcast media recognition,

    P. Bell, M. Gales, T. Hain, J. Kilgour, P. Lanchantin, X. Liu, A. McParland, S. Renals, O. Saz, M. Wester, and P. C. Woodland, “The MGB challenge: Evaluating multi-genre broadcast media recognition,” in ASRU, 2015

  31. [31]

    Maximum mutual information estimation of hidden Markov model parameters for speech recognition,

    L. Bahl, P. Brown, P. De Souza, and R. Mercer, “Maximum mutual information estimation of hidden Markov model parameters for speech recognition,” in Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP’86, vol. 11. IEEE, 1986, pp. 49–52

  32. [32]

    Unsupervised training and directed manual transcription for LVCSR,

    K. Yu, M. Gales, L. Wang, and P. C. Woodland, “Unsupervised training and directed manual transcription for LVCSR,” Speech Communication, vol. 52, no. 7-8, pp. 652–663, 2010

  33. [33]

    Lattice-based unsupervised MLLR for speaker adaptation,

    M. Padmanabhan, G. Saon, and G. Zweig, “Lattice-based unsupervised MLLR for speaker adaptation,” in ASR2000-Automatic Speech Recognition: Challenges for the New Millenium ISCA Tutorial and Research Workshop (ITRW), 2000

  34. [34]

    Discriminative training for large vocabulary speech recognition,

    D. Povey, “Discriminative training for large vocabulary speech recognition,” Ph.D. dissertation, University of Cambridge, 2005

  35. [35]

    Purely sequence-trained neural networks for ASR based on lattice-free MMI,

    D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequence-trained neural networks for ASR based on lattice-free MMI,” Interspeech, 2016

  36. [36]

    Sequence-discriminative training of deep neural networks,

    K. Veselý, A. Ghoshal, L. Burget, and D. Povey, “Sequence-discriminative training of deep neural networks,” in Interspeech, 2013

  37. [37]

    A novel loss function for the overall risk criterion based discriminative training of HMM models,

    J. Kaiser, B. Horvat, and Z. Kacic, “A novel loss function for the overall risk criterion based discriminative training of HMM models,” in Sixth International Conference on Spoken Language Processing, 2000

  38. [38]

    End-to-end speech recognition using lattice-free MMI,

    H. Hadian, H. Sameti, D. Povey, and S. Khudanpur, “End-to-end speech recognition using lattice-free MMI,” Interspeech, 2018

  39. [39]

    A compact model for speaker-adaptive training,

    T. Anastasakos, J. McDonough, R. Schwartz, and J. Makhoul, “A compact model for speaker-adaptive training,” in ICSLP, 1996, pp. 1137–1140

  40. [40]

    SAT-LHUC: Speaker adaptive training for learning hidden unit contributions,

    P. Swietojanski and S. Renals, “SAT-LHUC: Speaker adaptive training for learning hidden unit contributions,” in ICASSP, 2016

  41. [41]

    The Kaldi speech recognition toolkit,

    D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlíček, Y. Qian, P. Schwarz, J. Silovský, G. Stemmer, and K. Veselý, “The Kaldi speech recognition toolkit,” in ASRU, 2011

  42. [42]

    A time delay neural network architecture for efficient modeling of long temporal contexts

    V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” in Interspeech, 2015

  43. [43]

    Overview of the IWSLT 2012 evaluation campaign,

    M. Federico, M. Cettolo, L. Bentivogli, M. Paul, and S. Stüker, “Overview of the IWSLT 2012 evaluation campaign,” in IWSLT, 2012

  44. [44]

    Semi-orthogonal low-rank matrix factorization for deep neural networks,

    D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohamadi, and S. Khudanpur, “Semi-orthogonal low-rank matrix factorization for deep neural networks,” Interspeech, 2018

  45. [45]

    A pitch extraction algorithm tuned for automatic speech recognition,

    P. Ghahremani, B. BabaAli, D. Povey, K. Riedhammer, J. Trmal, and S. Khudanpur, “A pitch extraction algorithm tuned for automatic speech recognition,” in ICASSP, 2014

  46. [46]

    Multilingual representations for low resource speech recognition and keyword search,

    J. Cui, B. Kingsbury, B. Ramabhadran, A. Sethy, K. Audkhasi, X. Cui, E. Kislal, L. Mangu, M. Nussbaum-Thom, M. Picheny, Z. Tüske, P. Golik, R. Schlüter, H. Ney, M. J. F. Gales, K. M. Knill, A. Ragni, H. Wang, and P. C. Woodland, “Multilingual representations for low resource speech recognition and keyword search,” in IEEE ICASSP, 2016

  47. [47]

    Low-resource speech recognition and keyword-spotting,

    M. J. F. Gales, K. M. Knill, and A. Ragni, “Low-resource speech recognition and keyword-spotting,” in SPECOM, 2017