Latent Dirichlet Allocation Based Acoustic Data Selection for Automatic Speech Recognition

Mortaza (Morrie) Doulaty; Thomas Hain

arxiv: 1907.01302 · v1 · pith:UDAYDFYRnew · submitted 2019-07-02 · 💻 cs.CL

Pith reviewed 2026-05-25 11:20 UTC · model grok-4.3

classification 💻 cs.CL

keywords data selection · Latent Dirichlet Allocation · automatic speech recognition · acoustic modeling · domain adaptation · meeting data · topic modeling

The pith

Acoustic topic modeling selects training data that improves speech recognition over random or full-pool approaches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to use acoustic Latent Dirichlet Allocation to choose which utterances from a large mixed pool of speech data best match a target domain for training automatic speech recognition systems. With a target of 32 hours of meeting recordings and a pool of 2000 hours spanning many styles, the method picks a subset that yields higher accuracy than picking at random, using posterior probabilities, or training on everything. A reader would care because mismatched training data hurts performance even for modern models, and careful selection can make better use of existing recordings without collecting new matched data.

Core claim

Acoustic Latent Dirichlet Allocation applied to acoustic features produces topic distributions that act as a similarity measure. Given representative target utterances from meeting data, these distributions allow selection of the most similar data from a 2k-hour pool, resulting in acoustic models that outperform those trained on randomly selected data, on data selected by posterior probabilities, or on the entire pool.

What carries the argument

Acoustic Latent Dirichlet Allocation (aLDA) topic distributions on acoustic features, which serve as the similarity criterion to rank and select training utterances from the pool.
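
As a rough illustration of the selection step, the sketch below keeps every pool utterance whose topic posterior lies within a distance threshold of the nearest target centroid. The excerpt of Algorithm 1 quoted in the reference graph lists training posteriors, dev-set centroids, and a distance threshold as inputs, but its body is truncated on this page, so the loop below is an assumption rather than the authors' procedure. The function names, the choice of cosine distance, and the toy vectors are all hypothetical.

```python
import math


def cosine_distance(p, q):
    """Cosine distance (1 - cosine similarity) between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)


def select_utterances(pool_posteriors, target_centroids, threshold):
    """Return ids of pool utterances close to at least one target centroid.

    pool_posteriors: dict mapping utterance id -> topic posterior vector.
    target_centroids: list of topic posterior vectors for the target domain.
    threshold: maximum allowed distance to the nearest centroid (assumed rule).
    """
    selected = []
    for utt_id, posterior in pool_posteriors.items():
        nearest = min(cosine_distance(posterior, c) for c in target_centroids)
        if nearest < threshold:
            selected.append(utt_id)
    return selected


# Toy example with three topics; the values are made up for illustration.
pool = {
    "meet_1": [0.7, 0.2, 0.1],
    "book_1": [0.1, 0.1, 0.8],
    "talk_1": [0.5, 0.4, 0.1],
}
centroids = [[0.75, 0.15, 0.10]]
chosen = select_utterances(pool, centroids, threshold=0.05)
```

Raising the threshold admits more of the pool, so in a sketch like this it is the knob that trades selected hours against closeness to the target.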

If this is right

Selected data leads to better ASR performance than the full pool of 2k hours.
The method works for both far-field and close-talk meeting data.
It beats random selection and posterior-based selection from the same pool.
Small representative sets from the target domain suffice to drive the selection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If aLDA captures acoustic domain similarity, the same selection could apply to other target domains like telephony or lectures.
Combining aLDA selection with other criteria might yield even stronger results.
This suggests data selection can reduce the need for large matched datasets in ASR training.

Load-bearing premise

That the topic distributions from aLDA on acoustic features reflect similarities that actually affect automatic speech recognition accuracy.

What would settle it

Training ASR models on aLDA-selected data and finding no accuracy gain or a loss compared to the baselines on the meeting test set would disprove the effectiveness of this selection method.
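
The test described above comes down to comparing word error rates across systems. This page does not define the metric, so the sketch below assumes the standard definition: word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words, pooled over the test set. The function names are hypothetical.

```python
def word_errors(reference, hypothesis):
    """Return (edit distance over words, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(pairs):
    """Pooled WER over (reference, hypothesis) transcript pairs."""
    errors = words = 0
    for ref, hyp in pairs:
        e, n = word_errors(ref, hyp)
        errors += e
        words += n
    return errors / words
```

Running `corpus_wer` on the same meeting test set for the aLDA-selected, random, posterior-based, and full-pool systems would give the four numbers the comparison needs.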

Figures

Figures reproduced from arXiv: 1907.01302 by Mortaza (Morrie) Doulaty, Thomas Hain.

**Figure 1.** Graphical model representation of the LDA model; the only observed variables are the $w_t$'s. $\alpha$ and $\beta$ are dataset-level parameters, $\theta_{\tilde{d}_i}$ is a document-level variable, and $z_t$ is a latent variable indicating the domain from which $w_t$ was drawn. The following joint distribution is the result of the generative process of LDA:

$$p(\theta, z, \bar{d} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{t=1}^{T} p(z_t \mid \theta)\, p(w_t \mid z_t, \beta) \tag{3}$$

The posterior distribution of the latent v…
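
To make the factorization in the Figure 1 caption concrete, the following sketch evaluates the log of that joint distribution for one document. It assumes the standard LDA parameterization, with `alpha` as Dirichlet concentration parameters and `beta[k][v]` as the probability of word `v` under topic (domain) `k`; the function names and toy values are hypothetical and not taken from the paper.

```python
import math


def log_dirichlet_pdf(theta, alpha):
    """Log density of a Dirichlet(alpha) distribution evaluated at theta."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))


def lda_log_joint(theta, z, w, alpha, beta):
    """Log of p(theta | alpha) * prod_t p(z_t | theta) * p(w_t | z_t, beta).

    theta: topic proportions for the document.
    z: topic index assigned to each position t.
    w: observed word index at each position t.
    """
    total = log_dirichlet_pdf(theta, alpha)
    for z_t, w_t in zip(z, w):
        total += math.log(theta[z_t]) + math.log(beta[z_t][w_t])
    return total


# Toy document with two topics and a two-word vocabulary (made-up values).
alpha = [1.0, 1.0]
beta = [[0.9, 0.1], [0.2, 0.8]]
value = lda_log_joint([0.5, 0.5], z=[0, 1], w=[0, 1], alpha=alpha, beta=beta)
```

In the acoustic setting the "words" would be discrete acoustic units rather than text tokens; how those units are derived is not described on this page.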

Original abstract

Selecting in-domain data from a large pool of diverse and out-of-domain data is a non-trivial problem. In most cases simply using all of the available data will lead to sub-optimal and in some cases even worse performance compared to carefully selecting a matching set. This is true even for data-inefficient neural models. Acoustic Latent Dirichlet Allocation (aLDA) is shown to be useful in a variety of speech technology related tasks, including domain adaptation of acoustic models for automatic speech recognition and entity labeling for information retrieval. In this paper we propose to use aLDA as a data similarity criterion in a data selection framework. Given a large pool of out-of-domain and potentially mismatched data, the task is to select the best-matching training data to a set of representative utterances sampled from a target domain. Our target data consists of around 32 hours of meeting data (both far-field and close-talk) and the pool contains 2k hours of meeting, talks, voice search, dictation, command-and-control, audio books, lectures, generic media and telephony speech data. The proposed technique for training data selection significantly outperforms random selection, posterior-based selection, and using all of the available data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

aLDA selection claims to beat random, posterior, and full-pool training for meeting ASR but the abstract supplies no numbers so the gain stays uncheckable.

The main thing to know is that the paper applies acoustic LDA topic distributions to rank and select training utterances from a 2k-hour mixed pool to match a 32-hour meeting target, and states that the resulting models beat random selection, a posterior baseline, and training on everything. That last comparison is useful because it shows when extra data actually hurts. The application of aLDA as a similarity measure for this exact selection task is the concrete new step; earlier uses of aLDA in speech were for adaptation or labeling, not subset ranking. The experimental frame is reasonable on its face: a diverse pool, a small representative target sample, and the inclusion of the full-pool condition as a control.

The soft spots are straightforward. The abstract asserts significant outperformance yet gives zero WER values, no selected data size, no error bars, and no mention of statistical tests, so the central claim cannot be assessed from what is shown. The assumption that unsupervised topics on acoustic features will separate data along the dimensions that actually move word error rate is plausible but untested in the description; there is no reported correlation between topic distance and per-utterance error, nor an ablation against simpler statistics such as mean MFCCs or i-vectors. That gap matches the stress-test concern and leaves open the possibility that any gain comes from incidental factors rather than domain match.

This work is aimed at ASR practitioners who already have large heterogeneous collections and need to curate subsets for a narrow target like meetings or far-field transcription. A reader in that position could extract a workable recipe to try. The paper deserves a serious referee because the setup is concrete, the baselines are sensible, and the idea is simple enough to verify or refute with the full results table. I would send it to review.

Referee Report

2 major / 1 minor

Summary. The paper proposes an aLDA-based framework to select a subset of training utterances from a 2k-hour heterogeneous pool (meeting, talks, voice search, dictation, etc.) that best matches a 32-hour target meeting domain (far-field and close-talk), using topic-distribution similarity to a small representative target sample. The central empirical claim is that this selection significantly outperforms random selection, posterior-based selection, and training on the entire pool for ASR acoustic-model performance.

Significance. If the superiority is robustly demonstrated, the method supplies a practical unsupervised technique for mitigating domain mismatch in ASR training data selection, addressing the known risk that including all available data can degrade performance even for data-inefficient neural models. It builds directly on prior uses of aLDA for domain adaptation and related speech tasks.

major comments (2)

[Experiments] Experiments section: the abstract asserts significant outperformance over random, posterior-based, and full-pool baselines, yet supplies no WER numbers, error bars, dataset splits, model architecture details, or statistical tests; this is load-bearing for the central empirical claim.
[§3] §3 (aLDA Data Selection): no correlation analysis is reported between aLDA topic distances (cosine or KL) and per-utterance WER on a held-out set, nor an ablation replacing aLDA topics with simpler acoustic statistics such as mean MFCC or i-vectors; this directly tests whether the unsupervised topics separate data along dimensions that actually drive ASR error rates.
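
The correlation analysis requested in the second major comment could look roughly like the sketch below: compute a KL divergence between each held-out utterance's topic posterior and a target centroid, then rank-correlate those distances with per-utterance WER. The referee does not name a correlation statistic, so the use of Spearman's rank correlation is an assumption, as are the function names and the smoothing constant `eps`.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions, lightly smoothed by eps."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

With `distances = [kl_divergence(p, centroid) for p in heldout_posteriors]` and a matching list of per-utterance WERs, a clearly positive `spearman(distances, wers)` would support the premise that topic distance tracks recognition difficulty; a value near zero would count against it.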

minor comments (1)

The abstract would be strengthened by a single sentence summarizing the magnitude of the reported gains and the evaluation metric.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and constructive comments. We address each major point below and will make the necessary revisions to strengthen the experimental reporting and validation of the aLDA approach.

Referee: [Experiments] Experiments section: the abstract asserts significant outperformance over random, posterior-based, and full-pool baselines, yet supplies no WER numbers, error bars, dataset splits, model architecture details, or statistical tests; this is load-bearing for the central empirical claim.

Authors: We agree that explicit numerical results, error bars, dataset split details, model architecture specifications, and statistical tests are required to substantiate the central claims. Although the experiments section contains comparative results, we will revise it to include full WER tables with absolute values, confidence intervals or standard deviations, precise descriptions of the 32-hour target and 2k-hour pool splits, the acoustic model architecture (e.g., DNN-HMM or TDNN details), and any significance testing performed.

revision: yes

Referee: [§3] §3 (aLDA Data Selection): no correlation analysis is reported between aLDA topic distances (cosine or KL) and per-utterance WER on a held-out set, nor an ablation replacing aLDA topics with simpler acoustic statistics such as mean MFCC or i-vectors; this directly tests whether the unsupervised topics separate data along dimensions that actually drive ASR error rates.

Authors: We recognize that demonstrating a direct link between aLDA topic distances and ASR error rates, as well as comparing against simpler acoustic features, would strengthen the justification for using topic modeling. We will add both a correlation analysis (using cosine and KL distances against held-out per-utterance WER) and an ablation study replacing aLDA with mean MFCC vectors and i-vectors in the revised manuscript.

revision: yes

Circularity Check

0 steps flagged

No derivation chain present; purely empirical comparison

The paper proposes an aLDA-based data selection method and reports empirical ASR WER improvements over random, posterior, and full-data baselines on meeting data. No equations, first-principles derivations, fitted parameters renamed as predictions, or uniqueness theorems are referenced in the provided abstract or description. The central claim rests on experimental results rather than any self-referential reduction of outputs to inputs. No load-bearing steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No specific free parameters, axioms, or invented entities are described in the provided abstract; all technical details remain at summary level.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 1 internal anchor

[1]

Introduction. Bootstrapping a speech recognition system for a new domain is a common practical problem. A typical scenario is to have some limited in-domain data from a target domain that the ASR system is being built for and a pool of out-of-domain data, often containing a diverse set of potentially mismatched data. Using all of the available data is not...

[2]

Acoustic Latent Dirichlet Allocation. As shown in our previous works [3, 4, 5], aLDA domain posteriors have a unique distribution across different domains that can be used to characterise the acoustic scenery. In this work we make use of aLDA domain posterior features as a basis of acoustic similarity in a data selection problem. The idea is that using a...

[3]

Experimental Setup. To evaluate the effectiveness of aLDA for data-selection in ASR, we are trying to solve this practical problem: given a small set of in-domain data and a large pool of out-of-domain and potentially mismatched data, what’s the best set of data that can be selected from the pool to train a model for the in-domain data. 3.1. Data. The in-do...

[4]

as well as language modelling tasks [15, 16, 17, 18, 19]. Training tLDA models followed a similar procedure to aLDA

Algorithm 1. Data-selection based on Dirichlet posterior. Input: training data $S^{trn}$ of $M$ utterances, training set Dirichlet posteriors $\{\gamma^{trn}_1, \ldots, \gamma^{trn}_M\}$, dev set posterior centroids $\{\gamma^{dev}_1, \ldots, \gamma^{dev}_N\}$, distance threshold $\lambda$. Initialize: Sne...

[5]

Conclusions. Selecting matching data to a small set of in-domain data from a large pool of out-of-domain and mismatched data is a non-trivial problem. This problem arises in many practical applications of speech recognition where the task is to build an ASR system for a new target domain where a very limited amount of data is available. Often u...

[6]

Acknowledgements. The first author would like to thank Trevor Francis for supporting parts of this work

[7]

K. Wei, Y. Liu, K. Kirchhoff, and J. Bilmes, “Unsupervised submodular subset selection for speech data,” in Proc. of ICASSP, Florence, Italy, 2014

[8]

M. Doulaty, O. Saz, and T. Hain, “Data-selective transfer learning for multi-domain speech recognition,” in Proc. of Interspeech, Dresden, Germany, 2015

[9]

——, “Unsupervised domain discovery using latent dirichlet allocation for acoustic modelling in speech recognition,” in Proc. of Interspeech, Dresden, Germany, 2015

[10]

M. Doulaty, O. Saz, R. W. M. Ng, and T. Hain, “Latent Dirichlet Allocation Based Organisation of Broadcast Media Archives for Deep Neural Network Adaptation,” in Proc. of ASRU, Arizona, USA, 2015

[11]

——, “Automatic Genre and Show Identification of Broadcast Media,” in Proc. of Interspeech, California, USA, 2016

[12]

S. Kim, S. Narayanan, and S. Sundaram, “Acoustic topic model for audio information retrieval,” in Proc. of WASPAA, New Paltz NY, USA, 2009

[13]

S. Kim, P. Georgiou, and S. Narayanan, “On-line genre classification of TV programs using audio content,” in Proc. of ICASSP, Vancouver, Canada, 2013

[14]

D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet Allocation,” Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003

[15]

M. Rouvier, D. Matrouf, and G. Linares, “Factor analysis for audio-based video genre classification,” in Proc. of Interspeech, Brighton, UK, 2009

[16]

K. Heafield, I. Pouzyrevsky, J. H. Clark, and P. Koehn, “Scalable modified kneser-ney language model estimation,” in Proc. of ACL, Sofia, Bulgaria, 2013

[17]

O. Tange, “Gnu parallel - the command-line power tool,” ;login: The USENIX Magazine, vol. 36, no. 1, pp. 42–47, Feb 2011. [Online]. Available: http://www.gnu.org/s/parallel

[18]

“CMU Sequence-to-Sequence G2P toolkit,” https://github.com/cmusphinx/g2p-seq2seq

[19]

D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequence-trained neural networks for asr based on lattice-free mmi,” in Proc. of Interspeech, California, USA, 2016

[20]

D. Povey, A. Ghoshal, G. Boulianne, L. Burget et al., “The kaldi speech recognition toolkit,” IEEE Signal Processing Society, Tech. Rep., 2011

[21]

O. Saz, M. Doulaty, S. Deena, R. Milner, R. W. M. Ng, M. Hasan, Y. Liu, and T. Hain, “The 2015 Sheffield system for transcription of multi-genre broadcast media,” in Proc. of ASRU, Arizona, USA, 2015

[22]

S. Deena, M. Hasan, M. Doulaty, O. Saz, and T. Hain, “Combining feature and model-based adaptation of rnnlms for multi-genre broadcast speech recognition,” in Proc. of Interspeech, California, USA, 2016

[23]

S. Deena, R. W. Ng, P. Madhyashta, L. Specia, and T. Hain, “Semi-supervised adaptation of RNNLMs by fine-tuning with domain-specific auxiliary features,” in Proc. of Interspeech, Stockholm, Sweden, 2017

[24]

S. Deena, M. Hasan, M. Doulaty, O. Saz, and T. Hain, “Recurrent neural network language model adaptation for multi-genre broadcast speech recognition and alignment,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 27, no. 3, pp. 572–582, 2019

[25]

O. Saz, S. Deena, M. Doulaty, M. Hasan, B. Khaliq, R. Milner, R. W. M. Ng, J. Olcoz, and T. Hain, “Lightly supervised alignment of subtitles on multi-genre broadcasts,” Multimedia Tools and Applications, vol. 77, no. 23, pp. 30 533–30 550, 2018

[26]

M. Doulaty, R. Rose, and O. Siohan, “Automatic optimization of data perturbation distributions for multi-style training in speech recognition,” in Proc. of SLT, California, USA, 2016

[27]

J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, W. Karaiskos, Vasilis Kraaij, M. Kronenthal, G. Lathoud, M. Lincoln, A. Lisowska, I. McCowan, W. Post, D. Reidsma, and P. Wellner, “The AMI meeting corpus: A pre-announcement,” in Proc. of MLMI, Bethesda, USA, 2006

[28]

A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, and C. Wooters, “The ICSI meeting corpus,” in Proc. of ICASSP, Hong Kong, 2003
