Latent Dirichlet Allocation Based Acoustic Data Selection for Automatic Speech Recognition
Pith reviewed 2026-05-25 11:20 UTC · model grok-4.3
The pith
Acoustic topic modeling selects training data that improves speech recognition over random or full-pool approaches.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Acoustic Latent Dirichlet Allocation applied to acoustic features produces topic distributions that act as a similarity measure. Given representative target utterances from meeting data, these distributions allow selection of the most similar data from a 2k-hour pool, resulting in acoustic models that outperform those trained on randomly selected data, on data selected by posterior probabilities, or on the entire pool.
What carries the argument
Acoustic Latent Dirichlet Allocation (aLDA) topic distributions on acoustic features, which serve as the similarity criterion to rank and select training utterances from the pool.
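A minimal sketch of how such a ranking could work, assuming every utterance has already been reduced to a topic posterior vector. The function names, the choice of cosine similarity against the mean target vector, and the toy vectors are illustrative assumptions, not the paper's implementation.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_top_k(pool, target, k):
    """Return the k pool utterance ids most similar to the target centroid.

    pool:   dict mapping utterance id -> topic posterior (hypothetical layout)
    target: list of topic posteriors from the representative target set
    """
    centroid = mean_vector(target)
    ranked = sorted(
        pool,
        key=lambda utt_id: cosine_similarity(pool[utt_id], centroid),
        reverse=True,
    )
    return ranked[:k]


# Toy three-topic posteriors, invented purely for illustration.
target = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
pool = {
    "meeting_a": [0.65, 0.25, 0.10],
    "audiobook_b": [0.05, 0.15, 0.80],
    "telephony_c": [0.20, 0.70, 0.10],
}
selected = select_top_k(pool, target, k=2)
```

Ranking against a single centroid is the simplest design; a target set that mixes far-field and close-talk audio might be better served by several centroids, which this sketch does not attempt.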
If this is right
- Selected data leads to better ASR performance than the full pool of 2k hours.
- The method works for both far-field and close-talk meeting data.
- It beats random selection and posterior-based selection from the same pool.
- Small representative sets from the target domain suffice to drive the selection.
Where Pith is reading between the lines
- If aLDA captures acoustic domain similarity, the same selection could apply to other target domains like telephony or lectures.
- Combining aLDA selection with other criteria might yield even stronger results.
- This suggests data selection can reduce the need for large matched datasets in ASR training.
Load-bearing premise
That the topic distributions from aLDA on acoustic features reflect similarities that actually affect automatic speech recognition accuracy.
What would settle it
Training ASR models on aLDA-selected data and finding no accuracy gain or a loss compared to the baselines on the meeting test set would disprove the effectiveness of this selection method.
Original abstract
Selecting in-domain data from a large pool of diverse and out-of-domain data is a non-trivial problem. In most cases simply using all of the available data will lead to sub-optimal and in some cases even worse performance compared to carefully selecting a matching set. This is true even for data-inefficient neural models. Acoustic Latent Dirichlet Allocation (aLDA) is shown to be useful in a variety of speech technology related tasks, including domain adaptation of acoustic models for automatic speech recognition and entity labeling for information retrieval. In this paper we propose to use aLDA as a data similarity criterion in a data selection framework. Given a large pool of out-of-domain and potentially mismatched data, the task is to select the best-matching training data to a set of representative utterances sampled from a target domain. Our target data consists of around 32 hours of meeting data (both far-field and close-talk) and the pool contains 2k hours of meeting, talks, voice search, dictation, command-and-control, audio books, lectures, generic media and telephony speech data. The proposed technique for training data selection significantly outperforms random selection, posterior-based selection as well as using all of the available data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an aLDA-based framework to select a subset of training utterances from a 2k-hour heterogeneous pool (meeting, talks, voice search, dictation, etc.) that best matches a 32-hour target meeting domain (far-field and close-talk), using topic-distribution similarity to a small representative target sample. The central empirical claim is that this selection significantly outperforms random selection, posterior-based selection, and training on the entire pool for ASR acoustic-model performance.
Significance. If the superiority is robustly demonstrated, the method supplies a practical unsupervised technique for mitigating domain mismatch in ASR training data selection, addressing the known risk that including all available data can degrade performance even for data-inefficient neural models. It builds directly on prior uses of aLDA for domain adaptation and related speech tasks.
major comments (2)
- [Experiments] Experiments section: the abstract asserts significant outperformance over random, posterior-based, and full-pool baselines, yet supplies no WER numbers, error bars, dataset splits, model architecture details, or statistical tests; this is load-bearing for the central empirical claim.
- [§3] §3 (aLDA Data Selection): no correlation analysis is reported between aLDA topic distances (cosine or KL) and per-utterance WER on a held-out set, nor an ablation replacing aLDA topics with simpler acoustic statistics such as mean MFCC or i-vectors; this directly tests whether the unsupervised topics separate data along dimensions that actually drive ASR error rates.
minor comments (1)
- The abstract would be strengthened by a single sentence summarizing the magnitude of the reported gains and the evaluation metric.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and constructive comments. We address each major point below and will make the necessary revisions to strengthen the experimental reporting and validation of the aLDA approach.
Point-by-point responses
- Referee: [Experiments] Experiments section: the abstract asserts significant outperformance over random, posterior-based, and full-pool baselines, yet supplies no WER numbers, error bars, dataset splits, model architecture details, or statistical tests; this is load-bearing for the central empirical claim.
  Authors: We agree that explicit numerical results, error bars, dataset split details, model architecture specifications, and statistical tests are required to substantiate the central claims. Although the experiments section contains comparative results, we will revise it to include full WER tables with absolute values, confidence intervals or standard deviations, precise descriptions of the 32-hour target and 2k-hour pool splits, the acoustic model architecture (e.g., DNN-HMM or TDNN details), and any significance testing performed.
  Revision: yes
- Referee: [§3] §3 (aLDA Data Selection): no correlation analysis is reported between aLDA topic distances (cosine or KL) and per-utterance WER on a held-out set, nor an ablation replacing aLDA topics with simpler acoustic statistics such as mean MFCC or i-vectors; this directly tests whether the unsupervised topics separate data along dimensions that actually drive ASR error rates.
  Authors: We recognize that demonstrating a direct link between aLDA topic distances and ASR error rates, as well as comparing against simpler acoustic features, would strengthen the justification for using topic modeling. We will add both a correlation analysis (using cosine and KL distances against held-out per-utterance WER) and an ablation study replacing aLDA with mean MFCC vectors and i-vectors in the revised manuscript.
  Revision: yes
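The correlation analysis discussed in this exchange could be set up roughly as follows. This is a sketch under stated assumptions: the direction of the KL divergence, the use of Spearman rank correlation, every function name, and all numbers (which are made-up placeholders, not results from the paper) are illustrative only.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))


def cosine_distance(p, q):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation, computed as Pearson on the ranks."""
    return pearson(ranks(x), ranks(y))


# Placeholder inputs: a target centroid, held-out topic posteriors, and
# per-utterance WER values. All of these are invented for illustration.
centroid = [0.65, 0.25, 0.10]
held_out = [[0.6, 0.3, 0.1], [0.4, 0.4, 0.2], [0.2, 0.3, 0.5], [0.1, 0.1, 0.8]]
wer = [0.18, 0.27, 0.35, 0.49]

cos_d = [cosine_distance(u, centroid) for u in held_out]
kl_d = [kl_divergence(u, centroid) for u in held_out]
rho_cos = spearman(cos_d, wer)
rho_kl = spearman(kl_d, wer)
```

A rank correlation is used here because it only asks whether larger distances go with larger WER, without assuming the relationship is linear.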
Circularity Check
No derivation chain present; purely empirical comparison
The paper proposes an aLDA-based data selection method and reports empirical ASR WER improvements over random, posterior, and full-data baselines on meeting data. No equations, first-principles derivations, fitted parameters renamed as predictions, or uniqueness theorems are referenced in the provided abstract or description. The central claim rests on experimental results rather than any self-referential reduction of outputs to inputs. No load-bearing steps match the enumerated circularity patterns.
Reference graph
Works this paper leans on
- [1] Introduction. Bootstrapping a speech recognition system for a new domain is a common practical problem. A typical scenario is to have some limited in-domain data from a target domain that the ASR system is being built for and a pool of out-of-domain data, often containing a diverse set of potentially mismatched data. Using all of the available data is not...
- [2] Acoustic Latent Dirichlet Allocation. As shown in our previous works [3, 4, 5], aLDA domain posteriors have a unique distribution across different domains that can be used to characterise the acoustic scenery. In this work we make use of aLDA domain posterior features as a basis of acoustic similarity in a data selection problem. The idea is that using a...
- [3] Experimental Setup. To evaluate the effectiveness of aLDA for data selection in ASR, we address the following practical problem: given a small set of in-domain data and a large pool of out-of-domain and potentially mismatched data, what is the best set of data that can be selected from the pool to train a model for the in-domain data? 3.1. Data. The in-do...
- [4] as well as language modelling tasks [15, 16, 17, 18, 19]. Training tLDA models followed a similar procedure to aLDA
  Algorithm 1: Data selection based on Dirichlet posterior
  Input: training data $S^{trn}$ of $M$ utterances, training-set Dirichlet posteriors $\{\gamma^{trn}_1, \ldots, \gamma^{trn}_M\}$, dev-set posterior centroids $\{\gamma^{dev}_1, \ldots, \gamma^{dev}_N\}$, distance threshold $\lambda$
  Initialize: Sne...
- [5] Conclusions. Selecting matching data to a small set of in-domain data from a large pool of out-of-domain and mismatched data is a non-trivial problem. This problem arises in many practical applications of speech recognition where the task is to build an ASR system for a new target domain where only a very limited amount of data is available. Often u...
- [6] Acknowledgements. The first author would like to thank Trevor Francis for supporting parts of this work.
- [7] K. Wei, Y. Liu, K. Kirchhoff, and J. Bilmes, “Unsupervised submodular subset selection for speech data,” in Proc. of ICASSP, Florence, Italy, 2014.
- [8] M. Doulaty, O. Saz, and T. Hain, “Data-selective transfer learning for multi-domain speech recognition,” in Proc. of Interspeech, Dresden, Germany, 2015.
- [9] M. Doulaty, O. Saz, and T. Hain, “Unsupervised domain discovery using latent dirichlet allocation for acoustic modelling in speech recognition,” in Proc. of Interspeech, Dresden, Germany, 2015.
- [10] M. Doulaty, O. Saz, R. W. M. Ng, and T. Hain, “Latent Dirichlet Allocation Based Organisation of Broadcast Media Archives for Deep Neural Network Adaptation,” in Proc. of ASRU, Arizona, USA, 2015.
- [11] M. Doulaty, O. Saz, R. W. M. Ng, and T. Hain, “Automatic Genre and Show Identification of Broadcast Media,” in Proc. of Interspeech, California, USA, 2016.
- [12] S. Kim, S. Narayanan, and S. Sundaram, “Acoustic topic model for audio information retrieval,” in Proc. of WASPAA, New Paltz, NY, USA, 2009.
- [13] S. Kim, P. Georgiou, and S. Narayanan, “On-line genre classification of TV programs using audio content,” in Proc. of ICASSP, Vancouver, Canada, 2013.
- [14] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet Allocation,” Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
- [15] M. Rouvier, D. Matrouf, and G. Linares, “Factor analysis for audio-based video genre classification,” in Proc. of Interspeech, Brighton, UK, 2009.
- [16] K. Heafield, I. Pouzyrevsky, J. H. Clark, and P. Koehn, “Scalable modified kneser-ney language model estimation,” in Proc. of ACL, Sofia, Bulgaria, 2013.
- [17] O. Tange, “Gnu parallel - the command-line power tool,” ;login: The USENIX Magazine, vol. 36, no. 1, pp. 42–47, Feb 2011. [Online]. Available: http://www.gnu.org/s/parallel
- [18] “CMU Sequence-to-Sequence G2P toolkit,” https://github.com/cmusphinx/g2p-seq2seq
- [19] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequence-trained neural networks for asr based on lattice-free mmi,” in Proc. of Interspeech, California, USA, 2016.
- [20] D. Povey, A. Ghoshal, G. Boulianne, L. Burget et al., “The kaldi speech recognition toolkit,” IEEE Signal Processing Society, Tech. Rep., 2011.
- [21] O. Saz, M. Doulaty, S. Deena, R. Milner, R. W. M. Ng, M. Hasan, Y. Liu, and T. Hain, “The 2015 Sheffield system for transcription of multi-genre broadcast media,” in Proc. of ASRU, Arizona, USA, 2015.
- [22] S. Deena, M. Hasan, M. Doulaty, O. Saz, and T. Hain, “Combining feature and model-based adaptation of rnnlms for multi-genre broadcast speech recognition,” in Proc. of Interspeech, California, USA, 2016.
- [23] S. Deena, R. W. Ng, P. Madhyashta, L. Specia, and T. Hain, “Semi-supervised adaptation of RNNLMs by fine-tuning with domain-specific auxiliary features,” in Proc. of Interspeech, Stockholm, Sweden, 2017.
- [24] S. Deena, M. Hasan, M. Doulaty, O. Saz, and T. Hain, “Recurrent neural network language model adaptation for multi-genre broadcast speech recognition and alignment,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 27, no. 3, pp. 572–582, 2019.
- [25] O. Saz, S. Deena, M. Doulaty, M. Hasan, B. Khaliq, R. Milner, R. W. M. Ng, J. Olcoz, and T. Hain, “Lightly supervised alignment of subtitles on multi-genre broadcasts,” Multimedia Tools and Applications, vol. 77, no. 23, pp. 30 533–30 550, 2018.
- [26] M. Doulaty, R. Rose, and O. Siohan, “Automatic optimization of data perturbation distributions for multi-style training in speech recognition,” in Proc. of SLT, California, USA, 2016.
- [27] J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal, G. Lathoud, M. Lincoln, A. Lisowska, I. McCowan, W. Post, D. Reidsma, and P. Wellner, “The AMI meeting corpus: A pre-announcement,” in Proc. of MLMI, Bethesda, USA, 2006.
- [28] A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, and C. Wooters, “The ICSI meeting corpus,” in Proc. of ICASSP, Hong Kong, 2003.
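The Algorithm 1 excerpt listed under [4] names its inputs (training-set Dirichlet posteriors, dev-set posterior centroids, and a distance threshold λ) but is cut off before the loop body. The sketch below is therefore only one plausible reading: it assumes posteriors are normalised to distributions, uses a symmetrised KL divergence as the distance, and keeps an utterance when its nearest centroid lies within λ. All names and values are hypothetical.

```python
import math


def normalise(gamma):
    """Scale non-negative Dirichlet parameters so they sum to one."""
    total = sum(gamma)
    return [g / total for g in gamma]


def symmetric_kl(p, q, eps=1e-12):
    """Symmetrised KL divergence; this choice of distance is an assumption."""
    kl_pq = sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))
    kl_qp = sum(b * math.log((b + eps) / (a + eps)) for a, b in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)


def select_by_threshold(train_posteriors, dev_centroids, lam):
    """Keep training utterances whose nearest dev centroid is within lam.

    train_posteriors: dict mapping utterance id -> Dirichlet posterior
    dev_centroids:    list of centroid vectors from the dev (target) set
    lam:              distance threshold (the lambda named in the excerpt)
    """
    centroids = [normalise(c) for c in dev_centroids]
    selected = []
    for utt_id, gamma in train_posteriors.items():
        theta = normalise(gamma)
        nearest = min(symmetric_kl(theta, c) for c in centroids)
        if nearest <= lam:
            selected.append(utt_id)
    return selected


# Toy values invented for illustration only.
train = {"utt1": [8.0, 1.5, 0.5], "utt2": [0.5, 1.0, 8.5]}
dev_centroids = [[0.7, 0.2, 0.1]]
chosen = select_by_threshold(train, dev_centroids, lam=0.1)
```

With a threshold rather than a fixed budget, the size of the selected set depends on λ, so in practice λ would need tuning against the amount of data wanted.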