
arxiv: 1907.01302 · v1 · pith:UDAYDFYRnew · submitted 2019-07-02 · 💻 cs.CL

Latent Dirichlet Allocation Based Acoustic Data Selection for Automatic Speech Recognition

Pith reviewed 2026-05-25 11:20 UTC · model grok-4.3

classification 💻 cs.CL
keywords: data selection, Latent Dirichlet Allocation, automatic speech recognition, acoustic modeling, domain adaptation, meeting data, topic modeling

The pith

Acoustic topic modeling selects training data that improves speech recognition over random or full-pool approaches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to use acoustic Latent Dirichlet Allocation to choose which utterances from a large mixed pool of speech data best match a target domain for training automatic speech recognition systems. With a target of around 32 hours of meeting recordings and a pool of 2,000 hours spanning many speaking styles, the method picks a subset that yields higher accuracy than random selection, posterior-based selection, or training on the entire pool. A reader would care because mismatched training data hurts performance even for modern neural models, and careful selection can make better use of existing recordings without collecting new matched data.

Core claim

Acoustic Latent Dirichlet Allocation applied to acoustic features produces topic distributions that act as a similarity measure. Given representative target utterances from meeting data, these distributions allow selection of the most similar data from a 2k-hour pool, resulting in acoustic models that outperform those trained on randomly selected data, on data selected by posterior probabilities, or on the entire pool.

What carries the argument

Acoustic Latent Dirichlet Allocation (aLDA) topic distributions on acoustic features, which serve as the similarity criterion to rank and select training utterances from the pool.
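
As a rough illustration of ranking by topic similarity, the sketch below scores each pool utterance by cosine similarity to the mean topic vector of the target set and keeps the best matches until an hour budget is filled. The function name `select_by_similarity`, the use of cosine similarity, the hour budget, and the toy numbers are all assumptions made for illustration; the paper's actual criterion may differ.

```python
import numpy as np

def select_by_similarity(pool_topics, pool_hours, target_topics, budget_hours):
    """Rank pool utterances by cosine similarity to the mean target topic
    vector and keep the most similar ones until the hour budget is used."""
    centroid = target_topics.mean(axis=0)
    # Cosine similarity between every pool row and the target centroid.
    norms = np.linalg.norm(pool_topics, axis=1) * np.linalg.norm(centroid)
    sims = pool_topics @ centroid / np.maximum(norms, 1e-12)
    order = np.argsort(-sims)  # most similar first
    selected, used = [], 0.0
    for idx in order:
        if used + pool_hours[idx] > budget_hours:
            break
        selected.append(int(idx))
        used += pool_hours[idx]
    return selected

# Toy data (invented): 3 topics, 4 pool utterances, 2 target utterances.
pool = np.array([[0.8, 0.1, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.7, 0.2, 0.1],
                 [0.1, 0.1, 0.8]])
hours = np.array([1.0, 1.0, 1.0, 1.0])
target = np.array([[0.9, 0.05, 0.05],
                   [0.7, 0.2, 0.1]])
chosen = select_by_similarity(pool, hours, target, budget_hours=2.0)
print(chosen)
```

With these toy values the two pool utterances dominated by the first topic are the ones kept, since they lie closest to the target centroid.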

If this is right

  • Selected data leads to better ASR performance than the full pool of 2k hours.
  • The method works for both far-field and close-talk meeting data.
  • It beats random selection and posterior-based selection from the same pool.
  • Small representative sets from the target domain suffice to drive the selection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If aLDA captures acoustic domain similarity, the same selection could apply to other target domains like telephony or lectures.
  • Combining aLDA selection with other criteria might yield even stronger results.
  • This suggests data selection can reduce the need for large matched datasets in ASR training.

Load-bearing premise

That the topic distributions from aLDA on acoustic features reflect similarities that actually affect automatic speech recognition accuracy.

What would settle it

Training ASR models on aLDA-selected data and finding no accuracy gain or a loss compared to the baselines on the meeting test set would disprove the effectiveness of this selection method.
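
The comparison described above rests on word error rate (WER). As a minimal sketch, assuming whitespace-tokenized transcripts and the usual definition of WER as word-level edit distance divided by reference length, it can be computed as follows. The helper name `word_error_rate` and the example sentences are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with a standard dynamic-programming edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# One substitution and one deletion against a six-word reference.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.3f}")
```

Running the same function over the meeting test set for each system (aLDA-selected, random, posterior-based, full pool) would give the numbers needed for the comparison.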

Figures

Figures reproduced from arXiv: 1907.01302 by Mortaza (Morrie) Doulaty, Thomas Hain.

Figure 1
Figure 1: Graphical model representation of the LDA model. The only observed variables are the $w_t$. $\alpha$ and $\beta$ are dataset-level parameters, $\theta_{\tilde{d}_i}$ is a document-level variable, and $z_t$ is a latent variable indicating the domain from which $w_t$ was drawn.

The following joint distribution is the result of the generative process of LDA:

$$p(\theta, z, \bar{d} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{t=1}^{T} p(z_t \mid \theta)\, p(w_t \mid z_t, \beta) \tag{3}$$

The posterior distribution of the latent v…
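
The generative process named in the caption can be sketched in a few lines of NumPy. This is an illustrative sampler for standard LDA, not the authors' code: the function name `generate_document`, the vocabulary of six "acoustic words", and the specific values of `alpha` and `beta` are assumptions.

```python
import numpy as np

def generate_document(alpha, beta, length, rng):
    """Sample one document from the LDA generative process.

    alpha: (K,) Dirichlet prior over topics (domains).
    beta:  (K, V) per-topic distributions over a vocabulary of V words.
    """
    theta = rng.dirichlet(alpha)                      # theta ~ Dir(alpha)
    z = rng.choice(len(alpha), size=length, p=theta)  # z_t ~ Mult(theta)
    # w_t ~ Mult(beta[z_t]): each word is drawn from its topic's distribution.
    w = np.array([rng.choice(beta.shape[1], p=beta[k]) for k in z])
    return theta, z, w

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5, 0.5])          # K = 3 topics (assumed)
beta = np.array([[0.7, 0.1, 0.1, 0.05, 0.03, 0.02],
                 [0.05, 0.05, 0.7, 0.1, 0.05, 0.05],
                 [0.02, 0.03, 0.05, 0.1, 0.1, 0.7]])
theta, z, w = generate_document(alpha, beta, length=20, rng=rng)
print(theta.round(3), w)
```

Only `w` would be observed in practice; inference recovers a posterior over `theta` for each document, which is the kind of per-utterance topic distribution the selection method relies on.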
Original abstract

Selecting in-domain data from a large pool of diverse and out-of-domain data is a non-trivial problem. In most cases, simply using all of the available data will lead to sub-optimal and in some cases even worse performance compared to carefully selecting a matching set. This is true even for data-inefficient neural models. Acoustic Latent Dirichlet Allocation (aLDA) is shown to be useful in a variety of speech technology related tasks, including domain adaptation of acoustic models for automatic speech recognition and entity labeling for information retrieval. In this paper we propose to use aLDA as a data similarity criterion in a data selection framework. Given a large pool of out-of-domain and potentially mismatched data, the task is to select the best-matching training data to a set of representative utterances sampled from a target domain. Our target data consists of around 32 hours of meeting data (both far-field and close-talk) and the pool contains 2k hours of meeting, talks, voice search, dictation, command-and-control, audio books, lectures, generic media and telephony speech data. The proposed technique for training data selection significantly outperforms random selection, posterior-based selection, and using all of the available data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes an aLDA-based framework to select a subset of training utterances from a 2k-hour heterogeneous pool (meeting, talks, voice search, dictation, etc.) that best matches a 32-hour target meeting domain (far-field and close-talk), using topic-distribution similarity to a small representative target sample. The central empirical claim is that this selection significantly outperforms random selection, posterior-based selection, and training on the entire pool for ASR acoustic-model performance.

Significance. If the superiority is robustly demonstrated, the method supplies a practical unsupervised technique for mitigating domain mismatch in ASR training data selection, addressing the known risk that including all available data can degrade performance even for data-inefficient neural models. It builds directly on prior uses of aLDA for domain adaptation and related speech tasks.

major comments (2)
  1. [Experiments] Experiments section: the abstract asserts significant outperformance over random, posterior-based, and full-pool baselines, yet supplies no WER numbers, error bars, dataset splits, model architecture details, or statistical tests; this is load-bearing for the central empirical claim.
  2. [§3] §3 (aLDA Data Selection): no correlation analysis is reported between aLDA topic distances (cosine or KL) and per-utterance WER on a held-out set, nor an ablation replacing aLDA topics with simpler acoustic statistics such as mean MFCC or i-vectors; this directly tests whether the unsupervised topics separate data along dimensions that actually drive ASR error rates.
minor comments (1)
  1. The abstract would be strengthened by a single sentence summarizing the magnitude of the reported gains and the evaluation metric.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and constructive comments. We address each major point below and will make the necessary revisions to strengthen the experimental reporting and validation of the aLDA approach.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the abstract asserts significant outperformance over random, posterior-based, and full-pool baselines, yet supplies no WER numbers, error bars, dataset splits, model architecture details, or statistical tests; this is load-bearing for the central empirical claim.

    Authors: We agree that explicit numerical results, error bars, dataset split details, model architecture specifications, and statistical tests are required to substantiate the central claims. Although the experiments section contains comparative results, we will revise it to include full WER tables with absolute values, confidence intervals or standard deviations, precise descriptions of the 32-hour target and 2k-hour pool splits, the acoustic model architecture (e.g., DNN-HMM or TDNN details), and any significance testing performed. revision: yes

  2. Referee: [§3] §3 (aLDA Data Selection): no correlation analysis is reported between aLDA topic distances (cosine or KL) and per-utterance WER on a held-out set, nor an ablation replacing aLDA topics with simpler acoustic statistics such as mean MFCC or i-vectors; this directly tests whether the unsupervised topics separate data along dimensions that actually drive ASR error rates.

    Authors: We recognize that demonstrating a direct link between aLDA topic distances and ASR error rates, as well as comparing against simpler acoustic features, would strengthen the justification for using topic modeling. We will add both a correlation analysis (using cosine and KL distances against held-out per-utterance WER) and an ablation study replacing aLDA with mean MFCC vectors and i-vectors in the revised manuscript. revision: yes
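
A correlation analysis of the kind discussed in this exchange could look roughly like the sketch below. It assumes per-utterance topic vectors, a target centroid, and per-utterance WERs are already available; the helper names, the smoothing constant `eps`, the choice of Pearson correlation, and all numbers are hypothetical and serve only to show the shape of the computation.

```python
import numpy as np

def cosine_distance(p, q):
    """1 - cosine similarity between two topic vectors."""
    return 1.0 - float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) after smoothing and renormalizing both distributions."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def distance_wer_correlation(topics, centroid, wers, distance):
    """Pearson correlation between distance-to-target and per-utterance WER."""
    dists = np.array([distance(t, centroid) for t in topics])
    return float(np.corrcoef(dists, wers)[0, 1])

# Invented held-out data: 4 utterances, 3 topics, one target centroid.
topics = np.array([[0.8, 0.1, 0.1],
                   [0.6, 0.3, 0.1],
                   [0.3, 0.5, 0.2],
                   [0.1, 0.2, 0.7]])
centroid = np.array([0.8, 0.1, 0.1])
wers = np.array([0.20, 0.25, 0.35, 0.50])

r_cos = distance_wer_correlation(topics, centroid, wers, cosine_distance)
r_kl = distance_wer_correlation(topics, centroid, wers, kl_divergence)
print(f"cosine: r={r_cos:.3f}  KL: r={r_kl:.3f}")
```

A strong positive correlation on real held-out data would support the premise that topic distance tracks ASR difficulty; a weak one would suggest the topics capture variation that does not matter for recognition.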

Circularity Check

0 steps flagged

No derivation chain present; purely empirical comparison

Full rationale

The paper proposes an aLDA-based data selection method and reports empirical ASR WER improvements over random, posterior, and full-data baselines on meeting data. No equations, first-principles derivations, fitted parameters renamed as predictions, or uniqueness theorems are referenced in the provided abstract or description. The central claim rests on experimental results rather than any self-referential reduction of outputs to inputs. No load-bearing steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No specific free parameters, axioms, or invented entities are described in the provided abstract; all technical details remain at summary level.



Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 1 internal anchor

  1. [1]

    Introduction. Bootstrapping a speech recognition system for a new domain is a common practical problem. A typical scenario is to have some limited in-domain data from a target domain that the ASR system is being built for and a pool of out-of-domain data, often containing a diverse set of potentially mismatched data. Using all of the available data is not...

  2. [2]

    Latent Dirichlet Allocation Based Acoustic Data Selection for Automatic Speech Recognition

    Acoustic Latent Dirichlet Allocation. As shown in our previous works [3, 4, 5], aLDA domain posteriors have a unique distribution across different domains that can be used to characterise the acoustic scenery. In this work we make use of aLDA domain posterior features as a basis of acoustic similarity in a data selection problem. The idea is that using a...

  3. [3]

    Experimental Setup. To evaluate the effectiveness of aLDA for data selection in ASR, we address this practical problem: given a small set of in-domain data and a large pool of out-of-domain and potentially mismatched data, what is the best set of data that can be selected from the pool to train a model for the in-domain data? 3.1. Data. The in-do...

  4. [4]

    as well as language modelling tasks [15, 16, 17, 18, 19]. Training tLDA models followed a similar procedure to aLDA.

    Algorithm 1: Data selection based on Dirichlet posterior. Input: training data $S_{\mathrm{trn}}$ of $M$ utterances; training set Dirichlet posteriors $\{\gamma^{\mathrm{trn}}_1, \ldots, \gamma^{\mathrm{trn}}_M\}$; dev set posterior centroids $\{\gamma^{\mathrm{dev}}_1, \ldots, \gamma^{\mathrm{dev}}_N\}$; distance threshold $\lambda$. Initialize: Sne...

  5. [5]

    Conclusions. Selecting data that matches a small set of in-domain data from a large pool of out-of-domain and mismatched data is a non-trivial problem. This problem arises in many practical applications of speech recognition where the task is to build an ASR system for a new target domain for which only a very limited amount of data is available. Often u...

  6. [6]

    Acknowledgements. The first author would like to thank Trevor Francis for supporting parts of this work.

  7. [7]

    Unsupervised submodular subset selection for speech data

    K. Wei, Y. Liu, K. Kirchhoff, and J. Bilmes, “Unsupervised submodular subset selection for speech data,” in Proc. of ICASSP, Florence, Italy, 2014

  8. [8]

    Data-selective transfer learning for multi-domain speech recognition

    M. Doulaty, O. Saz, and T. Hain, “Data-selective transfer learning for multi-domain speech recognition,” in Proc. of Interspeech, Dresden, Germany, 2015

  9. [9]

    Unsupervised domain discovery using latent Dirichlet allocation for acoustic modelling in speech recognition

    M. Doulaty, O. Saz, and T. Hain, “Unsupervised domain discovery using latent Dirichlet allocation for acoustic modelling in speech recognition,” in Proc. of Interspeech, Dresden, Germany, 2015

  10. [10]

    Latent Dirichlet Allocation Based Organisation of Broadcast Media Archives for Deep Neural Network Adaptation

    M. Doulaty, O. Saz, R. W. M. Ng, and T. Hain, “Latent Dirichlet Allocation Based Organisation of Broadcast Media Archives for Deep Neural Network Adaptation,” in Proc. of ASRU, Arizona, USA, 2015

  11. [11]

    Automatic Genre and Show Identification of Broadcast Media

    M. Doulaty, O. Saz, R. W. M. Ng, and T. Hain, “Automatic Genre and Show Identification of Broadcast Media,” in Proc. of Interspeech, California, USA, 2016

  12. [12]

    Acoustic topic model for audio information retrieval

    S. Kim, S. Narayanan, and S. Sundaram, “Acoustic topic model for audio information retrieval,” in Proc. of WASPAA, New Paltz, NY, USA, 2009

  13. [13]

    On-line genre classification of TV programs using audio content

    S. Kim, P. Georgiou, and S. Narayanan, “On-line genre classification of TV programs using audio content,” in Proc. of ICASSP, Vancouver, Canada, 2013

  14. [14]

    Latent Dirichlet Allocation

    D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet Allocation,” Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003

  15. [15]

    Factor analysis for audio-based video genre classification

    M. Rouvier, D. Matrouf, and G. Linares, “Factor analysis for audio-based video genre classification,” in Proc. of Interspeech, Brighton, UK, 2009

  16. [16]

    Scalable modified Kneser-Ney language model estimation

    K. Heafield, I. Pouzyrevsky, J. H. Clark, and P. Koehn, “Scalable modified Kneser-Ney language model estimation,” in Proc. of ACL, Sofia, Bulgaria, 2013

  17. [17]

    GNU Parallel - the command-line power tool

    O. Tange, “GNU Parallel - the command-line power tool,” ;login: The USENIX Magazine, vol. 36, no. 1, pp. 42–47, Feb 2011. [Online]. Available: http://www.gnu.org/s/parallel

  18. [18]

    CMU Sequence-to-Sequence G2P toolkit

    “CMU Sequence-to-Sequence G2P toolkit,” https://github.com/cmusphinx/g2p-seq2seq

  19. [19]

    Purely sequence-trained neural networks for ASR based on lattice-free MMI

    D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequence-trained neural networks for ASR based on lattice-free MMI,” in Proc. of Interspeech, California, USA, 2016

  20. [20]

    The Kaldi speech recognition toolkit

    D. Povey, A. Ghoshal, G. Boulianne, L. Burget et al., “The Kaldi speech recognition toolkit,” IEEE Signal Processing Society, Tech. Rep., 2011

  21. [21]

    The 2015 Sheffield system for transcription of multi-genre broadcast media

    O. Saz, M. Doulaty, S. Deena, R. Milner, R. W. M. Ng, M. Hasan, Y. Liu, and T. Hain, “The 2015 Sheffield system for transcription of multi-genre broadcast media,” in Proc. of ASRU, Arizona, USA, 2015

  22. [22]

    Combining feature and model-based adaptation of RNNLMs for multi-genre broadcast speech recognition

    S. Deena, M. Hasan, M. Doulaty, O. Saz, and T. Hain, “Combining feature and model-based adaptation of RNNLMs for multi-genre broadcast speech recognition,” in Proc. of Interspeech, California, USA, 2016

  23. [23]

    Semi-supervised adaptation of RNNLMs by fine-tuning with domain-specific auxiliary features

    S. Deena, R. W. Ng, P. Madhyastha, L. Specia, and T. Hain, “Semi-supervised adaptation of RNNLMs by fine-tuning with domain-specific auxiliary features,” in Proc. of Interspeech, Stockholm, Sweden, 2017

  24. [24]

    Recurrent neural network language model adaptation for multi-genre broadcast speech recognition and alignment

    S. Deena, M. Hasan, M. Doulaty, O. Saz, and T. Hain, “Recurrent neural network language model adaptation for multi-genre broadcast speech recognition and alignment,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 27, no. 3, pp. 572–582, 2019

  25. [25]

    Lightly supervised alignment of subtitles on multi-genre broadcasts

    O. Saz, S. Deena, M. Doulaty, M. Hasan, B. Khaliq, R. Milner, R. W. M. Ng, J. Olcoz, and T. Hain, “Lightly supervised alignment of subtitles on multi-genre broadcasts,” Multimedia Tools and Applications, vol. 77, no. 23, pp. 30533–30550, 2018

  26. [26]

    Automatic optimization of data perturbation distributions for multi-style training in speech recognition

    M. Doulaty, R. Rose, and O. Siohan, “Automatic optimization of data perturbation distributions for multi-style training in speech recognition,” in Proc. of SLT, California, USA, 2016

  27. [27]

    The AMI meeting corpus: A pre-announcement

    J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal, G. Lathoud, M. Lincoln, A. Lisowska, I. McCowan, W. Post, D. Reidsma, and P. Wellner, “The AMI meeting corpus: A pre-announcement,” in Proc. of MLMI, Bethesda, USA, 2006

  28. [28]

    The ICSI meeting corpus

    A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, and C. Wooters, “The ICSI meeting corpus,” in Proc. of ICASSP, Hong Kong, 2003
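
The excerpt under entry [4] above names the inputs of Algorithm 1 (training-set Dirichlet posteriors, dev-set posterior centroids, and a distance threshold λ) but is cut off before the selection loop. The sketch below is one plausible reading, not the paper's algorithm: it assumes an utterance is kept when its cosine distance to the nearest dev centroid is at most λ. The function names, the choice of cosine distance, and the toy values are all assumptions.

```python
import numpy as np

def cosine_distance(p, q):
    """1 - cosine similarity; insensitive to the scale of the posteriors."""
    return 1.0 - float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))

def select_with_threshold(train_posteriors, dev_centroids, lam):
    """Return indices of training utterances whose Dirichlet posterior lies
    within distance lam of at least one dev-set posterior centroid."""
    selected = []
    for m, gamma in enumerate(train_posteriors):
        # Distance from this utterance to its nearest dev centroid.
        nearest = min(cosine_distance(gamma, c) for c in dev_centroids)
        if nearest <= lam:
            selected.append(m)
    return selected

# Invented Dirichlet posterior parameters: M = 4 utterances, N = 2 centroids.
gamma_trn = np.array([[5.0, 1.0, 1.0],
                      [1.0, 5.0, 1.0],
                      [1.0, 1.0, 5.0],
                      [4.0, 2.0, 1.0]])
gamma_dev = np.array([[6.0, 1.0, 1.0],
                      [1.0, 1.0, 6.0]])
new_set = select_with_threshold(gamma_trn, gamma_dev, lam=0.1)
print(new_set)
```

Raising λ admits more of the pool and lowering it tightens the match, so in this reading the threshold directly controls the size of the selected training set.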