Teach an all-rounder with experts in different domains

Dan Su; Dong Yu; Zhao You

arxiv: 1907.05698 · v1 · pith:VXHRZ4BMnew · submitted 2019-07-09 · 📡 eess.AS · cs.CL · cs.SD

Pith reviewed 2026-05-25 00:01 UTC · model grok-4.3

keywords automatic speech recognition · teacher-student training · multi-domain modeling · domain adaptation · acoustic model · DFSMN · LSTM · character error rate

The pith

A single student model taught simultaneously by multiple domain-specific teachers outperforms both multi-condition and domain-dependent models in automatic speech recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes training one all-rounder acoustic model by first creating domain-specific teacher models through fine-tuning a shared multi-condition model on each domain's data subset. These teachers then jointly instruct a single student model to handle all domains at once. Experiments cover two setups, one with varying speaking styles and one with near-field, far-field, and noisy conditions, using both DFSMN and LSTM architectures. The student model beats the baseline multi-condition model by up to 10.4 percent relative character error rate reduction and also surpasses the individual domain-dependent models.
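For readers unfamiliar with the metric, the following is a minimal Python sketch of how a character error rate (CER) and a relative CER reduction are typically computed. It is not taken from the paper: the function names and the two CER values are hypothetical, chosen only so that the arithmetic lands on a 10.4 percent relative reduction.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate in percent: edit distance over reference length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def relative_cer_reduction(baseline_cer, new_cer):
    """Relative reduction in percent of new_cer with respect to baseline_cer."""
    return 100.0 * (baseline_cer - new_cer) / baseline_cer


# Hypothetical absolute CERs (NOT from the paper), used only for the arithmetic.
baseline_cer = 12.5
student_cer = 11.2
reduction = round(relative_cer_reduction(baseline_cer, student_cer), 1)
```

The point of the example is that a relative figure says nothing about the absolute error rates behind it, which is why the per-domain absolute numbers matter for the claims discussed further down.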

Core claim

By applying the teacher-student framework, domain-dependent teacher models fine-tuned on specific subsets teach one student model simultaneously, producing a multi-domain acoustic model that outperforms both the original multi-condition model and the separate domain-dependent models on the tested domain setups.

What carries the argument

The teacher-student training process in which multiple domain-specific teachers simultaneously guide training of one shared student model.
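As a rough illustration of this mechanism, the sketch below computes a distillation loss over a mixed minibatch in which each sample is scored against the soft targets of the teacher for its own domain. This is a minimal sketch and not the authors' implementation: the function names, the toy posteriors, and the use of plain cross-entropy against teacher posteriors are assumptions made for illustration.

```python
import math


def soft_target_cross_entropy(student_probs, teacher_probs, eps=1e-12):
    """Cross-entropy of the student's posterior against a teacher's soft targets."""
    return -sum(t * math.log(s + eps)
                for s, t in zip(student_probs, teacher_probs))


def multi_domain_ts_loss(batch, teachers, student):
    """Average loss over a minibatch that may mix samples from several domains.

    batch    : list of (features, domain_id) pairs
    teachers : dict mapping domain_id to a callable returning a posterior
    student  : callable returning a posterior
    All three are hypothetical stand-ins for real acoustic models.
    """
    total = 0.0
    for features, domain_id in batch:
        # Soft targets come from the teacher that matches the sample's domain.
        teacher_probs = teachers[domain_id](features)
        student_probs = student(features)
        total += soft_target_cross_entropy(student_probs, teacher_probs)
    return total / len(batch)


# Toy models with made-up posteriors over three output classes.
teachers = {
    "near_field": lambda x: [0.8, 0.1, 0.1],
    "far_field": lambda x: [0.2, 0.7, 0.1],
}
student = lambda x: [0.5, 0.3, 0.2]

batch = [([0.0], "near_field"), ([0.0], "far_field")]
loss = multi_domain_ts_loss(batch, teachers, student)
```

In a real system the gradient of this loss would update only the student, while the teachers stay fixed; the sketch leaves optimization out entirely.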

If this is right

The student model handles multiple domains with one set of parameters instead of maintaining separate models.
The method works across different model types including DFSMN and LSTM.
Gains appear in both speaking-style domains and acoustic-condition domains.
The student exceeds the performance of any single teacher model on its own domain.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This simultaneous teaching could allow easier scaling to additional domains without retraining everything from scratch.
The approach might reduce deployment complexity by replacing multiple domain-specific models with one.
Knowledge transfer across domains via teachers could be tested in other sequence modeling tasks beyond speech.

Load-bearing premise

Fine-tuning separate teachers on domain subsets and then using them to train one student will improve performance across domains without the teachers passing on biases or limitations from their own training.

What would settle it

A direct comparison showing that a model trained on the combined data from all domains achieves equal or lower error rates than the student model taught by the domain teachers would falsify the central claim.

Original abstract

In many automatic speech recognition (ASR) tasks, an ideal model has to be applicable over multiple domains. In this paper, we propose to teach an all-rounder with experts in different domains. Concretely, we build a multi-domain acoustic model by applying the teacher-student training framework. First, for each domain, a teacher model (domain-dependent model) is trained by fine-tuning a multi-condition model with domain-specific subset. Then all these teacher models are used to teach one single student model simultaneously. We perform experiments on two predefined domain setups. One is domains with different speaking styles, the other is nearfield, far-field and far-field with noise. Moreover, two types of models are examined: deep feedforward sequential memory network (DFSMN) and long short term memory (LSTM). Experimental results show that the model trained with this framework outperforms not only multi-condition model but also domain-dependent model. Specially, our training method provides up to 10.4% relative character error rate improvement over baseline model (multi-condition model).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Multi-teacher distillation gives practical gains for multi-domain ASR, but the claimed outperformance over the domain specialists needs tighter controls to be convincing.

The one thing to know is that this paper describes a multi-teacher distillation method for multi-domain ASR and claims the resulting student beats both the multi-condition model and the individual domain experts by up to 10.4% relative CER. What is new is the simultaneous use of several domain-tuned teachers to train one all-rounder student. The paper applies the standard teacher-student framework in a multi-expert way, first fine-tuning each teacher on its domain data from a shared multi-condition starting point, then having all teachers supervise the student at the same time. They test this on speaking-style domains and on acoustic condition domains (nearfield, farfield, noisy farfield), using DFSMN and LSTM models.

The experiments are concrete and the gains are reported clearly. The paper does well at showing a practical training recipe that improves over the obvious baselines. The fact that they check two different domain definitions and two architectures adds some robustness to the findings.

The soft spot is exactly the one raised in the stress test. Because each teacher is fine-tuned on a subset, it is possible that the teachers lose some general capability or pick up subset-specific biases. If that happens, the student's advantage over the domain models could be an artifact rather than a real improvement in generalization. The abstract gives no breakdown of error rates per domain to confirm the student wins on every domain against its specialist teacher, and there are no ablations that test the fine-tuning step or train domain models from scratch for comparison. Without those, the claim that the student outperforms the domain-dependent models rests on thinner evidence than the claim against the multi-condition baseline.

This paper is for people who build ASR systems that must handle several domains at once and want a single model rather than switching between specialists. A practitioner looking for a distillation-based way to combine domain data would find it worth reading and trying. It is not a foundational advance, but the empirical results are solid enough on their own terms that it deserves a serious referee who can check the full experimental details and ask for the missing controls. I would recommend sending it to peer review.

Referee Report

3 major / 2 minor

Summary. The paper proposes a teacher-student framework for multi-domain ASR: domain-specific teachers are obtained by fine-tuning a shared multi-condition model on domain subsets, after which the teachers simultaneously distill to one student model intended as an all-rounder. Experiments on two domain partitions (speaking-style domains; near/far-field plus noise) with DFSMN and LSTM architectures report that the student exceeds both the multi-condition baseline and the individual domain-dependent teachers, with a maximum 10.4 % relative CER reduction.

Significance. If the per-domain superiority claim holds after proper controls, the method supplies a practical recipe for a single deployable model that avoids the overhead of maintaining separate specialists while still improving on each specialist's native domain. The contribution is empirical and incremental rather than theoretical; its value rests entirely on the strength of the experimental evidence.

major comments (3)

[Abstract / Experimental results] Abstract and results section: the central claim that the student outperforms each domain-dependent teacher on that teacher's native domain is unsupported by any per-domain table or figure; only aggregate relative gains are stated, so it is impossible to verify whether the student actually beats the specialists rather than merely averaging their biases.
[Method] Method description: teachers are created by fine-tuning the multi-condition model on domain subsets, yet no ablation that (a) removes the fine-tuning step or (b) compares against domain models trained from scratch is reported. Without these controls the 10.4 % gain cannot be attributed to the proposed framework rather than to teacher overfitting or data-partition effects.
[Experiments] Experimental setup: the abstract and implied results give no dataset sizes, no statistical significance tests, and no description of how domain subsets were constructed or whether they overlap, all of which are load-bearing for assessing whether the reported improvements are reliable or confounded.

minor comments (2)

[Abstract] Abstract: 'Specially' should read 'Specifically'.
[Results] The paper should supply at least one table listing absolute CER for multi-condition, each teacher, and the student on every domain so readers can directly inspect the per-domain comparisons.
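The per-domain check requested here could be as simple as the sketch below, which flags any domain where the student fails to beat both the multi-condition baseline and that domain's teacher. The table layout, the domain names, and every CER value are hypothetical placeholders, not results from the paper.

```python
def domains_not_won_by_student(cer_table):
    """Return the domains where the student's CER is not strictly the lowest.

    cer_table maps a domain name to a dict with the absolute CER (in percent)
    of the multi-condition model, the domain teacher, and the student.
    """
    flagged = []
    for domain, row in cer_table.items():
        best_reference = min(row["multi_condition"], row["teacher"])
        if row["student"] >= best_reference:
            flagged.append(domain)
    return flagged


# Hypothetical numbers for illustration only (NOT from the paper).
hypothetical_cer = {
    "near_field": {"multi_condition": 10.0, "teacher": 9.6, "student": 9.3},
    "far_field": {"multi_condition": 18.0, "teacher": 17.1, "student": 17.4},
}
flagged = domains_not_won_by_student(hypothetical_cer)
```

With these made-up values the student wins on the first domain but not on the second, which is the kind of case an aggregate relative gain would hide.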

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful review and constructive suggestions. We address each major comment in detail below and will make revisions to the manuscript to improve clarity and provide additional supporting evidence.

Referee: [Abstract / Experimental results] Abstract and results section: the central claim that the student outperforms each domain-dependent teacher on that teacher's native domain is unsupported by any per-domain table or figure; only aggregate relative gains are stated, so it is impossible to verify whether the student actually beats the specialists rather than merely averaging their biases.

Authors: The results section of the manuscript includes tables with per-domain CER for both speaking-style domains and near/far-field setups. These tables demonstrate that the student model achieves better performance than each individual domain-dependent teacher on the corresponding domain. To address the concern, we will revise the paper to include a dedicated table that explicitly lists the CER for the multi-condition baseline, each teacher, and the student on each domain, highlighting the per-domain improvements. revision: yes
Referee: [Method] Method description: teachers are created by fine-tuning the multi-condition model on domain subsets, yet no ablation that (a) removes the fine-tuning step or (b) compares against domain models trained from scratch is reported. Without these controls the 10.4 % gain cannot be attributed to the proposed framework rather than to teacher overfitting or data-partition effects.

Authors: We agree that additional ablations would help attribute the gains more clearly to the multi-teacher distillation framework. In the revised manuscript, we will include an ablation study where domain-specific models are trained from scratch on the domain subsets (without fine-tuning from the multi-condition model) and then used as teachers for distillation. For removing the fine-tuning step, note that the domain-dependent teachers are defined as the fine-tuned models; however, we can compare against using the unfine-tuned multi-condition model as a single teacher. We will report these results to strengthen the evidence. revision: yes
Referee: [Experiments] Experimental setup: the abstract and implied results give no dataset sizes, no statistical significance tests, and no description of how domain subsets were constructed or whether they overlap, all of which are load-bearing for assessing whether the reported improvements are reliable or confounded.

Authors: We will expand the experimental setup section to include the sizes of the training and test sets for each domain, a detailed description of the domain partition construction (including any overlap between subsets), and the results of statistical significance testing (e.g., using the bootstrap method or McNemar's test on the error rates). These details were present in our internal experimental records but omitted from the initial submission for space reasons. revision: yes
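One of the tests mentioned in this response, McNemar's test, can be sketched in a few lines for a paired comparison of two systems on the same test utterances. This is an illustrative exact (binomial) variant written for this page; the function name and the counts are hypothetical, and the response does not say which variant or which unit of comparison would be used.

```python
from math import comb


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value from the discordant counts.

    b : utterances that only system A recognized incorrectly
    c : utterances that only system B recognized incorrectly
    Under the null hypothesis each discordant utterance is equally likely
    to fall on either side, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical discordant counts (NOT from the paper).
p_value = mcnemar_exact_p(12, 30)
```

Utterances that both systems get right or both get wrong do not enter the test, so it only needs the two discordant counts.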

Circularity Check

0 steps flagged

No circularity: purely empirical training procedure with no derivations or fitted predictions

The paper describes an empirical teacher-student training method for multi-domain ASR without any mathematical derivations, equations, or first-principles claims. Teachers are fine-tuned on domain subsets and used to distill a student; performance is measured via character error rate on held-out data. No step reduces a claimed result to its inputs by construction, renames a fit as a prediction, or relies on self-citation for uniqueness. The central claim rests on experimental comparisons, which are externally falsifiable and independent of any internal reduction. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on specific parameters, axioms, or new entities. Standard assumptions in deep learning for ASR apply.

pith-pipeline@v0.9.0 · 5712 in / 973 out tokens · 22299 ms · 2026-05-25T00:01:53.063909+00:00 · methodology

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 2 internal anchors

[1]

INTRODUCTION Thanks to deep learning approaches [1, 2], great progress has been made in automatic speech recognition performance. Although deep neural networks have superior robustness over GMM systems on different conditions such as speaker, recording channel and acoustic environment [3], domain robustness is still a challenging problem. First, it is i...

[2]

MULTI-DOMAIN TEACHER-STUDENT TRAINING 2.1. Teacher-student training In the teacher-student training framework, [12, 13, 14] have shown that it is possible to train an student model to match the output distribution of a teacher model. Specially, the student model can be learned via single teacher network or multiple teacher networks. Details of the two lea...

[3]

Dn denotes the n-th domain

We pool data from multiple application domains. Dn denotes the n-th domain. Then, we train a multi-condition model with minibatches samples which are chosen randomly from the pooled set

[4]

Tn denotes the n-th teacher model which is trained with the n-th domain data

Domain-dependent teacher models are produced by fine-tuning the multi-condition model with domain-dependent data respectively. Tn denotes the n-th teacher model which is trained with the n-th domain data

[5]

During the training process, samples in one minibatch are chosen randomly from the mixed data set, and may come from different domains

The proposed model is learned from these N domain-dependent teacher models. During the training process, samples in one minibatch are chosen randomly from the mixed data set, and may come from different domains. The train process exploits each sample for training by using the soft targets produced from its corresponding domain-dependent teacher model...

[6]

Training setup The feature vectors used in all the experiments are 40-dimensional log-mel filterbank energy features appended with the first and second order derivatives

EXPERIMENTAL SETUP 3.1. Training setup The feature vectors used in all the experiments are 40-dimensional log-mel filterbank energy features appended with the first and second order derivatives. Log-mel filterbank energy features are computed with a 25ms window and shifted every 10ms. We stack 8 consecutive frames and subsample the input frames with 3. ...

[7]

EXPERIMENTAL RESULTS In this work, we evaluate the performance of the proposed method on several large vocabulary Mandarin speech recognition tasks including near-field speech and far-field speech as described in section 3.2. 4.1. Mixed speaking style corpus For the first set of experiments, we validate the effectiveness of the proposed method by dealing w...

[8]

We explore this method for acoustic modeling on two different tasks

CONCLUSIONS AND FUTURE WORKS In this paper, we propose a multi-domain teacher-student training method for teaching an all-rounder with experts in different domains. We explore this method for acoustic modeling on two different tasks. We find that the model trained by this method not only outperforms multi-condition model but also outperforms the domain-d...

[9]

Thus, we will explore this training strategy to improve the performance of LSTM models in the future work

has shown that combining intermediate representations of multiple teacher networks can significantly improve the student network’s performance. Thus, we will explore this training strategy to improve the performance of LSTM models in the future work

[10]

Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,

G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” in IEEE Transactions on audio, speech, and language processing. IEEE, 2012, vol. 20, pp. 30–42

[11]

Recent progresses in deep learning based acoustic models,

D. Yu and J. Li, “Recent progresses in deep learning based acoustic models,” in IEEE/CAA Journal of Automatica Sinica. IEEE, 2017, vol. 4, pp. 396–409

[12]

A comparative analytic study on the gaussian mixture and context dependent deep neural network hidden markov models,

Y. Huang, D. Yu, C. Liu, and Y. Gong, “A comparative analytic study on the gaussian mixture and context dependent deep neural network hidden markov models,” in INTERSPEECH. ISCA, 2014

[13]

Speaker stress-resistant continuous speech recognition,

B. B. Paul and E. A. Martin, “Speaker stress-resistant continuous speech recognition,” in International Conference on Acoustics, Speech and Signal Processing. IEEE, 1988

[14]

Tandem connectionist feature extraction for conventional hmm systems,

H. Hermansky, D. P. Ellis, and S. Sharma, “Tandem connectionist feature extraction for conventional hmm systems,” in International Conference on Acoustics, Speech and Signal Processing. IEEE, 2000

[15]

An investigation of deep neural networks for noise robust speech recognition,

M. L. Seltzer, D. Yu, and Y. Wang, “An investigation of deep neural networks for noise robust speech recognition,” in International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013

[16]

Making machines understand us in reverberant rooms,

T. Yoshioka, A. Sehr, M. Delcroix, K. Kinoshita, R. Maas, T. Nakatani, and W. Kellermann, “Making machines understand us in reverberant rooms,” in IEEE Signal Processing Letter. IEEE, 2012

[17]

Speech enhancement with lstm recurrent neural networks and its application to noise-robust asr,

F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. L. Roux, J. R. Hershey, and B. Schuller, “Speech enhancement with lstm recurrent neural networks and its application to noise-robust asr,” in International Conference on Latent Variable Analysis and Single Separation. IEEE, 2015

[18]

Domain adaptation using factorized hidden layer for robust automatic speech recognition,

K. C. Sim, A. Narayanan, A. Misra, A. Tripathi, G. Pundak, T. Sainath, P. Haghani, B. Li, and M. Bacchiani, “Domain adaptation using factorized hidden layer for robust automatic speech recognition,” in INTERSPEECH, 2018

[19]

A study of enhancement, augmentation, and autoencoder methods for domain adaptation in distant speech recognition,

H. Tang, W. N. Hsu, F. Grondin, and J. Glass, “A study of enhancement, augmentation, and autoencoder methods for domain adaptation in distant speech recognition,” in INTERSPEECH, 2018

[20]

Toward domain-invariant speech recognition via large scale training,

A. Narayanan, A. Misra, K. C. Sim, G. Pundak, A. Tripathi, M. Elfeky, P. Haghani, T. Strohman, and M. Bacchiani, “Toward domain-invariant speech recognition via large scale training,” in https://arxiv.org/abs/1808.05312, 2018

[21]

Do deep nets really need to be deep?,

Jimmy Ba and Rich Caruana, “Do deep nets really need to be deep?,” in Advances in neural information processing systems, 2014, pp. 2654–2662

[22]

Learning small-size dnn with output-distribution-based criteria,

Jinyu Li, Rui Zhao, Jui-Ting Huang, and Yifan Gong, “Learning small-size dnn with output-distribution-based criteria,” in Fifteenth annual conference of the international speech communication association, 2014

[23]

Distilling the knowledge in a neural network,

G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” in NIPS Deep Learning and Representation Learning Workshop, 2015

[24]

Distilling knowledge from ensembles of neural networks for speech recognition,

Yevgen Chebotar and Austin Waters, “Distilling knowledge from ensembles of neural networks for speech recognition,” in Interspeech, 2016, pp. 3439–3443

[25]

Learning from multiple teacher networks,

Shan You, Chang Xu, Chao Xu, and Dacheng Tao, “Learning from multiple teacher networks,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017, pp. 1285–1294

[26]

Syllable-based acoustic modeling with ctc-smbr-lstm,

Zhongdi Qu, Parisa Haghani, Eugene Weinstein, and Pedro Moreno, “Syllable-based acoustic modeling with ctc-smbr-lstm,” in Automatic Speech Recognition and Understanding Workshop (ASRU), 2017 IEEE. IEEE, 2017, pp. 173–177

[27]

Image method for efficiently simulating small-room acoustics,

J. B. Allen and D. A. Berkley, “Image method for efficiently simulating small-room acoustics,” in The Journal of the Acoustical Society of America. IEEE, 1979, pp. 943–950

[28]

Learning feature mapping using deep neural network bottleneck features for distant large vocabulary speech recognition,

I. Himawan, P. Motlicek, D. Imseng, B. Potard, N. Kim, and J. Lee, “Learning feature mapping using deep neural network bottleneck features for distant large vocabulary speech recognition,” in International Conference on Acoustics, Speech and Signal Processing, 2015

[29]

Deep-FSMN for Large Vocabulary Continuous Speech Recognition

Shiliang Zhang, Ming Lei, Zhijie Yan, and Lirong Dai, “Deep-fsmn for large vocabulary continuous speech recognition,” arXiv preprint arXiv:1803.05030, 2018

[30]

The kaldi speech recognition toolkit,

Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., “The kaldi speech recognition toolkit,” in IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society, 2011, number EPFL-CONF-192584

[31]

Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering,

K. Chen and Q. Huo, “Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering,” in ICASSP. IEEE, 2016, pp. 5880–5884
