Learn Spelling from Teachers: Transferring Knowledge from Language Models to Sequence-to-Sequence Speech Recognition
Pith reviewed 2026-05-24 22:10 UTC · model grok-4.3
The pith
A language model teaches a sequence-to-sequence speech recognizer through soft labels during training only.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A recurrent neural network language model trained on large-scale external text produces soft labels that supervise the training of a sequence-to-sequence speech recognition model, so the language model acts as a teacher. The resulting recognizer achieves a character error rate of 9.3 percent on AISHELL-1, an 18.42 percent relative reduction compared with the vanilla sequence-to-sequence baseline, without any external component added during testing.
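The two figures in the claim can be cross-checked with simple arithmetic. The sketch below is an illustration only: the function names are hypothetical, and the baseline of roughly 11.4 percent is not stated in the text above; it is what the quoted 9.3 percent and 18.42 percent jointly imply.

```python
def relative_reduction(baseline_cer: float, new_cer: float) -> float:
    """Relative CER reduction, expressed as a fraction of the baseline."""
    return (baseline_cer - new_cer) / baseline_cer


def implied_baseline(new_cer: float, rel_reduction: float) -> float:
    """Baseline CER implied by a new CER and a relative reduction."""
    return new_cer / (1.0 - rel_reduction)


# Figures quoted in the claim: 9.3 % CER, 18.42 % relative reduction.
baseline = implied_baseline(9.3, 0.1842)
print(round(baseline, 1))                              # implied baseline CER, in percent
print(round(100 * relative_reduction(11.4, 9.3), 2))   # relative reduction, in percent
```

Under these numbers the absolute gap is about 2.1 percentage points, which is the quantity any replication would need to reproduce.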
What carries the argument
Knowledge distillation in which the external RNN language model supplies soft probability targets to guide the sequence-to-sequence model's training.
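As a rough illustration of this mechanism, a per-token distillation loss can be sketched as a weighted sum of two cross-entropies: one against the hard transcription label and one against the language model's soft distribution. The weight `lam`, the `temperature` applied to the teacher, and all function names are assumptions made for this sketch; the text above does not specify the exact loss used in the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, student_probs):
    """Cross-entropy H(target, student); zero-probability targets are skipped."""
    return -sum(t * math.log(s)
                for t, s in zip(target_probs, student_probs) if t > 0.0)


def distillation_loss(student_logits, teacher_logits, gold_index,
                      lam=0.5, temperature=1.0):
    """(1 - lam) * hard-label loss + lam * soft-label loss for one token.

    student_logits: seq2seq decoder outputs over the vocabulary.
    teacher_logits: language-model outputs over the same vocabulary.
    lam and temperature are hypothetical hyperparameters for this sketch.
    """
    student = softmax(student_logits)
    teacher = softmax(teacher_logits, temperature)
    hard = [1.0 if i == gold_index else 0.0 for i in range(len(student))]
    return ((1.0 - lam) * cross_entropy(hard, student)
            + lam * cross_entropy(teacher, student))
```

With `lam=0` this reduces to ordinary cross-entropy training, which is one way to think of the vanilla baseline; the teacher is only consulted during training, so nothing here runs at inference time.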
If this is right
- The sequence-to-sequence model reaches lower character error rates while keeping its test-time architecture and speed unchanged.
- The distilled model can be combined with shallow fusion at decoding time for additional gains.
- No external language model or fusion network needs to be loaded or run during inference.
- The same training procedure can be applied to other sequence-to-sequence tasks that benefit from external text knowledge.
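For contrast with the training-only route, shallow fusion combines the two models at decoding time by adding a weighted language-model log-probability to the recognizer's log-probability for each candidate token. The sketch below uses made-up toy probabilities and a hypothetical `lm_weight`; it is not the paper's decoder.

```python
import math


def shallow_fusion_score(s2s_log_prob, lm_log_prob, lm_weight=0.3):
    """Log-linear combination used when ranking a candidate token."""
    return s2s_log_prob + lm_weight * lm_log_prob


def pick_next_token(s2s_log_probs, lm_log_probs, lm_weight=0.3):
    """Return the token with the highest fused score."""
    scores = {
        tok: shallow_fusion_score(lp, lm_log_probs[tok], lm_weight)
        for tok, lp in s2s_log_probs.items()
    }
    return max(scores, key=scores.get)


# Toy distributions over three hypothetical tokens (illustrative values only).
s2s = {"a": math.log(0.5), "b": math.log(0.4), "c": math.log(0.1)}
lm = {"a": math.log(0.1), "b": math.log(0.8), "c": math.log(0.1)}

print(pick_next_token(s2s, lm, lm_weight=0.0))  # recognizer alone
print(pick_next_token(s2s, lm, lm_weight=0.3))  # recognizer plus LM
```

The fused variant needs the language model loaded at inference, which is exactly the cost the distilled model avoids when used on its own.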
Where Pith is reading between the lines
- The method may reduce engineering effort needed to maintain separate language-model components in deployed systems.
- Similar distillation could be tested on non-Chinese languages or larger speech corpora to check consistency of the gains.
- If the soft-label supervision generalizes, it might serve as a lighter alternative to explicit fusion networks in other multimodal sequence tasks.
Load-bearing premise
The reported error-rate reduction is produced by the distillation process itself rather than by any unstated differences in training procedure or data handling between the proposed system and the vanilla baseline.
What would settle it
Run the identical training schedule, optimizer, data splits, and hyperparameters on both the vanilla sequence-to-sequence model and the distilled version, then measure whether the character error rate difference remains.
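One lightweight way to enforce such a comparison is to diff the two training configurations and require that they differ only in distillation-specific settings. The sketch below assumes configurations are flat dictionaries; every key name and value is a hypothetical placeholder, not taken from the paper.

```python
_MISSING = object()  # sentinel so a missing key differs from an explicit None


def differing_keys(config_a: dict, config_b: dict) -> set:
    """Keys whose values differ, or that appear in only one configuration."""
    keys = set(config_a) | set(config_b)
    return {k for k in keys
            if config_a.get(k, _MISSING) != config_b.get(k, _MISSING)}


def is_controlled(baseline_cfg: dict, distilled_cfg: dict,
                  allowed=frozenset({"distill_weight"})) -> bool:
    """True when the two runs differ only in the allowed keys."""
    return differing_keys(baseline_cfg, distilled_cfg) <= set(allowed)


# Placeholder configurations for illustration only.
baseline_cfg = {"optimizer": "adam", "batch_size": 32, "seed": 0}
distilled_cfg = dict(baseline_cfg, distill_weight=0.5)

print(is_controlled(baseline_cfg, distilled_cfg))
```

A check of this kind only covers what is written in the configuration; data handling and preprocessing differences would still need to be verified separately.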
Original abstract
Integrating an external language model into a sequence-to-sequence speech recognition system is non-trivial. Previous works utilize linear interpolation or a fusion network to integrate external language models. However, these approaches introduce external components, and increase decoding computation. In this paper, we instead propose a knowledge distillation based training approach to integrating external language models into a sequence-to-sequence model. A recurrent neural network language model, which is trained on large scale external text, generates soft labels to guide the sequence-to-sequence model training. Thus, the language model plays the role of the teacher. This approach does not add any external component to the sequence-to-sequence model during testing. And this approach is flexible to be combined with shallow fusion technique together for decoding. The experiments are conducted on public Chinese datasets AISHELL-1 and CLMAD. Our approach achieves a character error rate of 9.3%, which is relatively reduced by 18.42% compared with the vanilla sequence-to-sequence model.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a knowledge-distillation training procedure in which soft labels from a pre-trained RNN language model (trained on external text) supervise a sequence-to-sequence ASR model. The resulting seq2seq model incorporates LM knowledge without any additional components or increased computation at inference time and can optionally be combined with shallow fusion. On the AISHELL-1 and CLMAD Chinese datasets the method is reported to reach 9.3 % CER, an 18.42 % relative reduction versus a vanilla seq2seq baseline.
Significance. If the empirical claim is reproducible, the approach supplies a training-only route for injecting large-scale text-derived language-model knowledge into end-to-end ASR without altering the decoder graph or latency. This is potentially attractive for production systems that already use seq2seq models. The paper does not, however, supply the experimental protocol, hyper-parameter tables, or statistical tests needed to confirm that the reported gain is attributable to distillation rather than uncontrolled differences in training procedure.
major comments (2)
- [Abstract] The central claim of an 18.42 % relative CER reduction is presented without any statement that the vanilla seq2seq baseline and the KD-trained model share identical architecture, optimizer, learning-rate schedule, batch size, data splits, or regularization. Because these factors are not shown to be controlled, the observed gap cannot be unambiguously attributed to the knowledge-distillation process.
- [Abstract] No error bars, statistical significance tests, or number of runs are supplied for the 9.3 % CER figure, making it impossible to judge whether the reported improvement exceeds normal training variability.
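A minimal sketch of the kind of reporting the second comment asks for is given below, assuming several training runs per system were available. The function names and the threshold `k` are hypothetical, the comparison is a crude heuristic rather than a formal significance test, and the numbers in the usage lines are invented placeholders, not results from the paper.

```python
import statistics


def summarize_runs(cers):
    """Mean and sample standard deviation of CER over repeated runs."""
    mean = statistics.mean(cers)
    std = statistics.stdev(cers) if len(cers) > 1 else float("nan")
    return mean, std


def gap_exceeds_noise(baseline_cers, distilled_cers, k=2.0):
    """Heuristic: is the mean gap larger than k times the larger run-to-run std?

    Returns False when either system has fewer than two runs, since no
    spread can be estimated in that case.
    """
    b_mean, b_std = summarize_runs(baseline_cers)
    d_mean, d_std = summarize_runs(distilled_cers)
    return (b_mean - d_mean) > k * max(b_std, d_std)


# Placeholder CER values (percent) for illustration only.
print(summarize_runs([11.2, 11.4, 11.6]))
print(gap_exceeds_noise([11.2, 11.4, 11.6], [9.2, 9.3, 9.4]))
```

With single runs, as the authors report, no spread can be estimated and the function returns False by construction.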
minor comments (1)
- [Abstract] The abstract states that experiments were conducted on both AISHELL-1 and CLMAD yet reports a numeric result only for AISHELL-1; results on CLMAD should be added for completeness.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We provide point-by-point responses to the major comments below and will revise the abstract and manuscript as indicated to improve clarity regarding our experimental setup.
- Referee: [Abstract] The central claim of an 18.42 % relative CER reduction is presented without any statement that the vanilla seq2seq baseline and the KD-trained model share identical architecture, optimizer, learning-rate schedule, batch size, data splits, or regularization. Because these factors are not shown to be controlled, the observed gap cannot be unambiguously attributed to the knowledge-distillation process.
  Authors: We agree that the abstract would benefit from an explicit statement confirming identical training conditions. The full paper specifies that the vanilla seq2seq baseline and the knowledge-distilled model employ the exact same architecture, optimizer (Adam), learning-rate schedule, batch size, data splits, and regularization techniques, with the sole difference being the addition of the distillation loss term. We will revise the abstract to include this clarification, thereby strengthening the attribution of the observed improvement to the proposed method.
  Revision: yes
- Referee: [Abstract] No error bars, statistical significance tests, or number of runs are supplied for the 9.3 % CER figure, making it impossible to judge whether the reported improvement exceeds normal training variability.
  Authors: We recognize the value of multiple runs and statistical analysis for assessing result reliability. Our reported figures are based on single training runs owing to the substantial computational resources required for these experiments. We will update the manuscript to clearly indicate that results are from single runs and discuss this as a limitation. This constitutes a partial revision as we cannot provide error bars or significance tests without conducting additional training runs.
  Revision: partial
Circularity Check
No circularity; purely empirical result with no derivation chain
The paper reports an empirical outcome: a knowledge-distillation training procedure yields 9.3% CER (18.42% relative reduction) on public Chinese datasets. No equations, mathematical derivations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the provided text. The central claim rests on measured performance differences between a vanilla seq2seq baseline and the KD-trained model; any attribution questions fall under experimental controls rather than circularity. This matches the reader's 0.0 assessment and the default expectation that most empirical papers contain no circularity.
Forward citations
Cited by 1 Pith paper
- Non-Intrusive Automatic Speech Recognition Refinement: A Survey
  A survey that classifies non-intrusive ASR refinement methods into five categories, reviews domain adaptation and evaluation datasets, proposes standardized metrics, and identifies future research directions.
Reference graph
Works this paper leans on
- [1] Introduction. Attention based sequence-to-sequence (Seq2Seq) models have achieved promising performance in automatic speech recognition (ASR) [1, 2, 3, 4]. A Seq2Seq model consists of two components: an encoder encodes the acoustic feature sequence into a high level representation, and a decoder generates the corresponding word sequence. The encoder ...
- [2] Background: Seq2Seq models for ASR. A basic Seq2Seq model is shown in Fig. 1(a). First, a speech signal is processed into an acoustic feature sequence. Then, an ... (Footnotes: 1 http://openslr.org/33/, 2 http://openslr.org/55/. Preprint: arXiv:1907.06017v1 [eess.AS], 13 Jul 2019. Fig. 1 labels: Feature Extraction, Encoder, Acoustic Representation, RNNLM, Decoder, Softmax, Loss, y_{t-1}, y_t.) ...
- [3] Distilling knowledge from external LMs. The basic idea of “Learn Spelling from Teachers” (LST) is: first, train an RNNLM on an external large scale text corpus, and then use this RNNLM to guide Seq2Seq model training. Besides 1-of-K hard labels provided by the transcriptions, the RNNLM provides soft labels, which carries the knowledge of the text corpus. Th...
- [4] Related work. Knowledge distillation. KD was proposed for model compression [9]. It is also referred to as teacher-student learning. Yoon et al. proposed to use KD to reduce the size of a Seq2Seq model for machine translation [13]. It has also been used for domain adaptation for acoustic models [14] and language models [15]. Different from these work, ou...
- [5] Experiments. 5.1. Datasets. We use a Chinese corpus AISHELL-1 to evaluate our proposed approach [10]. The training set contains 150 hours of speech (120,098 utterances) recorded by 340 speakers. The development set contains 20 hours of speech (14,326 utterances) recorded by 40 speakers. And the test set contains 10 hours of speech (7,176 utterances) ...
- [6] Conclusions. In this paper, we propose LST training approach to integrating an external RNNLM into a Seq2Seq model. An RNNLM is first trained on large scale external text data. Then, the RNNLM provides soft labels of training transcriptions to train the Seq2Seq model. We used transformer based Seq2Seq as backbone, and conducted experiments on public avail...
- [7] Acknowledgements. This work is supported by the National Key Research & Development Plan of China (No.2017YFB1002801), the National Natural Science Foundation of China (NSFC) (No.61425017, No.61831022, No.61773379, No.61603390), the Strategic Priority Research Program of Chinese Academy of Sciences (No.XDC02050100), and Inria-CAS Joint Research Project...
- [8] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, “End-to-end attention-based large vocabulary speech recognition,” international conference on acoustics, speech, and signal processing, pp. 4945–4949, 2016.
- [9] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4960–4964.
- [10] L. Dong, S. Xu, and B. Xu, “Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5884–5888.
- [11] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4774–4778.
- [12] C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H. Lin, F. Bougares, H. Schwenk, and Y. Bengio, “On using monolingual corpora in neural machine translation,” arXiv: Computation and Language, 2015.
- [13] A. Kannan, Y. Wu, P. Nguyen, T. N. Sainath, Z. Chen, and R. Prabhavalkar, “An analysis of incorporating an external language model into a sequence-to-sequence model,” pp. 5824–5828, 2018.
- [14] A. Sriram, H. Jun, S. Satheesh, and A. Coates, “Cold fusion: Training seq2seq models together with language models,” pp. 387–391, 2018.
- [15] S. Changhao, C. Wen, G. Wang, D. Su, M. Luo, and D. Yu, “Component fusion: Learning replaceable language model component for end-to-end speech recognition system,” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.
- [16] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
- [17] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, “AIShell-1: An open-source mandarin speech corpus and a speech recognition baseline,” in 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA). IEEE, 2017, pp. 1–5.
- [18] Y. Bai, J. Tao, J. Yi, Z. Wen, and C. Fan, “CLMAD: A chinese language model adaptation dataset,” in The Eleventh International Symposium on Chinese Spoken Language Processing (ISCSLP 2018), 2018.
- [19] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Advances in neural information processing systems, 2015, pp. 577–585.
- [20] Y. Kim and A. M. Rush, “Sequence-level knowledge distillation,” arXiv preprint arXiv:1606.07947, 2016.
- [21] J. Li, M. L. Seltzer, X. Wang, R. Zhao, and Y. Gong, “Large-scale domain adaptation via teacher-student learning,” arXiv preprint arXiv:1708.05466, 2017.
- [22] J. Andrés-Ferrer, N. Bodenstab, and P. Vozila, “Efficient language model adaptation with noise contrastive estimation and kullback-leibler regularization,” Proc. Interspeech 2018, pp. 3368–3372, 2018.
- [23] J. Chorowski and N. Jaitly, “Towards better decoding and language model integration in sequence to sequence models,” Proc. Interspeech 2017, pp. 523–527, 2017.
- [24] G. Pereyra, G. Tucker, J. Chorowski, Ł. Kaiser, and G. Hinton, “Regularizing neural networks by penalizing confident output distributions,” arXiv preprint arXiv:1701.06548, 2017.
- [25] J. Li and M. Sun, “Scalable term selection for text categorization,” in Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007.
- [26] A. Rousseau, “Xenc: An open-source tool for data selection in natural language processing,” The Prague Bulletin of Mathematical Linguistics, vol. 100, pp. 73–82, 2013.
- [27] R. C. Moore and W. Lewis, “Intelligent selection of language model training data,” in Proceedings of the ACL 2010 conference short papers. Association for Computational Linguistics, 2010, pp. 220–224.
- [28] S. Zhou, L. Dong, S. Xu, and B. Xu, “Syllable-based sequence-to-sequence speech recognition with the transformer in mandarin chinese,” arXiv preprint arXiv:1804.10752, 2018.
- [29] O. Press and L. Wolf, “Using the output embedding to improve language models,” arXiv preprint arXiv:1608.05859, 2016.