
arxiv: 1907.06017 · v1 · pith:6GEIL3VDnew · submitted 2019-07-13 · 📡 eess.AS · cs.CL

Learn Spelling from Teachers: Transferring Knowledge from Language Models to Sequence-to-Sequence Speech Recognition

Pith reviewed 2026-05-24 22:10 UTC · model grok-4.3

classification 📡 eess.AS cs.CL
keywords knowledge distillation · language model · sequence-to-sequence · speech recognition · AISHELL-1 · character error rate · soft labels

The pith

A language model teaches a sequence-to-sequence speech recognizer through soft labels during training only.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to transfer knowledge from a large external language model into a sequence-to-sequence speech recognition system by using the language model to generate soft target labels that supervise training. This avoids adding any new components or computation at test time while still benefiting from text data the recognizer never sees directly. The approach is also compatible with standard shallow fusion at decoding time. In experiments on the public Chinese datasets AISHELL-1 and CLMAD, the method lowers character error rate to 9.3 percent, an 18.42 percent relative reduction from the vanilla sequence-to-sequence baseline. A reader cares because many production systems want the accuracy boost of external language models without paying extra latency or memory at inference.

Core claim

A recurrent neural network language model trained on large-scale external text produces soft labels that supervise the training of a sequence-to-sequence speech recognition model, allowing the language model to serve as a teacher; the resulting recognizer achieves a character error rate of 9.3 percent on AISHELL-1, an 18.42 percent relative reduction compared with the vanilla sequence-to-sequence baseline, without any external component added during testing.
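The two numbers in the claim can be cross-checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the baseline figure it prints is derived from the two reported numbers rather than quoted from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


def implied_baseline(improved: float, rel_reduction_pct: float) -> float:
    """Baseline value implied by an improved value and a relative reduction (percent)."""
    return improved / (1.0 - rel_reduction_pct / 100.0)


# Reported in the abstract: 9.3 % CER, 18.42 % relative reduction.
baseline_cer = implied_baseline(9.3, 18.42)
print(round(baseline_cer, 1))                           # 11.4 (derived, not quoted)
print(round(relative_reduction(baseline_cer, 9.3), 2))  # 18.42
```

Under this reading, the vanilla baseline would sit at roughly 11.4 percent CER, an absolute gap of about 2.1 points.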

What carries the argument

Knowledge distillation in which the external RNN language model supplies soft probability targets to guide the sequence-to-sequence model's training.
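A minimal sketch of what such a training objective could look like at a single decoding step, written in plain Python. The interpolation weight `lam`, the function names, and the choice to apply the temperature only to the teacher are assumptions made for illustration; the text available here only says that a temperature smooths the language-model outputs and that a Kullback-Leibler divergence between the language-model and Seq2Seq distributions is minimized.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over one list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def lst_step_loss(student_logits, lm_logits, target, lam=0.5, temperature=1.0):
    """One-step loss: lam * CE(hard label) + (1 - lam) * KL(P_LM || P_student).

    `lam` is a hypothetical interpolation weight, not a value from the paper.
    """
    p_student = softmax(student_logits)
    p_lm = softmax(lm_logits, temperature)  # soft labels from the teacher LM
    ce = -math.log(p_student[target])       # hard-label cross-entropy
    kl = sum(q * (math.log(q) - math.log(p))
             for q, p in zip(p_lm, p_student) if q > 0.0)
    return lam * ce + (1.0 - lam) * kl


# Toy vocabulary of three tokens; the ground-truth token has index 0.
print(lst_step_loss([2.0, 0.0, -1.0], [1.0, 1.5, -2.0], target=0))
```

The language model only appears inside the loss, so nothing from it needs to be kept once training ends, which is the property the paper emphasizes.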

If this is right

  • The sequence-to-sequence model reaches lower character error rates while keeping its test-time architecture and speed unchanged.
  • The distilled model can be combined with shallow fusion at decoding time for additional gains.
  • No external language model or fusion network needs to be loaded or run during inference.
  • The same training procedure can be applied to other sequence-to-sequence tasks that benefit from external text knowledge.
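For contrast, shallow fusion is commonly implemented as a log-linear interpolation of the recognizer and language-model scores at each decoding step. The sketch below assumes that form; the dictionary interface and the weight value 0.3 are hypothetical and not taken from the paper.

```python
import math


def shallow_fusion_step(s2s_log_probs, lm_log_probs, lm_weight=0.3):
    """Combine per-token log-probabilities for one decoding step.

    Returns the best token and the fused scores. `lm_weight` is an assumed
    tunable weight on the external language model.
    """
    fused = {tok: s2s_log_probs[tok] + lm_weight * lm_log_probs[tok]
             for tok in s2s_log_probs}
    best = max(fused, key=fused.get)
    return best, fused


# Toy distributions over three tokens.
s2s = {"a": math.log(0.5), "b": math.log(0.4), "c": math.log(0.1)}
lm = {"a": math.log(0.1), "b": math.log(0.8), "c": math.log(0.1)}

best_plain, _ = shallow_fusion_step(s2s, lm, lm_weight=0.0)
best_fused, _ = shallow_fusion_step(s2s, lm, lm_weight=0.3)
print(best_plain, best_fused)  # a b
```

Unlike the distillation route, this requires running the language model for every hypothesis at every step, which is the decoding cost the paper aims to make optional.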

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method may reduce engineering effort needed to maintain separate language-model components in deployed systems.
  • Similar distillation could be tested on non-Chinese languages or larger speech corpora to check consistency of the gains.
  • If the soft-label supervision generalizes, it might serve as a lighter alternative to explicit fusion networks in other multimodal sequence tasks.

Load-bearing premise

The reported error-rate reduction is produced by the distillation process itself rather than by any unstated differences in training procedure or data handling between the proposed system and the vanilla baseline.

What would settle it

Run the identical training schedule, optimizer, data splits, and hyperparameters on both the vanilla sequence-to-sequence model and the distilled version, then measure whether the character error rate difference remains.
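Measuring the difference could be sketched as follows: compute corpus-level CER with an edit distance over characters, then bootstrap over test utterances to get an interval for the gap between two systems. All function names are hypothetical. Note the assumption: this resampling only captures test-set sampling variability, not run-to-run training variability, which would still need repeated training runs.

```python
import random


def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def cer(refs, hyps):
    """Corpus-level character error rate in percent (refs must be non-empty)."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    chars = sum(len(r) for r in refs)
    return 100.0 * errors / chars


def bootstrap_cer_diff(refs, hyps_a, hyps_b, n_resamples=1000, seed=0):
    """Approximate 95% interval for CER(A) - CER(B), resampling utterances."""
    rng = random.Random(seed)
    err_a = [edit_distance(r, h) for r, h in zip(refs, hyps_a)]
    err_b = [edit_distance(r, h) for r, h in zip(refs, hyps_b)]
    lens = [len(r) for r in refs]
    n = len(refs)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        total = sum(lens[i] for i in idx)
        diffs.append(100.0 * sum(err_a[i] - err_b[i] for i in idx) / total)
    diffs.sort()
    return diffs[int(0.025 * n_resamples)], diffs[int(0.975 * n_resamples) - 1]
```

If the interval for baseline minus distilled CER excludes zero, the gap is unlikely to be an artifact of which test utterances happened to be sampled.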

Figures

Figures reproduced from arXiv: 1907.06017 by Jiangyan Yi, Jianhua Tao, Ye Bai, Zhengkun Tian, Zhengqi Wen.

Figure 1
Figure 1: (a) illustrates a basic encoder-decoder architecture for ASR. x1, ..., xt represent acoustic features, ct−1 denotes the context at step t − 1, and yt−1 denotes the previous token. The decoder predicts the current token from the context ct−1, the previous token yt−1, and the acoustic vector generated by the encoder. The loss is computed with the softmax function of the decoder and the current gro…
Figure 2
Figure 2: Hard labels and soft labels at one time step of a sequence for training. The values of the soft labels reflect knowledge of the external LM. … is the history context, and T is a parameter called temperature to smooth the outputs. To make the Seq2Seq model learn the knowledge from the RNNLM, we minimize the Kullback-Leibler divergence (KLD) between estimated probability of the RNNLM PLM and the estimated p…
Figure 3
Figure 3: The loss curves of Seq2Seq model (left) and Seq2Seq model with LST (right). For Seq2Seq model, the training loss is lower than validation loss. However, with LST, the training loss is higher than the validation loss. Moreover, the validation loss in the right figure is a little smaller than the left one. … to improve the performance of Seq2Seq models. Moreover, the model which uses LST and shallow fusion tog…
Original abstract

Integrating an external language model into a sequence-to-sequence speech recognition system is non-trivial. Previous works utilize linear interpolation or a fusion network to integrate external language models. However, these approaches introduce external components, and increase decoding computation. In this paper, we instead propose a knowledge distillation based training approach to integrating external language models into a sequence-to-sequence model. A recurrent neural network language model, which is trained on large scale external text, generates soft labels to guide the sequence-to-sequence model training. Thus, the language model plays the role of the teacher. This approach does not add any external component to the sequence-to-sequence model during testing. And this approach is flexible to be combined with shallow fusion technique together for decoding. The experiments are conducted on public Chinese datasets AISHELL-1 and CLMAD. Our approach achieves a character error rate of 9.3%, which is relatively reduced by 18.42% compared with the vanilla sequence-to-sequence model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a knowledge-distillation training procedure in which soft labels from a pre-trained RNN language model (trained on external text) supervise a sequence-to-sequence ASR model. The resulting seq2seq model incorporates LM knowledge without any additional components or increased computation at inference time and can optionally be combined with shallow fusion. On the AISHELL-1 and CLMAD Chinese datasets the method is reported to reach 9.3 % CER, an 18.42 % relative reduction versus a vanilla seq2seq baseline.

Significance. If the empirical claim is reproducible, the approach supplies a training-only route for injecting large-scale text-derived language-model knowledge into end-to-end ASR without altering the decoder graph or latency. This is potentially attractive for production systems that already use seq2seq models. The paper does not, however, supply the experimental protocol, hyper-parameter tables, or statistical tests needed to confirm that the reported gain is attributable to distillation rather than uncontrolled differences in training procedure.

major comments (2)
  1. [Abstract] The central claim of an 18.42 % relative CER reduction is presented without any statement that the vanilla seq2seq baseline and the KD-trained model share identical architecture, optimizer, learning-rate schedule, batch size, data splits, or regularization. Because these factors are not shown to be controlled, the observed gap cannot be unambiguously attributed to the knowledge-distillation process.
  2. [Abstract] No error bars, statistical significance tests, or number of runs are supplied for the 9.3 % CER figure, making it impossible to judge whether the reported improvement exceeds normal training variability.
minor comments (1)
  1. [Abstract] The abstract states that experiments were conducted on both AISHELL-1 and CLMAD yet reports a numeric result only for AISHELL-1; results on CLMAD should be added for completeness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We provide point-by-point responses to the major comments below and will revise the abstract and manuscript as indicated to improve clarity regarding our experimental setup.

Point-by-point responses
  1. Referee: [Abstract] The central claim of an 18.42 % relative CER reduction is presented without any statement that the vanilla seq2seq baseline and the KD-trained model share identical architecture, optimizer, learning-rate schedule, batch size, data splits, or regularization. Because these factors are not shown to be controlled, the observed gap cannot be unambiguously attributed to the knowledge-distillation process.

    Authors: We agree that the abstract would benefit from an explicit statement confirming identical training conditions. The full paper specifies that the vanilla seq2seq baseline and the knowledge-distilled model employ the exact same architecture, optimizer (Adam), learning-rate schedule, batch size, data splits, and regularization techniques, with the sole difference being the addition of the distillation loss term. We will revise the abstract to include this clarification, thereby strengthening the attribution of the observed improvement to the proposed method. revision: yes

  2. Referee: [Abstract] No error bars, statistical significance tests, or number of runs are supplied for the 9.3 % CER figure, making it impossible to judge whether the reported improvement exceeds normal training variability.

    Authors: We recognize the value of multiple runs and statistical analysis for assessing result reliability. Our reported figures are based on single training runs owing to the substantial computational resources required for these experiments. We will update the manuscript to clearly indicate that results are from single runs and discuss this as a limitation. This constitutes a partial revision as we cannot provide error bars or significance tests without conducting additional training runs. revision: partial

Circularity Check

0 steps flagged

No circularity; purely empirical result with no derivation chain

Full rationale

The paper reports an empirical outcome: a knowledge-distillation training procedure yields 9.3% CER (18.42% relative reduction) on public Chinese datasets. No equations, mathematical derivations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the provided text. The central claim rests on measured performance differences between a vanilla seq2seq baseline and the KD-trained model; any attribution questions fall under experimental controls rather than circularity. This matches the reader's 0.0 assessment and the default expectation that most empirical papers contain no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are identifiable from the provided text.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Non-Intrusive Automatic Speech Recognition Refinement: A Survey

    eess.AS 2025-08 accept novelty 4.0

    A survey that classifies non-intrusive ASR refinement methods into five categories, reviews domain adaptation and evaluation datasets, proposes standardized metrics, and identifies future research directions.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  1. [1]

    teach” the student (Seq2Seq decoder). Thus, we refer to the proposed training approach as “Learn Spelling from Teachers

    Introduction Attention based sequence-to-sequence (Seq2Seq) models have achieved promising performance in automatic speech recognition (ASR) [1, 2, 3, 4]. A Seq2Seq model consists of two components: an encoder encodes the acoustic feature sequence into a high level representation, and a decoder generates the corresponding word sequence. The encoder ...

  2. [2]

    Learn Spelling from Teachers: Transferring Knowledge from Language Models to Sequence-to-Sequence Speech Recognition

    Background: Seq2Seq models for ASR A basic Seq2Seq model is shown in Fig. 1(a). First, a speech signal is processed into an acoustic feature sequence. Then, an ... (footnotes: 1 http://openslr.org/33/, 2 http://openslr.org/55/)

  3. [3]

    Learn Spelling from Teachers

    Distilling knowledge from external LMs The basic idea of “Learn Spelling from Teachers” (LST) is: first, train an RNNLM on an external large scale text corpus, and then use this RNNLM to guide Seq2Seq model training. Besides 1-of-K hard labels provided by the transcriptions, the RNNLM provides soft labels, which carries the knowledge of the text corpus. Th...

  4. [4]

    KD was proposed for model compression [9]

    Related work Knowledge distillation. KD was proposed for model compression [9]. It is also referred to as teacher-student learning. Yoon et al. proposed to use KD to reduce the size of a Seq2Seq model for machine translation [13]. It has also been used for domain adaptation for acoustic models [14] and language models [15]. Different from these work, ou...

  5. [5]

    <unk>”, “<sos>

    Experiments 5.1. Datasets We use a Chinese corpus AISHELL-1 to evaluate our proposed approach [10]. The training set contains 150 hours of speech (120,098 utterances) recorded by 340 speakers. The development set contains 20 hours of speech (14,326 utterances) recorded by 40 speakers. And the test set contains 10 hours of speech (7,176 utterances) ...

  6. [6]

    An RNNLM is first trained on large scale external text data

    Conclusions In this paper, we propose LST training approach to integrating an external RNNLM into a Seq2Seq model. An RNNLM is first trained on large scale external text data. Then, the RNNLM provides soft labels of training transcriptions to train the Seq2Seq model. We used transformer based Seq2Seq as backbone, and conducted experiments on public avail...

  7. [7]

    Acknowledgements This work is supported by the National Key Research & Development Plan of China (No.2017YFB1002801), the National Natural Science Foundation of China (NSFC) (No.61425017, No.61831022, No.61773379, No.61603390), the Strategic Priority Research Program of Chinese Academy of Sciences (No.XDC02050100), and Inria-CAS Joint Research Project...

  8. [8]

    End-to-end attention-based large vocabulary speech recognition,

    D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, “End-to-end attention-based large vocabulary speech recognition,” international conference on acoustics, speech, and signal processing, pp. 4945–4949, 2016

  9. [9]

    Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,

    W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4960–4964

  10. [10]

    Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition,

    L. Dong, S. Xu, and B. Xu, “Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5884–5888

  11. [11]

    State-of-the-art speech recognition with sequence-to-sequence models,

    C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4774–4778

  12. [12]

    On using monolingual corpora in neural machine translation,

    C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H. Lin, F. Bougares, H. Schwenk, and Y. Bengio, “On using monolingual corpora in neural machine translation,” arXiv: Computation and Language, 2015

  13. [13]

    An analysis of incorporating an external language model into a sequence-to-sequence model,

    A. Kannan, Y. Wu, P. Nguyen, T. N. Sainath, Z. Chen, and R. Prabhavalkar, “An analysis of incorporating an external language model into a sequence-to-sequence model,” pp. 5824–5828, 2018

  14. [14]

    Cold fusion: Training seq2seq models together with language models

    A. Sriram, H. Jun, S. Satheesh, and A. Coates, “Cold fusion: Training seq2seq models together with language models.” pp. 387–391, 2018

  15. [15]

    Component fusion: Learning replaceable language model component for end-to-end speech recognition system,

    S. Changhao, C. Wen, G. Wang, D. Su, M. Luo, and D. Yu, “Component fusion: Learning replaceable language model component for end-to-end speech recognition system,” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019

  16. [16]

    Distilling the Knowledge in a Neural Network

    G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015

  17. [17]

    AIShell-1: An open-source mandarin speech corpus and a speech recognition baseline,

    H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, “AIShell-1: An open-source mandarin speech corpus and a speech recognition baseline,” in 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA). IEEE, 2017, pp. 1–5

  18. [18]

    CLMAD: A chinese language model adaptation dataset,

    Y. Bai, J. Tao, J. Yi, Z. Wen, and C. Fan, “CLMAD: A chinese language model adaptation dataset,” in The Eleventh International Symposium on Chinese Spoken Language Processing (ISCSLP 2018), 2018

  19. [19]

    Attention-based models for speech recognition,

    J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Advances in neural information processing systems, 2015, pp. 577–585

  20. [20]

    Sequence-Level Knowledge Distillation

    Y. Kim and A. M. Rush, “Sequence-level knowledge distillation,” arXiv preprint arXiv:1606.07947, 2016

  21. [21]

    Large-Scale Domain Adaptation via Teacher-Student Learning

    J. Li, M. L. Seltzer, X. Wang, R. Zhao, and Y. Gong, “Large-scale domain adaptation via teacher-student learning,” arXiv preprint arXiv:1708.05466, 2017

  22. [22]

    Efficient language model adaptation with noise contrastive estimation and kullback-leibler regularization,

    J. Andrés-Ferrer, N. Bodenstab, and P. Vozila, “Efficient language model adaptation with noise contrastive estimation and kullback-leibler regularization,” Proc. Interspeech 2018, pp. 3368–3372, 2018

  23. [23]

    Towards better decoding and language model integration in sequence to sequence models,

    J. Chorowski and N. Jaitly, “Towards better decoding and language model integration in sequence to sequence models,” Proc. Interspeech 2017, pp. 523–527, 2017

  24. [24]

    Regularizing Neural Networks by Penalizing Confident Output Distributions

    G. Pereyra, G. Tucker, J. Chorowski, Ł. Kaiser, and G. Hinton, “Regularizing neural networks by penalizing confident output distributions,” arXiv preprint arXiv:1701.06548, 2017

  25. [25]

    Scalable term selection for text categorization,

    J. Li and M. Sun, “Scalable term selection for text categorization,” in Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007

  26. [26]

    Xenc: An open-source tool for data selection in natural language processing,

    A. Rousseau, “Xenc: An open-source tool for data selection in natural language processing,” The Prague Bulletin of Mathematical Linguistics, vol. 100, pp. 73–82, 2013

  27. [27]

    Intelligent selection of language model training data,

    R. C. Moore and W. Lewis, “Intelligent selection of language model training data,” in Proceedings of the ACL 2010 conference short papers. Association for Computational Linguistics, 2010, pp. 220–224

  28. [28]

    Syllable-Based Sequence-to-Sequence Speech Recognition with the Transformer in Mandarin Chinese

    S. Zhou, L. Dong, S. Xu, and B. Xu, “Syllable-based sequence-to-sequence speech recognition with the transformer in mandarin chinese,” arXiv preprint arXiv:1804.10752, 2018

  29. [29]

    Using the Output Embedding to Improve Language Models

    O. Press and L. Wolf, “Using the output embedding to improve language models,” arXiv preprint arXiv:1608.05859, 2016