arxiv: 1907.04916 · v1 · pith:QAB5TIP5new · submitted 2019-07-08 · eess.AS · cs.CL · cs.LG · cs.SD · stat.ML

Listen, Attend, Spell and Adapt: Speaker Adapted Sequence-to-Sequence ASR

Pith reviewed 2026-05-25 00:51 UTC · model grok-4.3

classification eess.AS · cs.CL · cs.LG · cs.SD · stat.ML
keywords speaker adaptation · sequence-to-sequence ASR · Kullback-Leibler divergence · word error rate · automatic speech recognition · KLD adaptation · LHN adaptation · dictation data

The pith

KLD speaker adaptation on seq2seq ASR delivers 25% relative WER reduction, exceeding the 18.7% gain from conventional acoustic model adaptation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that sequence-to-sequence ASR models, previously compared mostly in speaker-independent settings, can be adapted to individual speakers using Kullback-Leibler divergence regularization or linear hidden networks. With up to 20 hours of adaptation data per speaker drawn from dictation material, the adapted seq2seq system records a larger relative word error rate reduction than acoustic-model adaptation does in a conventional pipeline. Word error rate falls log-linearly as the amount of adaptation data grows, and additional gains come from minimum-WER adaptation and language-model score fusion. A reader would care because seq2seq architectures are simpler than traditional pipelines yet now appear able to match or surpass their practical robustness without extra system complexity.

Core claim

Speaker-adapted sequence-to-sequence ASR using KLD regularization achieves a 25% relative word error rate improvement, compared with an 18.7% gain obtained by acoustic-model adaptation in a conventional system; the word error rate of the seq2seq model falls log-linearly with the quantity of adaptation data, and further reductions follow from minimum-WER adaptation and language-model fusion.
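
The two headline percentages are relative reductions, so each depends on its own system's speaker-independent baseline. A minimal sketch of the arithmetic follows. The absolute WER values are hypothetical placeholders chosen only to reproduce the reported ratios (this page gives no absolute numbers), and the function name is illustrative.

```python
def relative_wer_reduction(wer_si: float, wer_sa: float) -> float:
    """Fraction by which the speaker-adapted (SA) WER improves on the SI baseline."""
    if wer_si <= 0:
        raise ValueError("baseline WER must be positive")
    return (wer_si - wer_sa) / wer_si


# Hypothetical absolute WERs in percent (NOT taken from the paper).
seq2seq_gain = relative_wer_reduction(wer_si=8.0, wer_sa=6.0)
conventional_gain = relative_wer_reduction(wer_si=8.0, wer_sa=6.504)

# The same 25% relative gain from a different hypothetical baseline
# corresponds to a different absolute WER after adaptation.
other_gain = relative_wer_reduction(wer_si=12.0, wer_sa=9.0)

print(f"seq2seq: {seq2seq_gain:.1%}, conventional: {conventional_gain:.1%}")
```

Because the ratio hides the baseline, two systems can show the same relative gain while ending at different absolute error rates.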

What carries the argument

Kullback-Leibler divergence (KLD) adaptation, which adds a regularization term that keeps the adapted model's output distribution close to the speaker-independent baseline while the model is fine-tuned on speaker data.
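
The regularized objective can be sketched in a few lines. The form below follows Eq. (7) as excerpted in the reference graph on this page: for each output step, (1 − β) times the cross-entropy against the one-hot target plus β times the cross-entropy against the SI model's output distribution. The function names, the toy distributions, and the clipping constant `eps` are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import List, Sequence


def cross_entropy(target: Sequence[float], pred: Sequence[float], eps: float = 1e-12) -> float:
    """CE(target, pred) = -sum_k target_k * log(pred_k); eps guards against log(0)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, pred))


def kld_adaptation_loss(one_hot_targets: List[List[float]],
                        p_sa: List[List[float]],
                        p_si: List[List[float]],
                        beta: float) -> float:
    """Sum over output steps of (1 - beta) * CE(y*, p_SA) + beta * CE(p_SI, p_SA)."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    total = 0.0
    for y, p, q in zip(one_hot_targets, p_sa, p_si):
        total += (1.0 - beta) * cross_entropy(y, p) + beta * cross_entropy(q, p)
    return total


# Toy example: two output steps over a three-symbol vocabulary (made-up values).
targets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
p_sa = [[0.7, 0.2, 0.1], [0.2, 0.5, 0.3]]   # adapted model outputs
p_si = [[0.6, 0.3, 0.1], [0.3, 0.4, 0.3]]   # frozen SI model outputs
beta = 0.5

loss = kld_adaptation_loss(targets, p_sa, p_si, beta)

# By linearity, the same value is the CE against an interpolated soft target.
soft_targets = [[(1.0 - beta) * y_k + beta * q_k for y_k, q_k in zip(y, q)]
                for y, q in zip(targets, p_si)]
interp_loss = sum(cross_entropy(t, p) for t, p in zip(soft_targets, p_sa))
```

The cross-entropy term against the SI distribution differs from the KL divergence only by the entropy of the SI output, which does not depend on the adapted parameters, so minimizing either gives the same gradients. Setting β = 0 recovers plain fine-tuning.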

If this is right

  • Word error rate of the adapted seq2seq model continues to drop in a log-linear fashion as adaptation data increases to 20 hours per speaker.
  • Adapting under a minimum word-error-rate criterion and fusing scores with an adapted language model each produce additional performance gains beyond KLD alone.
  • LHN adaptation provides an alternative mechanism that can also be applied to the seq2seq encoder-decoder stack.
  • The overall seq2seq pipeline reaches or exceeds the accuracy of a fully adapted conventional ASR system under the tested conditions.
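
LHN adaptation is commonly described as inserting an extra linear layer after a hidden layer, initializing it to the identity so that the adapted model starts out identical to the SI model, and training only that layer on the speaker's data. A minimal sketch of the forward pass under that description follows. Which encoder layer the paper attaches it to is not stated on this page, and all names and values are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def identity_matrix(n: int) -> Matrix:
    """n x n identity, used so the inserted layer is initially a no-op."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def lhn_forward(hidden: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Apply the speaker-specific linear transform: weight @ hidden + bias."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weight, bias)]


hidden = [0.3, -1.2, 0.8]          # made-up hidden activation from the SI network
weight = identity_matrix(3)        # speaker-specific parameters, identity at start
bias = [0.0, 0.0, 0.0]

unadapted = lhn_forward(hidden, weight, bias)

# Stand-in for a gradient update on speaker data (not a real training step).
weight[0][1] += 0.1
adapted = lhn_forward(hidden, weight, bias)
```

Only `weight` and `bias` would be stored per speaker in this scheme, which keeps the per-speaker footprint small compared with fine-tuning the whole network.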

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Production deployments that already favor seq2seq for its end-to-end simplicity could adopt the same KLD procedure to handle speaker variation without maintaining separate acoustic and language model adaptation pipelines.
  • The log-linear scaling suggests a predictable schedule for collecting adaptation recordings: with WER linear in the logarithm of the data amount, each doubling of data yields a roughly constant absolute WER reduction.
  • The same regularization approach could be tested on other sources of mismatch such as accent or channel variation while keeping the core model unchanged.
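
To make the scaling remark concrete, the sketch below fits WER = a + b · log2(hours) by ordinary least squares. It reads "log-linear" as WER being linear in the logarithm of the data amount, which is an assumption about the paper's usage. The data points are synthetic and are not the paper's measurements.

```python
import math
from typing import Sequence, Tuple


def fit_log_linear(hours: Sequence[float], wers: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of wer = intercept + slope * log2(hours)."""
    xs = [math.log2(h) for h in hours]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(wers) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, wers))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic WERs (percent) that lose 0.5 points per doubling of data.
hours = [0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
wers = [10.0 - 0.5 * math.log2(h) for h in hours]

intercept, slope = fit_log_linear(hours, wers)
predicted_20h = intercept + slope * math.log2(20.0)
```

Under this reading the fitted `slope` is the absolute WER change per doubling of adaptation data, so the relative gain per doubling shrinks or grows depending on the current WER level.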

Load-bearing premise

The speaker-independent seq2seq baseline trained on large dictation data offers a fair and representative starting point for measuring adaptation gains against conventional systems.

What would settle it

A controlled experiment that retrains both the seq2seq and conventional baselines on identical data volumes and evaluates them on the same held-out speaker sets; if the relative WER advantage disappears or reverses, the central claim does not hold.

Figures

Figures reproduced from arXiv: 1907.04916 by Felix Weninger, Jesús Andrés-Ferrer, Puming Zhan, Xinwei Li.

Figure 1
Figure 1. [Caption not recovered from the source. The surviving labels (hidden states h_1 … h_T for the SA and SI networks, inputs x_1 … x_T, attention weights α_{i,1} … α_{i,T}, decoder states s_{i−1} and s_i) indicate a diagram of the speaker-adapted (SA) and speaker-independent (SI) attention encoder-decoder models.]
Figure 2
Figure 2. WER of the speaker adapted seq2seq system (KLD or encoder LHN adaptation) and the speaker adapted conventional ASR system (KLD) with various amounts of adaptation data.
Original abstract

Sequence-to-sequence (seq2seq) based ASR systems have shown state-of-the-art performances while having clear advantages in terms of simplicity. However, comparisons are mostly done on speaker independent (SI) ASR systems, though speaker adapted conventional systems are commonly used in practice for improving robustness to speaker and environment variations. In this paper, we apply speaker adaptation to seq2seq models with the goal of matching the performance of conventional ASR adaptation. Specifically, we investigate Kullback-Leibler divergence (KLD) as well as Linear Hidden Network (LHN) based adaptation for seq2seq ASR, using different amounts (up to 20 hours) of adaptation data per speaker. Our SI models are trained on large amounts of dictation data and achieve state-of-the-art results. We obtained 25% relative word error rate (WER) improvement with KLD adaptation of the seq2seq model vs. 18.7% gain from acoustic model adaptation in the conventional system. We also show that the WER of the seq2seq model decreases log-linearly with the amount of adaptation data. Finally, we analyze adaptation based on the minimum WER criterion and adapting the language model (LM) for score fusion with the speaker adapted seq2seq model, which result in further improvements of the seq2seq system performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript investigates speaker adaptation for sequence-to-sequence ASR, applying KLD and LHN adaptation to models trained on large dictation data. It reports a 25% relative WER improvement via KLD adaptation (vs. 18.7% gain from acoustic-model adaptation in a conventional system) using up to 20 h of per-speaker adaptation data, observes log-linear WER reduction with adaptation data volume, and examines minimum-WER adaptation plus LM fusion for additional gains.

Significance. If the relative-gain comparison rests on matched SI baselines, training data scale, test partitions, and adaptation regimes, the result would indicate that seq2seq models can be adapted at least as effectively as conventional systems while retaining architectural simplicity. The log-linear scaling observation would also supply a practical empirical guideline for adaptation-data requirements.

major comments (2)
  1. [Abstract] The headline claim of a 25% relative WER improvement for KLD-adapted seq2seq versus 18.7% for conventional acoustic-model adaptation is presented without absolute SI WER numbers for either system and without any statement that the speaker-independent baselines, training-data volumes, speaker test partitions, or adaptation-data regimes (up to 20 h per speaker) are matched between the seq2seq and conventional pipelines.
  2. [Abstract] No error bars, confidence intervals, data-split details, or statistical-significance tests are supplied for the reported relative improvements, leaving open the possibility that post-hoc experimental choices affect the central comparison.
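
One way the missing error bars could be obtained is a percentile bootstrap over speakers. The sketch below resamples speakers with replacement and recomputes the pooled relative WER reduction each time. The per-speaker error counts are randomly generated stand-ins (35 speakers, matching the evaluation-set size quoted in the reference graph), not the paper's data, and the procedure is a generic illustration rather than anything the authors report.

```python
import random
from typing import List, Tuple


def pooled_relative_gain(errors_si: List[int], errors_sa: List[int]) -> float:
    """Relative reduction in total word errors, pooled over speakers."""
    total_si = sum(errors_si)
    return (total_si - sum(errors_sa)) / total_si


def bootstrap_ci(errors_si: List[int], errors_sa: List[int],
                 n_resamples: int = 2000, alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval, resampling speakers with replacement."""
    rng = random.Random(seed)
    n = len(errors_si)
    gains = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        gains.append(pooled_relative_gain([errors_si[i] for i in idx],
                                          [errors_sa[i] for i in idx]))
    gains.sort()
    lower = gains[int((alpha / 2) * n_resamples)]
    upper = gains[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic per-speaker word-error counts (NOT from the paper).
data_rng = random.Random(42)
errors_si = [data_rng.randint(500, 1500) for _ in range(35)]
errors_sa = [round(e * data_rng.uniform(0.65, 0.85)) for e in errors_si]

point_estimate = pooled_relative_gain(errors_si, errors_sa)
ci_low, ci_high = bootstrap_ci(errors_si, errors_sa)
```

Resampling whole speakers rather than individual utterances keeps each speaker's utterances together, which matters because errors within one speaker are not independent.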

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for identifying areas where the abstract could be strengthened for clarity. We address each major comment below and will update the abstract in the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract] The headline claim of a 25% relative WER improvement for KLD-adapted seq2seq versus 18.7% for conventional acoustic-model adaptation is presented without absolute SI WER numbers for either system and without any statement that the speaker-independent baselines, training-data volumes, speaker test partitions, or adaptation-data regimes (up to 20 h per speaker) are matched between the seq2seq and conventional pipelines.

    Authors: We agree that the abstract would benefit from greater transparency. In the revision we will insert the absolute SI WER figures for both the seq2seq and conventional systems and add an explicit statement that the SI baselines were trained on the same large dictation corpus, evaluated on identical test partitions, and that the adaptation data volumes (up to 20 h per speaker) follow the same regime. revision: yes

  2. Referee: [Abstract] No error bars, confidence intervals, data-split details, or statistical-significance tests are supplied for the reported relative improvements, leaving open the possibility that post-hoc experimental choices affect the central comparison.

    Authors: The experimental section of the manuscript already details the data splits, speaker partitions, and the range of adaptation data volumes examined. To address the abstract-level concern we will add a concise clause noting that the reported gains are observed consistently across multiple adaptation-data sizes and speakers. Formal error bars and significance tests were not part of the original experimental design; we therefore cannot add them without new analysis and mark this point as only partially addressable in revision. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical WER comparisons with no derivations or self-referential reductions

full rationale

The paper reports experimental results on KLD and LHN adaptation for seq2seq ASR, claiming relative WER gains (25% vs. 18.7%) from adaptation data up to 20h per speaker. No equations, derivations, or first-principles claims appear in the abstract or described content. Results are direct empirical measurements on dictation data, with no fitted parameters renamed as predictions, no self-citation load-bearing on uniqueness theorems, and no ansatz or renaming of known results. The comparison of relative gains rests on experimental conditions rather than any tautological reduction to inputs. This matches the default case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model or derivation; the work is an empirical application study. No free parameters, axioms, or invented entities are introduced or required.



Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 4 internal anchors

  1. [1]

    Sequence-to-sequence (seq2seq) modeling [4, 5] is a state-of-the-art method for end-to-end ASR, which has shown competitive results compared to traditional ASR systems [6]

    Introduction End-to-end ASR systems have recently received increasing attention, due to their ability of integrating all components of an ASR system in a single deep neural network (DNN), which greatly simplifies and unifies the training and decoding process [1, 2, 3]. Sequence-to-sequence (seq2seq) modeling [4, 5] is a state-of-the-art method for end-to-...

  2. [2]

    However, relatively few studies so far deal with adaptation of the end-to-end ASR systems

    Relation to prior work Many approaches for adapting ‘conventional’ DNN acoustic models have been developed over the years, such as linear transformation based approaches in [10, 11, 12], training with regularization in [13, 14], and various forms of speaker identity vectors in [15, 16, 17]. However, relatively few studies so far deal with adaptation o...

  3. [3]

    Listen, Attend, Spell and Adapt: Speaker Adapted Sequence-to-Sequence ASR

    Methodology In this work, we use an encoder-decoder architecture with attention similar to Listen-Attend-Spell (LAS) [4], treating end-to-end ASR as a seq2seq learning task: The goal is to predict a sequence $y_i$ of symbols (here, we use sub-word units) from a sequence of acoustic features $x_t$, $t = 1, \dots, T$, where T i...

  4. [4]

    Here, $\beta$ is the regularization strength (KLD relevance), $\mathcal{L}_{\mathrm{CE}}$ is the cross-entropy (CE) loss, and $p_i^{\mathrm{SI}}$ is the output distribution of the SI model

    is to encourage the output distributions of the SA and SI models to be similar, by minimizing the loss: $\mathcal{L}_{\mathrm{KLD}} = \sum_i (1-\beta)\,\mathcal{L}_{\mathrm{CE}}(y_i^*, p_i) + \beta\,\mathcal{L}_{\mathrm{CE}}(p_i^{\mathrm{SI}}, p_i)$ (7). Here, $\beta$ is the regularization strength (KLD relevance), $\mathcal{L}_{\mathrm{CE}}$ is the cross-entropy (CE) loss, and $p_i^{\mathrm{SI}}$ is the output distribution of the SI model. The target $y_i^*$ is represented as a one-hot vector. 3.2. L...

  5. [5]

    Data set We perform our experiments on a dictation data set

    Experiments 4.1. Data set We perform our experiments on a dictation data set. All utterances are anonymized field data. The audio is sampled at 8 kHz. A training set of 7.6 k hours from 58 k speakers is used to train SI models. The performance is measured on an evaluation set with 35 speakers (392 k words). The speakers cover various dictation domains wi...

  6. [6]

    Adaptation of various parameter subsets We start our evaluation by adapting different parameter subsets of the seq2seq model with the KLD method, using 2 h of adaptation data

    Results 5.1. Adaptation of various parameter subsets We start our evaluation by adapting different parameter subsets of the seq2seq model with the KLD method, using 2 h of adaptation data. We conjecture that due to the importance of language modeling for the dictation task, adapting only the decoder should yield a significant gain as well. As can be seen fr...

  7. [7]

    Conclusions In this paper, we have presented several effective techniques for adapting seq2seq ASR systems. Significant gains have been achieved even with a few minutes of speech data, and at the same time we have shown that seq2seq ASR systems can exploit larger amounts of adaptation data effectively. Furthermore, we have achieved state-of-the-art perform...

  8. [8]

    Acknowledgements We would like to thank Peter Skala and Ming Yang for their help with the baseline ASR adaptation and many helpful discussions

  9. [9]

    Sequence Transduction with Recurrent Neural Networks

    A. Graves, “Sequence transduction with recurrent neural networks,” arXiv:1211.3711, 2012

  10. [10]

    Exploring neural transducers for end-to-end speech recognition

    E. Battenberg, J. Chen, R. Child, A. Coates, Y. G. Y. Li, H. Liu, S. Satheesh, A. Sriram, and Z. Zhu, “Exploring neural transducers for end-to-end speech recognition,” in Proc. of IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). Okinawa, Japan: IEEE, 2017, pp. 206–213

  11. [11]

    Advancing acoustic-to-word CTC model

    J. Li, G. Ye, A. Das, R. Zhao, and Y. Gong, “Advancing acoustic-to-word CTC model,” in Proc. of ICASSP. Calgary, Canada: IEEE, 2018, pp. 5794–5798

  12. [12]

    Listen, attend and spell: A neural network for large vocabulary conversational speech recognition

    W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Proc. of ICASSP. Shanghai, China: IEEE, 2016, pp. 4960–4964

  13. [13]

    Convolutional sequence to sequence learning

    J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, “Convolutional sequence to sequence learning,” in Proc. of 34th International Conference on Machine Learning (ICML). Sydney, Australia: PMLR, 2017, pp. 1243–1252

  14. [14]

    State-of-the-art speech recognition with sequence-to-sequence models

    C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in Proc. of ICASSP. Calgary, Canada: IEEE, 2018, pp. 4774–4778

  15. [15]

    Speech recognition with deep recurrent neural networks

    A. Graves, A. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013, pp. 6645–6649

  16. [16]

    Towards better decoding and language model integration in sequence to sequence models

    J. Chorowski and N. Jaitly, “Towards better decoding and language model integration in sequence to sequence models,” arXiv:1612.02695, 2017

  17. [17]

    An analysis of incorporating an external language model into a sequence-to-sequence model

    A. Kannan, Y. Wu, P. Nguyen, T. N. Sainath, Z. Chen, and R. Prabhavalkar, “An analysis of incorporating an external language model into a sequence-to-sequence model,” in Proc. of ICASSP. IEEE, 2018, pp. 5824–5828

  18. [18]

    Speaker adaptation for hybrid HMM-ANN continuous speech recognition system

    J. Neto, L. Almeida, M. Hochberg, C. Martins, L. Nunes, S. Renals, and T. Robinson, “Speaker adaptation for hybrid HMM-ANN continuous speech recognition system,” in European Conference on Speech Communication and Technology, Madrid, Spain, 1995, pp. 2171–2174

  19. [19]

    Linear hidden transformations for adaptation of hybrid ANN/HMM models

    R. Gemello, F. Mana, S. Scanzio, P. Laface, and R. De Mori, “Linear hidden transformations for adaptation of hybrid ANN/HMM models,” Speech Communication, vol. 49, no. 10-11, pp. 827–835, 2007

  20. [20]

    Intermediate-layer DNN adaptation for offline and session-based iterative speaker adaptation

    K. Kumar, C. Liu, K. Yao, and Y. Gong, “Intermediate-layer DNN adaptation for offline and session-based iterative speaker adaptation,” in Proc. of INTERSPEECH. Dresden, Germany: ISCA, 2015

  21. [21]

    Regularized adaptation of discriminative classifiers

    X. Li and J. Bilmes, “Regularized adaptation of discriminative classifiers,” in Proc. of ICASSP, vol. 1. Toulouse, France: IEEE, 2006, pp. 1–237

  22. [22]

    KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition

    D. Yu, K. Yao, H. Su, G. Li, and F. Seide, “KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition,” in Proc. of ICASSP. Vancouver, Canada: IEEE, 2013, pp. 7893–7897

  23. [23]

    Speaker adaptation of neural network acoustic models using i-vectors

    G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, “Speaker adaptation of neural network acoustic models using i-vectors,” in Proc. of IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). Olomouc, Czech Republic: IEEE, 2013, pp. 55–59

  24. [24]

    X-vectors: Robust DNN embeddings for speaker recognition

    D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in Proc. of ICASSP. Calgary, Canada: IEEE, 2018, pp. 5329–5333

  25. [25]

    Sequence summarizing neural network for speaker adaptation

    K. Veselý, S. Watanabe, K. Žmolíková, M. Karafiát, L. Burget, and J. H. Černocký, “Sequence summarizing neural network for speaker adaptation,” in Proc. of ICASSP. Shanghai, China: IEEE, 2016, pp. 5315–5319

  26. [26]

    Auxiliary feature based adaptation of end-to-end ASR systems

    M. Delcroix, S. Watanabe, A. Ogawa, S. Karita, and T. Nakatani, “Auxiliary feature based adaptation of end-to-end ASR systems,” in Proc. of INTERSPEECH. Hyderabad, India: ISCA, 2018, pp. 2444–2448

  27. [27]

    Speaker adaptation for multichannel end-to-end speech recognition

    T. Ochiai, S. Watanabe, S. Katagiri, T. Hori, and J. Hershey, “Speaker adaptation for multichannel end-to-end speech recognition,” in Proc. of ICASSP. Calgary, Canada: IEEE, 2018, pp. 6707–6711

  28. [28]

    Speaker adaptation for end-to-end CTC models

    K. Li, J. Li, Y. Zhao, K. Kumar, and Y. Gong, “Speaker adaptation for end-to-end CTC models,” in Proc. of IEEE Spoken Language Technology Workshop (SLT). Athens, Greece: IEEE, 2018, pp. 542–549

  29. [29]

    Recurrent neural network language model adaptation for conversational speech recognition

    K. Li, H. Xu, Y. Wang, D. Povey, and S. Khudanpur, “Recurrent neural network language model adaptation for conversational speech recognition,” in Proc. of INTERSPEECH, Hyderabad, India, 2018, pp. 3373–3377

  30. [30]

    A fast and simple algorithm for training neural probabilistic language models

    A. Mnih and Y. W. Teh, “A fast and simple algorithm for training neural probabilistic language models,” in Proceedings of the International Conference on Machine Learning, 2012

  31. [31]

    Efficient language model adaptation with noise contrastive estimation and Kullback-Leibler regularization

    J. Andrés-Ferrer, N. Bodenstab, and P. Vozila, “Efficient language model adaptation with noise contrastive estimation and Kullback-Leibler regularization,” in Proc. of INTERSPEECH. Hyderabad, India: ISCA, 2018, pp. 3368–3372

  32. [32]

    Deep context: End-to-end contextual speech recognition

    G. Pundak, T. N. Sainath, R. Prabhavalkar, A. Kannan, and D. Zhao, “Deep context: End-to-end contextual speech recognition,” in SLT. IEEE, 2018, pp. 418–425

  33. [33]

    Neural machine translation by jointly learning to align and translate

    D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proc. of International Conference on Learning Representations (ICLR). San Diego, CA: open publishing, 2015

  34. [34]

    DNN online adaptation for automatic speech recognition

    X. Li, Y. Pan, M. Gibson, and P. Zhan, “DNN online adaptation for automatic speech recognition,” in Proc. of 29th Conference on Electronic Speech Signal Processing (ESSV). Ulm, Germany: TUDpress, 2018

  35. [35]

    Minimum word error rate training for attention-based sequence-to-sequence models

    R. Prabhavalkar, T. N. Sainath, Y. Wu, P. Nguyen, Z. Chen, C.-C. Chiu, and A. Kannan, “Minimum word error rate training for attention-based sequence-to-sequence models,” in Proc. of ICASSP. Calgary, Canada: IEEE, 2018, pp. 4839–4843

  36. [36]

    Scalable Minimum Bayes Risk training of Deep Neural Network acoustic models using distributed Hessian-free optimization

    B. Kingsbury, T. N. Sainath, and H. Soltau, “Scalable Minimum Bayes Risk training of Deep Neural Network acoustic models using distributed Hessian-free optimization,” in Proc. of INTERSPEECH. Portland, OR: ISCA, 2012

  37. [37]

    A comparison of techniques for language model integration in encoder-decoder speech recognition

    S. Toshniwal, A. Kannan, C.-C. Chiu, Y. Wu, T. Sainath, and K. Livescu, “A comparison of techniques for language model integration in encoder-decoder speech recognition,” in Proc. of IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 2018, pp. 369–375

  38. [38]

    Dropout: A simple way to prevent neural networks from overfitting

    N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014

  39. [39]

    Rethinking the inception architecture for computer vision

    C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV: IEEE, 2016, pp. 2818–2826

  40. [40]

    Recurrent Neural Network Regularization

    W. Zaremba, I. Sutskever, and O. Vinyals, “Recurrent neural network regularization,” 2014. [Online]. Available: http://arxiv.org/abs/1409.2329

  41. [41]

    Linearly augmented deep neural network

    P. Ghahremani, J. Droppo, and M. L. Seltzer, “Linearly augmented deep neural network,” in Proc. of ICASSP. Shanghai, China: IEEE, 2016, pp. 5085–5089