Listen, Attend, Spell and Adapt: Speaker Adapted Sequence-to-Sequence ASR
Pith reviewed 2026-05-25 00:51 UTC · model grok-4.3
The pith
KLD speaker adaptation on seq2seq ASR delivers 25% relative WER reduction, exceeding the 18.7% gain from conventional acoustic model adaptation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Speaker-adapted sequence-to-sequence ASR using KLD regularization achieves a 25% relative word error rate improvement, compared with an 18.7% gain obtained by acoustic-model adaptation in a conventional system; the word error rate of the seq2seq model falls log-linearly with the quantity of adaptation data, and further reductions follow from minimum-WER adaptation and language-model fusion.
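A relative WER reduction is the drop in WER divided by the baseline WER, so the two headline percentages do not by themselves say which system ends up with the lower absolute WER. The sketch below only illustrates that arithmetic. The function name and the WER values are hypothetical, since the abstract reports no absolute numbers.

```python
def relative_wer_reduction(wer_baseline: float, wer_adapted: float) -> float:
    """Fraction of the baseline WER removed by adaptation."""
    if wer_baseline <= 0:
        raise ValueError("baseline WER must be positive")
    return (wer_baseline - wer_adapted) / wer_baseline


# Hypothetical WERs in percent (NOT taken from the paper).
print(relative_wer_reduction(8.0, 6.0))  # 0.25, i.e. a 25% relative reduction
```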
What carries the argument
Kullback-Leibler divergence (KLD) adaptation, which adds a regularization term that keeps the adapted model's output distribution close to the speaker-independent baseline while the model is fine-tuned on speaker data.
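The regularizer can be sketched as a weighted sum of two cross-entropies per output step, following the loss quoted in the paper's methodology excerpt: one against the one-hot reference label and one against the frozen SI model's output distribution. This is a minimal NumPy sketch and not the authors' implementation; the function names, array shapes, and the small `eps` guard are assumptions.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def cross_entropy(target: np.ndarray, pred: np.ndarray, eps: float = 1e-12) -> float:
    """Cross-entropy between target and predicted distributions, summed over steps."""
    return float(-(target * np.log(pred + eps)).sum())


def kld_adaptation_loss(targets, sa_logits, si_logits, beta: float) -> float:
    """(1 - beta) * CE(one-hot, p) + beta * CE(p_SI, p), summed over output steps.

    targets:   (steps,) integer reference labels
    sa_logits: (steps, vocab) logits of the model being adapted
    si_logits: (steps, vocab) logits of the frozen speaker-independent model
    """
    p_sa = softmax(sa_logits)
    p_si = softmax(si_logits)
    one_hot = np.eye(sa_logits.shape[-1])[targets]
    return (1.0 - beta) * cross_entropy(one_hot, p_sa) + beta * cross_entropy(p_si, p_sa)


# Toy usage with random logits (illustrative sizes only).
rng = np.random.default_rng(0)
sa_logits = rng.normal(size=(3, 5))
si_logits = rng.normal(size=(3, 5))
targets = np.array([1, 4, 0])
loss = kld_adaptation_loss(targets, sa_logits, si_logits, beta=0.5)
```

Setting `beta = 0` recovers plain fine-tuning on the speaker data, while `beta = 1` trains the adapted model only to imitate the SI model.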
If this is right
- Word error rate of the adapted seq2seq model continues to drop in a log-linear fashion as adaptation data increases to 20 hours per speaker.
- Adapting under a minimum word-error-rate criterion and fusing scores with an adapted language model each produce additional performance gains beyond KLD alone.
- LHN adaptation provides an alternative mechanism that can also be applied to the seq2seq encoder-decoder stack.
- The overall seq2seq pipeline reaches or exceeds the accuracy of a fully adapted conventional ASR system under the tested conditions.
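LHN adaptation, as it is generally described in the adaptation literature, inserts a speaker-specific affine layer after a hidden layer of the frozen SI network and trains only that layer. The sketch below shows the idea in NumPy. The class name, the identity initialisation, and the toy layer sizes are assumptions, since this page does not say where the paper places the transform.

```python
import numpy as np


class LinearHiddenNetwork:
    """Speaker-specific affine transform applied to a hidden activation."""

    def __init__(self, dim: int):
        # Identity initialisation (assumed): before any update, the adapted
        # model produces exactly the SI model's activations.
        self.weight = np.eye(dim)
        self.bias = np.zeros(dim)

    def __call__(self, hidden: np.ndarray) -> np.ndarray:
        return hidden @ self.weight.T + self.bias

    def num_parameters(self) -> int:
        return self.weight.size + self.bias.size


rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 8))   # 5 frames of an 8-dim hidden layer (toy sizes)
lhn = LinearHiddenNetwork(dim=8)
adapted = lhn(hidden)              # equals `hidden` until the layer is trained
print(lhn.num_parameters())        # 72 speaker-specific parameters
```

Only `weight` and `bias` would be stored per speaker, which keeps the per-speaker footprint small compared with adapting the full model.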
Where Pith is reading between the lines
- Production deployments that already favor seq2seq for its end-to-end simplicity could adopt the same KLD procedure to handle speaker variation without maintaining separate acoustic and language model adaptation pipelines.
- The log-linear scaling suggests a predictable schedule for collecting adaptation recordings: if WER is linear in the logarithm of the data amount, each doubling of adaptation data buys roughly the same absolute WER reduction.
- The same regularization approach could be tested on other sources of mismatch such as accent or channel variation while keeping the core model unchanged.
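One way to read the log-linear observation is as a straight line in log2(hours), which can be fitted and extrapolated with ordinary least squares. The sketch below assumes that form; the data points are synthetic, generated from an assumed line, and are not results from the paper.

```python
import numpy as np

# Synthetic points generated from an assumed line (NOT results from the paper):
# WER = 10.0 - 0.4 * log2(hours)
hours = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 16.0])
wer = 10.0 - 0.4 * np.log2(hours)

# Least-squares fit of WER against log2(hours); polyfit returns [slope, intercept].
slope, intercept = np.polyfit(np.log2(hours), wer, deg=1)


def predict_wer(h: float) -> float:
    """Extrapolate WER (in percent) for h hours of adaptation data."""
    return float(intercept + slope * np.log2(h))


# Absolute WER points gained each time the adaptation data doubles.
gain_per_doubling = float(-slope)
print(round(gain_per_doubling, 3))  # 0.4
```

Under this model the gain per doubling is constant in absolute WER points, so the relative gain per doubling grows slowly as the WER itself shrinks.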
Load-bearing premise
The speaker-independent seq2seq baseline trained on large dictation data offers a fair and representative starting point for measuring adaptation gains against conventional systems.
What would settle it
A controlled experiment that retrains both the seq2seq and conventional baselines on identical data volumes and evaluates them on the same held-out speaker sets; if the relative WER advantage disappears or reverses, the central claim does not hold.
Original abstract
Sequence-to-sequence (seq2seq) based ASR systems have shown state-of-the-art performances while having clear advantages in terms of simplicity. However, comparisons are mostly done on speaker independent (SI) ASR systems, though speaker adapted conventional systems are commonly used in practice for improving robustness to speaker and environment variations. In this paper, we apply speaker adaptation to seq2seq models with the goal of matching the performance of conventional ASR adaptation. Specifically, we investigate Kullback-Leibler divergence (KLD) as well as Linear Hidden Network (LHN) based adaptation for seq2seq ASR, using different amounts (up to 20 hours) of adaptation data per speaker. Our SI models are trained on large amounts of dictation data and achieve state-of-the-art results. We obtained 25% relative word error rate (WER) improvement with KLD adaptation of the seq2seq model vs. 18.7% gain from acoustic model adaptation in the conventional system. We also show that the WER of the seq2seq model decreases log-linearly with the amount of adaptation data. Finally, we analyze adaptation based on the minimum WER criterion and adapting the language model (LM) for score fusion with the speaker adapted seq2seq model, which result in further improvements of the seq2seq system performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript investigates speaker adaptation for sequence-to-sequence ASR, applying KLD and LHN adaptation to models trained on large dictation data. It reports a 25% relative WER improvement via KLD adaptation (vs. 18.7% gain from acoustic-model adaptation in a conventional system) using up to 20 h of per-speaker adaptation data, observes log-linear WER reduction with adaptation data volume, and examines minimum-WER adaptation plus LM fusion for additional gains.
Significance. If the relative-gain comparison rests on matched SI baselines, training data scale, test partitions, and adaptation regimes, the result would indicate that seq2seq models can be adapted at least as effectively as conventional systems while retaining architectural simplicity. The log-linear scaling observation would also supply a practical empirical guideline for adaptation-data requirements.
Major comments (2)
- [Abstract] The headline claim of 25% relative WER improvement for KLD-adapted seq2seq versus 18.7% for conventional acoustic-model adaptation is presented without absolute SI WER numbers for either system and without any statement that the speaker-independent baselines, training-data volumes, speaker test partitions, or adaptation-data regimes (up to 20 h per speaker) are matched between the seq2seq and conventional pipelines.
- [Abstract] No error bars, confidence intervals, data-split details, or statistical-significance tests are supplied for the reported relative improvements, leaving open the possibility that post-hoc experimental choices affect the central comparison.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for identifying areas where the abstract could be strengthened for clarity. We address each major comment below and will update the abstract in the revised manuscript.
- Referee: [Abstract] The headline claim of 25% relative WER improvement for KLD-adapted seq2seq versus 18.7% for conventional acoustic-model adaptation is presented without absolute SI WER numbers for either system and without any statement that the speaker-independent baselines, training-data volumes, speaker test partitions, or adaptation-data regimes (up to 20 h per speaker) are matched between the seq2seq and conventional pipelines.
  Authors: We agree that the abstract would benefit from greater transparency. In the revision we will insert the absolute SI WER figures for both the seq2seq and conventional systems and add an explicit statement that the SI baselines were trained on the same large dictation corpus, evaluated on identical test partitions, and that the adaptation data volumes (up to 20 h per speaker) follow the same regime.
  Revision: yes
- Referee: [Abstract] No error bars, confidence intervals, data-split details, or statistical-significance tests are supplied for the reported relative improvements, leaving open the possibility that post-hoc experimental choices affect the central comparison.
  Authors: The experimental section of the manuscript already details the data splits, speaker partitions, and the range of adaptation data volumes examined. To address the abstract-level concern we will add a concise clause noting that the reported gains are observed consistently across multiple adaptation-data sizes and speakers. Formal error bars and significance tests were not part of the original experimental design; we therefore cannot add them without new analysis and mark this point as only partially addressable in revision.
  Revision: partial
Circularity Check
No circularity: empirical WER comparisons with no derivations or self-referential reductions
The paper reports experimental results on KLD and LHN adaptation for seq2seq ASR, claiming relative WER gains (25% vs. 18.7%) from adaptation data up to 20h per speaker. No equations, derivations, or first-principles claims appear in the abstract or described content. Results are direct empirical measurements on dictation data, with no fitted parameters renamed as predictions, no self-citation load-bearing on uniqueness theorems, and no ansatz or renaming of known results. The comparison of relative gains rests on experimental conditions rather than any tautological reduction to inputs. This matches the default case of a self-contained empirical study.
Reference graph
Works this paper leans on
- [1] Introduction. End-to-end ASR systems have recently received increasing attention, due to their ability of integrating all components of an ASR system in a single deep neural network (DNN), which greatly simplifies and unifies the training and decoding process [1, 2, 3]. Sequence-to-sequence (seq2seq) modeling [4, 5] is a state-of-the-art method for end-to-...
- [2] Relation to prior work. Many approaches for adapting ‘conventional’ DNN acoustic models have been developed over the years, such as linear transformation based approaches in [10, 11, 12], training with regularization in [13, 14], and various forms of speaker identity vectors in [15, 16, 17]. However, relatively few studies so far deal with adaptation o...
- [3] Methodology (arXiv:1907.04916v1 [eess.AS], 8 Jul 2019). In this work, we use an encoder-decoder architecture with attention similar to Listen-Attend-Spell (LAS) [4], treating end-to-end ASR as a seq2seq learning task: The goal is to predict a sequence $y_i$ of symbols (here, we use sub-word units) from a sequence of acoustic features $x_t$, $t = 1, \dots, T$, where T i...
- [4] ... is to encourage the output distributions of the SA and SI models to be similar, by minimizing the loss: $L_{\mathrm{KLD}} = \sum_i (1-\beta)\, L_{\mathrm{CE}}(y_i^*, p_i) + \beta\, L_{\mathrm{CE}}(p_i^{\mathrm{SI}}, p_i)$ (7). Here, $\beta$ is the regularization strength (KLD relevance), $L_{\mathrm{CE}}$ is the cross-entropy (CE) loss, and $p_i^{\mathrm{SI}}$ is the output distribution of the SI model. The target $y_i^*$ is represented as a one-hot vector. 3.2. L...
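Because cross-entropy is linear in its target argument, the two-term loss in Eq. (7) can be rewritten as a single cross-entropy against an interpolated target. The derivation below is a restatement of that standard identity, not text from the paper. It assumes $L_{\mathrm{CE}}(q, p) = -\sum_k q_k \log p_k$, where $k$ (an index introduced here) runs over output symbols and the sum over $i$ covers both terms.

```latex
\begin{align}
L_{\mathrm{KLD}}
  &= \sum_i \Big[ (1-\beta)\, L_{\mathrm{CE}}(y_i^*, p_i)
       + \beta\, L_{\mathrm{CE}}(p_i^{\mathrm{SI}}, p_i) \Big] \\
  &= -\sum_i \sum_k \Big[ (1-\beta)\, y_{i,k}^* + \beta\, p_{i,k}^{\mathrm{SI}} \Big] \log p_{i,k} \\
  &= \sum_i L_{\mathrm{CE}}(\hat{y}_i, p_i),
  \qquad \hat{y}_i = (1-\beta)\, y_i^* + \beta\, p_i^{\mathrm{SI}} .
\end{align}
```

The second term also explains the name: $L_{\mathrm{CE}}(p_i^{\mathrm{SI}}, p_i)$ equals the KL divergence from $p_i^{\mathrm{SI}}$ to $p_i$ plus the entropy of $p_i^{\mathrm{SI}}$, and that entropy does not depend on the adapted parameters.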
- [5] Experiments. 4.1. Data set. We perform our experiments on a dictation data set. All utterances are anonymized field data. The audio is sampled at 8 kHz. A training set of 7.6 k hours from 58 k speakers is used to train SI models. The performance is measured on an evaluation set with 35 speakers (392 k words). The speakers cover various dictation domains wi...
- [6] Results. 5.1. Adaptation of various parameter subsets. We start our evaluation by adapting different parameter subsets of the seq2seq model with the KLD method, using 2 h of adaptation data. We conjecture that due to the importance of language modeling for the dictation task, adapting only the decoder should yield a significant gain as well. As can be seen fr...
- [7] Conclusions. In this paper, we have presented several effective techniques for adapting seq2seq ASR systems. Significant gains have been achieved even with a few minutes of speech data, and at the same time we have shown that seq2seq ASR systems can exploit larger amounts of adaptation data effectively. Furthermore, we have achieved state-of-the-art perform...
- [8] Acknowledgements. We would like to thank Peter Skala and Ming Yang for their help with the baseline ASR adaptation and many helpful discussions.
- [9] A. Graves, “Sequence transduction with recurrent neural networks,” arXiv:1211.3711, 2012.
- [10] E. Battenberg, J. Chen, R. Child, A. Coates, Y. G. Y. Li, H. Liu, S. Satheesh, A. Sriram, and Z. Zhu, “Exploring neural transducers for end-to-end speech recognition,” in Proc. of IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). Okinawa, Japan: IEEE, 2017, pp. 206–213.
- [11] J. Li, G. Ye, A. Das, R. Zhao, and Y. Gong, “Advancing acoustic-to-word CTC model,” in Proc. of ICASSP. Calgary, Canada: IEEE, 2018, pp. 5794–5798.
- [12] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Proc. of ICASSP. Shanghai, China: IEEE, 2016, pp. 4960–4964.
- [13] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, “Convolutional sequence to sequence learning,” in Proc. of 34th International Conference on Machine Learning (ICML). Sydney, Australia: PMLR, 2017, pp. 1243–1252.
- [14] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in Proc. of ICASSP. Calgary, Canada: IEEE, 2018, pp. 4774–4778.
- [15] A. Graves, A. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013, pp. 6645–6649.
- [16] J. Chorowski and N. Jaitly, “Towards better decoding and language model integration in sequence to sequence models,” arXiv:1612.02695, 2017.
- [17] A. Kannan, Y. Wu, P. Nguyen, T. N. Sainath, Z. Chen, and R. Prabhavalkar, “An analysis of incorporating an external language model into a sequence-to-sequence model,” in Proc. of ICASSP. IEEE, 2018, pp. 5824–5828.
- [18] J. Neto, L. Almeida, M. Hochberg, C. Martins, L. Nunes, S. Renals, and T. Robinson, “Speaker adaptation for hybrid HMM-ANN continuous speech recognition system,” in European Conference on Speech Communication and Technology, Madrid, Spain, 1995, pp. 2171–2174.
- [19] R. Gemello, F. Mana, S. Scanzio, P. Laface, and R. De Mori, “Linear hidden transformations for adaptation of hybrid ANN/HMM models,” Speech Communication, vol. 49, no. 10-11, pp. 827–835, 2007.
- [20] K. Kumar, C. Liu, K. Yao, and Y. Gong, “Intermediate-layer DNN adaptation for offline and session-based iterative speaker adaptation,” in Proc. of INTERSPEECH. Dresden, Germany: ISCA, 2015.
- [21] X. Li and J. Bilmes, “Regularized adaptation of discriminative classifiers,” in Proc. of ICASSP, vol. 1. Toulouse, France: IEEE, 2006, pp. 1–237.
- [22] D. Yu, K. Yao, H. Su, G. Li, and F. Seide, “KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition,” in Proc. of ICASSP. Vancouver, Canada: IEEE, 2013, pp. 7893–7897.
- [23] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, “Speaker adaptation of neural network acoustic models using i-vectors,” in Proc. of IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). Olomouc, Czech Republic: IEEE, 2013, pp. 55–59.
- [24] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in Proc. of ICASSP. Calgary, Canada: IEEE, 2018, pp. 5329–5333.
- [25] K. Veselý, S. Watanabe, K. Žmolíková, M. Karafiát, L. Burget, and J. H. Černocký, “Sequence summarizing neural network for speaker adaptation,” in Proc. of ICASSP. Shanghai, China: IEEE, 2016, pp. 5315–5319.
- [26] M. Delcroix, S. Watanabe, A. Ogawa, S. Karita, and T. Nakatani, “Auxiliary feature based adaptation of end-to-end ASR systems,” in Proc. of INTERSPEECH. Hyderabad, India: ISCA, 2018, pp. 2444–2448.
- [27] T. Ochiai, S. Watanabe, S. Katagiri, T. Hori, and J. Hershey, “Speaker adaptation for multichannel end-to-end speech recognition,” in Proc. of ICASSP. Calgary, Canada: IEEE, 2018, pp. 6707–6711.
- [28] K. Li, J. Li, Y. Zhao, K. Kumar, and Y. Gong, “Speaker adaptation for end-to-end CTC models,” in Proc. of IEEE Spoken Language Technology Workshop (SLT). Athens, Greece: IEEE, 2018, pp. 542–549.
- [29] K. Li, H. Xu, Y. Wang, D. Povey, and S. Khudanpur, “Recurrent neural network language model adaptation for conversational speech recognition,” in Proc. of INTERSPEECH, Hyderabad, India, 2018, pp. 3373–3377.
- [30] A. Mnih and Y. W. Teh, “A fast and simple algorithm for training neural probabilistic language models,” in Proceedings of the International Conference on Machine Learning, 2012.
- [31] J. Andrés-Ferrer, N. Bodenstab, and P. Vozila, “Efficient language model adaptation with noise contrastive estimation and Kullback-Leibler regularization,” in Proc. of INTERSPEECH. Hyderabad, India: ISCA, 2018, pp. 3368–3372.
- [32] G. Pundak, T. N. Sainath, R. Prabhavalkar, A. Kannan, and D. Zhao, “Deep context: End-to-end contextual speech recognition,” in SLT. IEEE, 2018, pp. 418–425.
- [33] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proc. of International Conference on Learning Representations (ICLR). San Diego, CA: open publishing, 2015.
- [34] X. Li, Y. Pan, M. Gibson, and P. Zhan, “DNN online adaptation for automatic speech recognition,” in Proc. of 29th Conference on Electronic Speech Signal Processing (ESSV). Ulm, Germany: TUDpress, 2018.
- [35] R. Prabhavalkar, T. N. Sainath, Y. Wu, P. Nguyen, Z. Chen, C.-C. Chiu, and A. Kannan, “Minimum word error rate training for attention-based sequence-to-sequence models,” in Proc. of ICASSP. Calgary, Canada: IEEE, 2018, pp. 4839–4843.
- [36] B. Kingsbury, T. N. Sainath, and H. Soltau, “Scalable Minimum Bayes Risk training of Deep Neural Network acoustic models using distributed Hessian-free optimization,” in Proc. of INTERSPEECH. Portland, OR: ISCA, 2012.
- [37] S. Toshniwal, A. Kannan, C.-C. Chiu, Y. Wu, T. Sainath, and K. Livescu, “A comparison of techniques for language model integration in encoder-decoder speech recognition,” in Proc. of IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 2018, pp. 369–375.
- [38] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
- [39] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV: IEEE, 2016, pp. 2818–2826.
- [40] W. Zaremba, I. Sutskever, and O. Vinyals, “Recurrent neural network regularization,” 2014. [Online]. Available: http://arxiv.org/abs/1409.2329
- [41] P. Ghahremani, J. Droppo, and M. L. Seltzer, “Linearly augmented deep neural network,” in Proc. of ICASSP. Shanghai, China: IEEE, 2016, pp. 5085–5089.