Listen, Attend, Spell and Adapt: Speaker Adapted Sequence-to-Sequence ASR

Felix Weninger; Jesús Andrés-Ferrer; Puming Zhan; Xinwei Li

arXiv: 1907.04916 · v1 · pith:QAB5TIP5new · submitted 2019-07-08 · eess.AS · cs.CL · cs.LG · cs.SD · stat.ML


Pith reviewed 2026-05-25 00:51 UTC · model grok-4.3


keywords: speaker adaptation · sequence-to-sequence ASR · Kullback-Leibler divergence · word error rate · automatic speech recognition · KLD adaptation · LHN adaptation · dictation data


The pith

KLD speaker adaptation on seq2seq ASR delivers 25% relative WER reduction, exceeding the 18.7% gain from conventional acoustic model adaptation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that sequence-to-sequence ASR models, previously compared mostly in speaker-independent settings, can be adapted to individual speakers using Kullback-Leibler divergence (KLD) regularization or linear hidden networks (LHN). With up to 20 hours of adaptation data per speaker drawn from dictation material, the adapted seq2seq system records a larger relative word error rate (WER) drop than acoustic-model adaptation in a conventional pipeline. WER falls log-linearly as the amount of adaptation data grows, and additional gains come from minimum-WER adaptation and language-model score fusion. A reader would care because seq2seq architectures are simpler than traditional pipelines yet now appear able to match or surpass their practical robustness without extra system complexity.

Core claim

Speaker-adapted sequence-to-sequence ASR using KLD regularization achieves a 25% relative word error rate improvement, compared with an 18.7% gain obtained by acoustic-model adaptation in a conventional system; the word error rate of the seq2seq model falls log-linearly with the quantity of adaptation data, and further reductions follow from minimum-WER adaptation and language-model fusion.
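The two headline figures are relative reductions, so the absolute WERs behind them matter when the systems are compared. A minimal sketch of the arithmetic follows. The absolute WER values are hypothetical placeholders chosen only to reproduce the 25% and 18.7% figures, since the text above does not report absolute numbers, and the function name is illustrative.

```python
def relative_wer_reduction(wer_baseline: float, wer_adapted: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if wer_baseline <= 0:
        raise ValueError("baseline WER must be positive")
    return (wer_baseline - wer_adapted) / wer_baseline


# Hypothetical absolute WERs in percent; NOT taken from the paper.
seq2seq_si, seq2seq_sa = 8.0, 6.0
conventional_si, conventional_sa = 10.0, 8.13

seq2seq_gain = relative_wer_reduction(seq2seq_si, seq2seq_sa)
conventional_gain = relative_wer_reduction(conventional_si, conventional_sa)

print(f"seq2seq:      {100 * seq2seq_gain:.1f}% relative")
print(f"conventional: {100 * conventional_gain:.1f}% relative")
```

A larger relative gain does not by itself say which adapted system has the lower absolute WER; that depends on the two speaker-independent starting points.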

What carries the argument

Kullback-Leibler divergence (KLD) adaptation, which adds a regularization term that keeps the adapted model's output distribution close to the speaker-independent baseline while the model is fine-tuned on speaker data.
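The regularized objective can be sketched for a single output step as follows. This is a rough illustration, assuming the per-step form quoted from the paper's Eq. (7), (1 − β)·CE(y*, p) + β·CE(p^SI, p); the helper names, the toy logits, and the pure-Python softmax are illustrative choices, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, pred, eps=1e-12):
    """Cross-entropy of a predicted distribution against a (soft) target."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))


def kld_adaptation_loss(sa_logits, si_logits, label, beta):
    """Per-step loss: (1 - beta) * CE(one_hot, p_SA) + beta * CE(p_SI, p_SA).

    sa_logits: logits of the speaker-adapted model being trained.
    si_logits: logits of the frozen speaker-independent model.
    beta:      regularization strength in [0, 1] (hypothetical values below).
    """
    p_sa = softmax(sa_logits)
    p_si = softmax(si_logits)
    one_hot = [1.0 if k == label else 0.0 for k in range(len(sa_logits))]
    return (1.0 - beta) * cross_entropy(one_hot, p_sa) + beta * cross_entropy(p_si, p_sa)


# Toy logits over three output symbols; values are made up.
sa_logits = [2.0, 0.5, -1.0]
si_logits = [1.0, 1.0, 0.0]
for beta in (0.0, 0.5, 1.0):
    print(beta, kld_adaptation_loss(sa_logits, si_logits, label=0, beta=beta))
```

With beta = 0 the loss reduces to plain cross-entropy fine-tuning on the speaker data; with beta = 1 the adapted model is only pulled toward the speaker-independent output distribution.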

If this is right

Word error rate of the adapted seq2seq model continues to drop in a log-linear fashion as adaptation data increases to 20 hours per speaker.
Adapting under a minimum word-error-rate criterion and fusing scores with an adapted language model each produce additional performance gains beyond KLD alone.
LHN adaptation provides an alternative mechanism that can also be applied to the seq2seq encoder-decoder stack.
The overall seq2seq pipeline reaches or exceeds the accuracy of a fully adapted conventional ASR system under the tested conditions.
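The LHN idea mentioned in the list above can be sketched as a small speaker-specific affine layer placed after a frozen hidden layer. This is a hedged illustration: the identity initialization, the class name, and the plain SGD update are assumptions for the sketch, and the text here does not specify where in the encoder the paper inserts the layer.

```python
def identity(n):
    """Return an n x n identity matrix as nested lists."""
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]


def matvec(W, v):
    """Multiply matrix W (nested lists) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


class LinearHiddenNetwork:
    """Speaker-specific affine map h -> W h + b after a frozen hidden layer."""

    def __init__(self, dim):
        # Assumption: identity init, so the adapted model starts out
        # identical to the speaker-independent model.
        self.W = identity(dim)
        self.b = [0.0] * dim

    def forward(self, h):
        return [y + b for y, b in zip(matvec(self.W, h), self.b)]

    def sgd_step(self, h, grad_out, lr):
        """One SGD update given the loss gradient w.r.t. this layer's output."""
        for r, g in enumerate(grad_out):
            for c, x in enumerate(h):
                self.W[r][c] -= lr * g * x  # dL/dW[r][c] = g[r] * h[c]
            self.b[r] -= lr * g             # dL/db[r] = g[r]


lhn = LinearHiddenNetwork(3)
h = [0.5, -1.0, 2.0]           # toy hidden activation
print(lhn.forward(h))          # unchanged before any adaptation step
lhn.sgd_step(h, grad_out=[1.0, 0.0, 0.0], lr=0.1)
print(lhn.forward(h))          # first component moves against the gradient
```

Only W and b are updated per speaker, which keeps the number of adapted parameters small compared with fine-tuning the whole network.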

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Production deployments that already favor seq2seq for its end-to-end simplicity could adopt the same KLD procedure to handle speaker variation without maintaining separate acoustic and language model adaptation pipelines.
The log-linear scaling suggests a predictable schedule for collecting adaptation recordings: if WER is linear in the logarithm of the adaptation hours, each doubling of data yields a roughly constant WER reduction.
The same regularization approach could be tested on other sources of mismatch such as accent or channel variation while keeping the core model unchanged.
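One way to make the scaling remark concrete is to fit WER = a + b·log2(hours) by least squares and extrapolate. The sketch below assumes that reading of "log-linear"; the data points are invented for illustration and are not the paper's measurements.

```python
import math


def fit_log_linear(hours, wers):
    """Least-squares fit of WER = a + b * log2(hours); returns (a, b)."""
    xs = [math.log2(h) for h in hours]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(wers) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, wers))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical (hours, WER %) pairs; NOT taken from the paper.
hours = [0.25, 0.5, 1, 2, 4, 8]
wers = [9.0, 8.6, 8.2, 7.8, 7.4, 7.0]

a, b = fit_log_linear(hours, wers)
predicted_16h = a + b * math.log2(16)
print(f"WER change per doubling of data: {b:.2f} points")
print(f"extrapolated WER at 16 h: {predicted_16h:.2f}")
```

Under this model the slope b is the absolute WER change per doubling of adaptation data, so the cost of each further improvement doubles in recording time.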

Load-bearing premise

The speaker-independent seq2seq baseline trained on large dictation data offers a fair and representative starting point for measuring adaptation gains against conventional systems.

What would settle it

A controlled experiment that retrains both the seq2seq and conventional baselines on identical data volumes and evaluates them on the same held-out speaker sets; if the relative WER advantage disappears or reverses, the central claim does not hold.

Figures

Figures reproduced from arXiv: 1907.04916 by Felix Weninger, Jesús Andrés-Ferrer, Puming Zhan, Xinwei Li.

**Figure 1.** Diagram of the speaker-adapted (SA) and speaker-independent (SI) networks; the recoverable labels are the inputs x_1, x_2, …, x_T, hidden states h_1, h_2, …, h_T, transforms U and U′, attention weights α_{i,1}, …, α_{i,T}, decoder states s_{i−1} and s_i, previous output y_{i−1}, and output distribution p_i. Adjacent text from Section 3.3 (mWER training and adaptation): mWER training [27] was introduced as a discriminative training method for seq2seq systems and is similar in spirit to traditional sequence training [28].

**Figure 2.** WER of the speaker adapted seq2seq system (KLD or encoder LHN adaptation) and the speaker adapted conventional ASR system (KLD) with various amounts of adaptation data.

Original abstract

Sequence-to-sequence (seq2seq) based ASR systems have shown state-of-the-art performances while having clear advantages in terms of simplicity. However, comparisons are mostly done on speaker independent (SI) ASR systems, though speaker adapted conventional systems are commonly used in practice for improving robustness to speaker and environment variations. In this paper, we apply speaker adaptation to seq2seq models with the goal of matching the performance of conventional ASR adaptation. Specifically, we investigate Kullback-Leibler divergence (KLD) as well as Linear Hidden Network (LHN) based adaptation for seq2seq ASR, using different amounts (up to 20 hours) of adaptation data per speaker. Our SI models are trained on large amounts of dictation data and achieve state-of-the-art results. We obtained 25% relative word error rate (WER) improvement with KLD adaptation of the seq2seq model vs. 18.7% gain from acoustic model adaptation in the conventional system. We also show that the WER of the seq2seq model decreases log-linearly with the amount of adaptation data. Finally, we analyze adaptation based on the minimum WER criterion and adapting the language model (LM) for score fusion with the speaker adapted seq2seq model, which result in further improvements of the seq2seq system performance.
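The score fusion mentioned at the end of the abstract can be illustrated with a log-linear combination of seq2seq and language-model scores over an n-best list. This is a sketch under assumptions: the abstract does not state the fusion formula or whether it is applied during beam search or in rescoring, and the weight, hypotheses, and scores below are invented.

```python
def fused_score(log_p_seq2seq: float, log_p_lm: float, lm_weight: float) -> float:
    """Log-linear combination of the seq2seq score and the LM score."""
    return log_p_seq2seq + lm_weight * log_p_lm


def rescore(nbest, lm_weight):
    """Pick the best hypothesis from (text, log_p_seq2seq, log_p_lm) tuples."""
    best = max(nbest, key=lambda hyp: fused_score(hyp[1], hyp[2], lm_weight))
    return best[0]


# Hypothetical n-best list with made-up log-probabilities.
nbest = [
    ("recognize speech", -4.0, -6.0),
    ("wreck a nice beach", -3.8, -11.0),
]

print(rescore(nbest, lm_weight=0.0))  # seq2seq score alone decides
print(rescore(nbest, lm_weight=0.3))  # LM evidence can change the ranking
```

In a speaker-adapted setting, the language model supplying `log_p_lm` would itself be adapted to the speaker's text; the weight would normally be tuned on held-out data.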

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

KLD adaptation delivers a 25% relative WER gain on seq2seq ASR versus 18.7% on the conventional system, but the comparison rests on unshown absolute baselines and matched conditions.


The main result is that Kullback-Leibler divergence (KLD) adaptation of the seq2seq model yields a 25% relative WER drop, outpacing the 18.7% gain from acoustic-model adaptation in the conventional pipeline on the same dictation task. The authors also show log-linear WER improvement with up to 20 hours of per-speaker data and test minimum-WER adaptation plus LM score fusion as add-ons. These are the concrete new pieces: a direct head-to-head on seq2seq rather than another conventional-system tweak, plus the scaling plot and the fusion experiment.

The SI baselines are described as trained on large amounts of dictation data and reaching state-of-the-art results, which gives the comparison a plausible anchor. The methods are standard (KLD and LHN) but applied cleanly to the newer architecture, and the paper keeps the focus on practical amounts of adaptation data.

The soft spot is exactly the one flagged in the stress-test note. No absolute SI WER numbers appear in the abstract, and there is no explicit statement that the conventional system used identical training volume, the same speaker test partitions, or the identical adaptation-data regime. Without those, the relative-gain edge cannot be read as a clean demonstration that seq2seq adapts better; it could partly reflect different starting points. The paper would benefit from a table of absolute WERs and a short methods paragraph confirming the controls.

This is a targeted empirical paper for ASR groups already working on adaptation or seq2seq systems. It is not a foundational advance, but the comparison is worth checking in review because the numbers are specific and the setup is reproducible enough to test. I would send it to referees rather than desk-reject.

Referee Report

2 major / 0 minor

Summary. The manuscript investigates speaker adaptation for sequence-to-sequence ASR, applying KLD and LHN adaptation to models trained on large dictation data. It reports a 25% relative WER improvement via KLD adaptation (vs. 18.7% gain from acoustic-model adaptation in a conventional system) using up to 20 h of per-speaker adaptation data, observes log-linear WER reduction with adaptation data volume, and examines minimum-WER adaptation plus LM fusion for additional gains.

Significance. If the relative-gain comparison rests on matched SI baselines, training data scale, test partitions, and adaptation regimes, the result would indicate that seq2seq models can be adapted at least as effectively as conventional systems while retaining architectural simplicity. The log-linear scaling observation would also supply a practical empirical guideline for adaptation-data requirements.

major comments (2)

[Abstract] The headline claim of a 25% relative WER improvement for KLD-adapted seq2seq versus 18.7% for conventional acoustic-model adaptation is presented without absolute SI WER numbers for either system and without any statement that the speaker-independent baselines, training-data volumes, speaker test partitions, or adaptation-data regimes (up to 20 h per speaker) are matched between the seq2seq and conventional pipelines.
[Abstract] No error bars, confidence intervals, data-split details, or statistical-significance tests are supplied for the reported relative improvements, leaving open the possibility that post-hoc experimental choices affect the central comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for identifying areas where the abstract could be strengthened for clarity. We address each major comment below and will update the abstract in the revised manuscript.


Referee: [Abstract] The headline claim of a 25% relative WER improvement for KLD-adapted seq2seq versus 18.7% for conventional acoustic-model adaptation is presented without absolute SI WER numbers for either system and without any statement that the speaker-independent baselines, training-data volumes, speaker test partitions, or adaptation-data regimes (up to 20 h per speaker) are matched between the seq2seq and conventional pipelines.

Authors: We agree that the abstract would benefit from greater transparency. In the revision we will insert the absolute SI WER figures for both the seq2seq and conventional systems and add an explicit statement that the SI baselines were trained on the same large dictation corpus, evaluated on identical test partitions, and that the adaptation data volumes (up to 20 h per speaker) follow the same regime.

Revision: yes

Referee: [Abstract] No error bars, confidence intervals, data-split details, or statistical-significance tests are supplied for the reported relative improvements, leaving open the possibility that post-hoc experimental choices affect the central comparison.

Authors: The experimental section of the manuscript already details the data splits, speaker partitions, and the range of adaptation data volumes examined. To address the abstract-level concern we will add a concise clause noting that the reported gains are observed consistently across multiple adaptation-data sizes and speakers. Formal error bars and significance tests were not part of the original experimental design; we therefore cannot add them without new analysis and mark this point as only partially addressable in revision.

Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical WER comparisons with no derivations or self-referential reductions


The paper reports experimental results on KLD and LHN adaptation for seq2seq ASR, claiming relative WER gains (25% vs. 18.7%) from adaptation data up to 20h per speaker. No equations, derivations, or first-principles claims appear in the abstract or described content. Results are direct empirical measurements on dictation data, with no fitted parameters renamed as predictions, no self-citation load-bearing on uniqueness theorems, and no ansatz or renaming of known results. The comparison of relative gains rests on experimental conditions rather than any tautological reduction to inputs. This matches the default case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model or derivation; the work is an empirical application study. No free parameters, axioms, or invented entities are introduced or required.



Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 4 internal anchors

[1]

Sequence-to-sequence (seq2seq) modeling [4, 5] is a state-of-the-art method for end-to-end ASR, which has shown competitive results compared to traditional ASR systems [6]

Introduction End-to-end ASR systems have recently received increasing attention, due to their ability of integrating all components of an ASR system in a single deep neural network (DNN), which greatly simplifies and unifies the training and decoding process [1, 2, 3]. Sequence-to-sequence (seq2seq) modeling [4, 5] is a state-of-the-art method for end-to-...
[2]

However, relatively few studies so far deal with adaptation of the end-to-end ASR systems

Relation to prior work Many approaches for adapting ‘conventional’ DNN acoustic models have been developed over the years, such as linear transformation based approaches in [10, 11, 12], training with regularization in [13, 14], and various forms of speaker identity vectors in [15, 16, 17]. However, relatively few studies so far deal with adaptation o...
[3]

Listen, Attend, Spell and Adapt: Speaker Adapted Sequence-to-Sequence ASR

Methodology In this work, we use an encoder-decoder architecture with attention similar to Listen-Attend-Spell (LAS) [4], treating end-to-end ASR as a seq2seq learning task: The goal is to predict a sequence $y_i$ of symbols (here, we use sub-word units) from a sequence of acoustic features $x_t$, $t = 1, \ldots, T$, where T i...
[4]

(7) Here, $\beta$ is the regularization strength (KLD relevance), $L_{\mathrm{CE}}$ is the cross-entropy (CE) loss, and $p_i^{\mathrm{SI}}$ is the output distribution of the SI model

is to encourage the output distributions of the SA and SI models to be similar, by minimizing the loss: $L_{\mathrm{KLD}} = \sum_i (1-\beta)\, L_{\mathrm{CE}}(y_i^*, p_i) + \beta\, L_{\mathrm{CE}}(p_i^{\mathrm{SI}}, p_i)$. (7) Here, $\beta$ is the regularization strength (KLD relevance), $L_{\mathrm{CE}}$ is the cross-entropy (CE) loss, and $p_i^{\mathrm{SI}}$ is the output distribution of the SI model. The target $y_i^*$ is represented as a one-hot vector. 3.2. L...
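As a worked step on the loss quoted in this excerpt: assuming the standard definition $L_{\mathrm{CE}}(q, p) = -\sum_k q(k) \log p(k)$ over output symbols $k$, and assuming the sum over $i$ covers both terms, the loss is linear in its target and can be rewritten as a single cross-entropy against an interpolated target. The symbol $\hat{y}_i$ is introduced here for illustration and does not appear in the excerpt.

```latex
\begin{align}
L_{\mathrm{KLD}}
  &= \sum_i \Big[(1-\beta)\,L_{\mathrm{CE}}(y_i^*, p_i)
     + \beta\,L_{\mathrm{CE}}(p_i^{\mathrm{SI}}, p_i)\Big] \\
  &= -\sum_i \sum_k \Big[(1-\beta)\,y_i^*(k)
     + \beta\,p_i^{\mathrm{SI}}(k)\Big] \log p_i(k) \\
  &= \sum_i L_{\mathrm{CE}}(\hat{y}_i, p_i),
  \qquad \hat{y}_i = (1-\beta)\,y_i^* + \beta\,p_i^{\mathrm{SI}}, \\
L_{\mathrm{CE}}(p_i^{\mathrm{SI}}, p_i)
  &= D_{\mathrm{KL}}\!\left(p_i^{\mathrm{SI}} \,\|\, p_i\right)
     + H\!\left(p_i^{\mathrm{SI}}\right).
\end{align}
```

The last line shows why the second term acts as a KL regularizer: the entropy $H(p_i^{\mathrm{SI}})$ does not depend on the adapted parameters, so minimizing the cross-entropy against $p_i^{\mathrm{SI}}$ is equivalent to minimizing the KL divergence from the SI output distribution.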
[5]

Data set We perform our experiments on a dictation data set

Experiments 4.1. Data set We perform our experiments on a dictation data set. All utterances are anonymized field data. The audio is sampled at 8 kHz. A training set of 7.6 k hours from 58 k speakers is used to train SI models. The performance is measured on an evaluation set with 35 speakers (392 k words). The speakers cover various dictation domains wi...
[6]

Adaptation of various parameter subsets We start our evaluation by adapting different parameter subsets of the seq2seq model with the KLD method, using 2 h of adaptation data

Results 5.1. Adaptation of various parameter subsets We start our evaluation by adapting different parameter subsets of the seq2seq model with the KLD method, using 2 h of adaptation data. We conjecture that due to the importance of language modeling for the dictation task, adapting only the decoder should yield a significant gain as well. As can be seen fr...
[7]

Conclusions In this paper, we have presented several effective techniques for adapting seq2seq ASR systems. Significant gains have been achieved even with a few minutes of speech data, and at the same time we have shown that seq2seq ASR systems can exploit larger amounts of adaptation data effectively. Furthermore, we have achieved state-of-the-art perform...
[8]

Acknowledgements We would like to thank Peter Skala and Ming Yang for their help with the baseline ASR adaptation and many helpful discussions
[9]

Sequence Transduction with Recurrent Neural Networks

A. Graves, “Sequence transduction with recurrent neural networks,” arXiv:1211.3711, 2012
[10]

Exploring neural transducers for end-to-end speech recognition,

E. Battenberg, J. Chen, R. Child, A. Coates, Y. G. Y. Li, H. Liu, S. Satheesh, A. Sriram, and Z. Zhu, “Exploring neural transducers for end-to-end speech recognition,” in Proc. of IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). Okinawa, Japan: IEEE, 2017, pp. 206–213
[11]

Advancing acoustic-to-word CTC model,

J. Li, G. Ye, A. Das, R. Zhao, and Y. Gong, “Advancing acoustic-to-word CTC model,” in Proc. of ICASSP. Calgary, Canada: IEEE, 2018, pp. 5794–5798
[12]

Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,

W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Proc. of ICASSP. Shanghai, China: IEEE, 2016, pp. 4960–4964
[13]

Convolutional sequence to sequence learning,

J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, “Convolutional sequence to sequence learning,” in Proc. of 34th International Conference on Machine Learning (ICML). Sydney, Australia: PMLR, 2017, pp. 1243–1252
[14]

State-of-the-art speech recognition with sequence-to-sequence models,

C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in Proc. of ICASSP. Calgary, Canada: IEEE, 2018, pp. 4774–4778
[15]

Speech recognition with deep recurrent neural networks,

A. Graves, A. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013, pp. 6645–6649
[16]

Towards better decoding and language model integration in sequence to sequence models

J. Chorowski and N. Jaitly, “Towards better decoding and language model integration in sequence to sequence models,” arXiv:1612.02695, 2017
[17]

An analysis of incorporating an external language model into a sequence-to-sequence model,

A. Kannan, Y. Wu, P. Nguyen, T. N. Sainath, Z. Chen, and R. Prabhavalkar, “An analysis of incorporating an external language model into a sequence-to-sequence model,” in Proc. of ICASSP. IEEE, 2018, pp. 5824–5828
[18]

Speaker adaptation for hybrid HMM-ANN continuous speech recognition system,

J. Neto, L. Almeida, M. Hochberg, C. Martins, L. Nunes, S. Renals, and T. Robinson, “Speaker adaptation for hybrid HMM-ANN continuous speech recognition system,” in European Conference on Speech Communication and Technology, Madrid, Spain, 1995, pp. 2171–2174
[19]

Linear hidden transformations for adaptation of hybrid ANN/HMM models,

R. Gemello, F. Mana, S. Scanzio, P. Laface, and R. De Mori, “Linear hidden transformations for adaptation of hybrid ANN/HMM models,” Speech Communication, vol. 49, no. 10-11, pp. 827–835, 2007
[20]

Intermediate-layer DNN adaptation for offline and session-based iterative speaker adaptation,

K. Kumar, C. Liu, K. Yao, and Y. Gong, “Intermediate-layer DNN adaptation for offline and session-based iterative speaker adaptation,” in Proc. of INTERSPEECH. Dresden, Germany: ISCA, 2015
[21]

Regularized adaptation of discriminative classifiers,

X. Li and J. Bilmes, “Regularized adaptation of discriminative classifiers,” in Proc. of ICASSP, vol. 1. Toulouse, France: IEEE, 2006, pp. 1–237
[22]

KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition,

D. Yu, K. Yao, H. Su, G. Li, and F. Seide, “KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition,” in Proc. of ICASSP. Vancouver, Canada: IEEE, 2013, pp. 7893–7897
[23]

Speaker adaptation of neural network acoustic models using i-vectors,

G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, “Speaker adaptation of neural network acoustic models using i-vectors,” in Proc. of IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). Olomouc, Czech Republic: IEEE, 2013, pp. 55–59
[24]

X-vectors: Robust DNN embeddings for speaker recognition,

D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in Proc. of ICASSP. Calgary, Canada: IEEE, 2018, pp. 5329–5333
[25]

Sequence summarizing neural network for speaker adaptation,

K. Veselý, S. Watanabe, K. Žmolíková, M. Karafiát, L. Burget, and J. H. Černocký, “Sequence summarizing neural network for speaker adaptation,” in Proc. of ICASSP. Shanghai, China: IEEE, 2016, pp. 5315–5319
[26]

Auxiliary feature based adaptation of end-to-end ASR systems,

M. Delcroix, S. Watanabe, A. Ogawa, S. Karita, and T. Nakatani, “Auxiliary feature based adaptation of end-to-end ASR systems,” in Proc. of INTERSPEECH. Hyderabad, India: ISCA, 2018, pp. 2444–2448
[27]

Speaker adaptation for multichannel end-to-end speech recognition,

T. Ochiai, S. Watanabe, S. Katagiri, T. Hori, and J. Hershey, “Speaker adaptation for multichannel end-to-end speech recognition,” in Proc. of ICASSP. Calgary, Canada: IEEE, 2018, pp. 6707–6711
[28]

Speaker adaptation for end-to-end CTC models,

K. Li, J. Li, Y. Zhao, K. Kumar, and Y. Gong, “Speaker adaptation for end-to-end CTC models,” in Proc. of IEEE Spoken Language Technology Workshop (SLT). Athens, Greece: IEEE, 2018, pp. 542–549
[29]

Recurrent neural network language model adaptation for conversational speech recognition,

K. Li, H. Xu, Y. Wang, D. Povey, and S. Khudanpur, “Recurrent neural network language model adaptation for conversational speech recognition,” in Proc. of INTERSPEECH, Hyderabad, India, 2018, pp. 3373–3377
[30]

A fast and simple algorithm for training neural probabilistic language models,

A. Mnih and Y. W. Teh, “A fast and simple algorithm for training neural probabilistic language models,” in Proceedings of the International Conference on Machine Learning, 2012
[31]

Efficient language model adaptation with noise contrastive estimation and Kullback-Leibler regularization,

J. Andrés-Ferrer, N. Bodenstab, and P. Vozila, “Efficient language model adaptation with noise contrastive estimation and Kullback-Leibler regularization,” in Proc. of INTERSPEECH. Hyderabad, India: ISCA, 2018, pp. 3368–3372
[32]

Deep context: End-to-end contextual speech recognition,

G. Pundak, T. N. Sainath, R. Prabhavalkar, A. Kannan, and D. Zhao, “Deep context: End-to-end contextual speech recognition,” in SLT. IEEE, 2018, pp. 418–425
[33]

Neural machine translation by jointly learning to align and translate,

D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proc. of International Conference on Learning Representations (ICLR). San Diego, CA: open publishing, 2015
[34]

DNN online adaptation for automatic speech recognition,

X. Li, Y. Pan, M. Gibson, and P. Zhan, “DNN online adaptation for automatic speech recognition,” in Proc. of 29th Conference on Electronic Speech Signal Processing (ESSV). Ulm, Germany: TUDpress, 2018
[35]

Minimum word error rate training for attention-based sequence-to-sequence models,

R. Prabhavalkar, T. N. Sainath, Y. Wu, P. Nguyen, Z. Chen, C.-C. Chiu, and A. Kannan, “Minimum word error rate training for attention-based sequence-to-sequence models,” in Proc. of ICASSP. Calgary, Canada: IEEE, 2018, pp. 4839–4843
[36]

Scalable Minimum Bayes Risk training of Deep Neural Network acoustic models using distributed Hessian-free optimization,

B. Kingsbury, T. N. Sainath, and H. Soltau, “Scalable Minimum Bayes Risk training of Deep Neural Network acoustic models using distributed Hessian-free optimization,” in Proc. of INTERSPEECH. Portland, OR: ISCA, 2012
[37]

A comparison of techniques for language model integration in encoder-decoder speech recognition,

S. Toshniwal, A. Kannan, C.-C. Chiu, Y. Wu, T. Sainath, and K. Livescu, “A comparison of techniques for language model integration in encoder-decoder speech recognition,” in Proc. of IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 2018, pp. 369–375
[38]

Dropout: A simple way to prevent neural networks from overfitting,

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014
[39]

Rethinking the inception architecture for computer vision,

C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV: IEEE, 2016, pp. 2818–2826
[40]

Recurrent Neural Network Regularization

W. Zaremba, I. Sutskever, and O. Vinyals, “Recurrent neural network regularization,” 2014. [Online]. Available: http://arxiv.org/abs/1409.2329
[41]

Linearly augmented deep neural network,

P. Ghahremani, J. Droppo, and M. L. Seltzer, “Linearly augmented deep neural network,” in Proc. of ICASSP. Shanghai, China: IEEE, 2016, pp. 5085–5089

[1] [1]

Sequence-to-sequence (seq2seq) modeling [4, 5] is a state-of-the-art method for end-to-end ASR, which has shown competitive results compared to traditional ASR systems [6]

Introduction End-to-end ASR systems have recently received increasing at- tention, due to their ability of integrating all components of an ASR system in a single deep neural network (DNN), which greatly simpliﬁes and uniﬁes the training and decoding process [1, 2, 3]. Sequence-to-sequence (seq2seq) modeling [4, 5] is a state-of-the-art method for end-to-...

work page

[2] [2]

However, relatively few studies so far deal with adaptation of the end-to-end ASR systems

Relation to prior work Many approaches for adapting ‘conventional’ DNN acoustic models have been developed over the years, such as linear trans- formation based approaches in [10, 11, 12], training with regular- ization in [13, 14], and various forms of speaker identity vectors in [15, 16, 17]. However, relatively few studies so far deal with adaptation o...

work page

[3] [3]

Listen, Attend, Spell and Adapt: Speaker Adapted Sequence-to-Sequence ASR

Methodology In this work, we use an encoder-decoder architecture with at- tention similar to Listen-Attend-Spell (LAS) [4], treating end- to-end ASR as a seq2seq learning task: The goal is to predict a sequenceyi of symbols (here, we use sub-word units) from arXiv:1907.04916v1 [eess.AS] 8 Jul 2019 a sequence of acoustic features xt,t = 1,...,T , where T i...

work page internal anchor Pith review Pith/arXiv arXiv 1907

[4] [4]

(7) Here,β is the regularization strength (KLD relevance),LCE is the cross-entropy (CE) loss, andpSI i is the output distribution of the SI model

is to encourage the output distributions of the SA and SI models to be similar, by minimizing the loss: LKLD = ∑ i (1−β)LCE(y∗ i,p i) +βLCE(pSI i ,p i). (7) Here,β is the regularization strength (KLD relevance),LCE is the cross-entropy (CE) loss, andpSI i is the output distribution of the SI model. The targety∗ i is represented as a one-hot vector. 3.2. L...

work page

[5] [5]

Data set We perform our experiments on a dictation data set

Experiments 4.1. Data set We perform our experiments on a dictation data set. All utter- ances are anonymized ﬁeld data. The audio is sampled at 8 kHz. A training set of 7.6 k hours from 58 k speakers is used to train SI models. The performance is measured on an evaluation set with 35 speakers (392 k words). The speakers cover various dictation domains wi...

work page

[6] [6]

Adaptation of various parameter subsets We start our evaluation by adapting different parameter subsets of the seq2seq model with the KLD method, using 2 h of adaptation data

Results 5.1. Adaptation of various parameter subsets We start our evaluation by adapting different parameter subsets of the seq2seq model with the KLD method, using 2 h of adaptation data. We conjecture that due to the importance of language modeling for the dictation task, adapting only thedecoder should yield a signiﬁcant gain as well. As can be seen fr...

work page

[7] [7]

Conclusions In this paper, we have presented several effective techniques for adapting seq2seq ASR systems. Signiﬁcant gains have been achieved even with a few minutes of speech data, and at the same time we have shown that seq2seq ASR systems can exploit larger amounts of adaptation data effectively. Furthermore, we have achieved state-of-the-art perform...

work page

[8] [8]

Acknowledgements We would like to thank Peter Skala and Ming Yang for their help with the baseline ASR adaptation and many helpful discussions

work page

[9] [9]

Sequence Transduction with Recurrent Neural Networks

A. Graves, “Sequence transduction with recurrent neural networks,” arXiv:1211.3711, 2012

work page internal anchor Pith review Pith/arXiv arXiv 2012

[10] [10]

Exploring neural transducers for end-to-end speech recognition,

E. Battenberg, J. Chen, R. Child, A. Coates, Y . G. Y . Li, H. Liu, S. Satheesh, A. Sriram, and Z. Zhu, “Exploring neural transducers for end-to-end speech recognition,” in Proc. of IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). Oki- nawa, Japan: IEEE, 2017, pp. 206–213

work page 2017

[11] [11]

Advancing acoustic- to-word CTC model,

J. Li, G. Ye, A. Das, R. Zhao, and Y . Gong, “Advancing acoustic- to-word CTC model,” in Proc. of ICASSP . Calgary, Canada: IEEE, 2018, pp. 5794–5798

work page 2018

[12] [12]

Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,

W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Proc. of ICASSP. Shanghai, China: IEEE, 2016, pp. 4960–4964

work page 2016

[13] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, “Convolutional sequence to sequence learning,” in Proc. of 34th International Conference on Machine Learning (ICML). Sydney, Australia: PMLR, 2017, pp. 1243–1252.

[14] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in Proc. of ICASSP. Calgary, Canada: IEEE, 2018, pp. 4774–4778.

[15] A. Graves, A. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013, pp. 6645–6649.

[16] J. Chorowski and N. Jaitly, “Towards better decoding and language model integration in sequence to sequence models,” arXiv:1612.02695, 2017.

[17] A. Kannan, Y. Wu, P. Nguyen, T. N. Sainath, Z. Chen, and R. Prabhavalkar, “An analysis of incorporating an external language model into a sequence-to-sequence model,” in Proc. of ICASSP. IEEE, 2018, pp. 5824–5828.

[18] J. Neto, L. Almeida, M. Hochberg, C. Martins, L. Nunes, S. Renals, and T. Robinson, “Speaker adaptation for hybrid HMM-ANN continuous speech recognition system,” in European Conference on Speech Communication and Technology, Madrid, Spain, 1995, pp. 2171–2174.

[19] R. Gemello, F. Mana, S. Scanzio, P. Laface, and R. De Mori, “Linear hidden transformations for adaptation of hybrid ANN/HMM models,” Speech Communication, vol. 49, no. 10-11, pp. 827–835, 2007.

[20] K. Kumar, C. Liu, K. Yao, and Y. Gong, “Intermediate-layer DNN adaptation for offline and session-based iterative speaker adaptation,” in Proc. of INTERSPEECH. Dresden, Germany: ISCA, 2015.

[21] X. Li and J. Bilmes, “Regularized adaptation of discriminative classifiers,” in Proc. of ICASSP, vol. 1. Toulouse, France: IEEE, 2006, pp. 1–237.

[22] D. Yu, K. Yao, H. Su, G. Li, and F. Seide, “KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition,” in Proc. of ICASSP. Vancouver, Canada: IEEE, 2013, pp. 7893–7897.

[23] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, “Speaker adaptation of neural network acoustic models using i-vectors,” in Proc. of IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). Olomouc, Czech Republic: IEEE, 2013, pp. 55–59.

[24] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in Proc. of ICASSP. Calgary, Canada: IEEE, 2018, pp. 5329–5333.

[25] K. Veselý, S. Watanabe, K. Žmolíková, M. Karafiát, L. Burget, and J. H. Černocký, “Sequence summarizing neural network for speaker adaptation,” in Proc. of ICASSP. Shanghai, China: IEEE, 2016, pp. 5315–5319.

[26] M. Delcroix, S. Watanabe, A. Ogawa, S. Karita, and T. Nakatani, “Auxiliary feature based adaptation of end-to-end ASR systems,” in Proc. of INTERSPEECH. Hyderabad, India: ISCA, 2018, pp. 2444–2448.

[27] T. Ochiai, S. Watanabe, S. Katagiri, T. Hori, and J. Hershey, “Speaker adaptation for multichannel end-to-end speech recognition,” in Proc. of ICASSP. Calgary, Canada: IEEE, 2018, pp. 6707–6711.

[28] K. Li, J. Li, Y. Zhao, K. Kumar, and Y. Gong, “Speaker adaptation for end-to-end CTC models,” in Proc. of IEEE Spoken Language Technology Workshop (SLT). Athens, Greece: IEEE, 2018, pp. 542–549.

[29] K. Li, H. Xu, Y. Wang, D. Povey, and S. Khudanpur, “Recurrent neural network language model adaptation for conversational speech recognition,” in Proc. of INTERSPEECH, Hyderabad, India, 2018, pp. 3373–3377.

[30] A. Mnih and Y. W. Teh, “A fast and simple algorithm for training neural probabilistic language models,” in Proceedings of the International Conference on Machine Learning, 2012.

[31] J. Andrés-Ferrer, N. Bodenstab, and P. Vozila, “Efficient language model adaptation with noise contrastive estimation and Kullback-Leibler regularization,” in Proc. of INTERSPEECH. Hyderabad, India: ISCA, 2018, pp. 3368–3372.

[32] G. Pundak, T. N. Sainath, R. Prabhavalkar, A. Kannan, and D. Zhao, “Deep context: End-to-end contextual speech recognition,” in SLT. IEEE, 2018, pp. 418–425.

[33] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proc. of International Conference on Learning Representations (ICLR). San Diego, CA: open publishing, 2015.

[34] X. Li, Y. Pan, M. Gibson, and P. Zhan, “DNN online adaptation for automatic speech recognition,” in Proc. of 29th Conference on Electronic Speech Signal Processing (ESSV). Ulm, Germany: TUDpress, 2018.

[35] R. Prabhavalkar, T. N. Sainath, Y. Wu, P. Nguyen, Z. Chen, C.-C. Chiu, and A. Kannan, “Minimum word error rate training for attention-based sequence-to-sequence models,” in Proc. of ICASSP. Calgary, Canada: IEEE, 2018, pp. 4839–4843.

[36] B. Kingsbury, T. N. Sainath, and H. Soltau, “Scalable Minimum Bayes Risk training of Deep Neural Network acoustic models using distributed Hessian-free optimization,” in Proc. of INTERSPEECH. Portland, OR: ISCA, 2012.

[37] S. Toshniwal, A. Kannan, C.-C. Chiu, Y. Wu, T. Sainath, and K. Livescu, “A comparison of techniques for language model integration in encoder-decoder speech recognition,” in Proc. of IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 2018, pp. 369–375.

[38] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[39] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV: IEEE, 2016, pp. 2818–2826.

[40] W. Zaremba, I. Sutskever, and O. Vinyals, “Recurrent neural network regularization,” 2014. [Online]. Available: http://arxiv.org/abs/1409.2329

[41] P. Ghahremani, J. Droppo, and M. L. Seltzer, “Linearly augmented deep neural network,” in Proc. of ICASSP. Shanghai, China: IEEE, 2016, pp. 5085–5089.