Phoneme-Based Contextualization for Cross-Lingual Speech Recognition in End-to-End Models

Antoine Bruguier; Golan Pundak; Ke Hu; Rohit Prabhavalkar; Tara N. Sainath

arxiv: 1906.09292 · v3 · pith:4KI3BRD2new · submitted 2019-06-21 · 💻 cs.CL · cs.SD · eess.AS

Ke Hu, Antoine Bruguier, Tara N. Sainath, Rohit Prabhavalkar, Golan Pundak

Pith reviewed 2026-05-25 18:40 UTC · model grok-4.3

classification 💻 cs.CL · cs.SD · eess.AS

keywords contextual ASR · phoneme modeling · end-to-end speech recognition · cross-lingual biasing · named entity recognition · OOV words · foreign place names

The pith

An E2E speech model mixes wordpieces with phonemes and biases foreign names by mapping their sounds to English phonemes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an end-to-end automatic speech recognition model whose output vocabulary contains both English wordpieces and phonemes. It applies contextual biasing to foreign proper nouns by converting those words' pronunciations into sequences of similar English phonemes. Experiments focus on a task of recognizing unseen geographic place names from other languages. The phoneme-level approach yields measured gains over grapheme-only and wordpiece-only biasing baselines while keeping regular English performance nearly unchanged.

Core claim

The paper shows that performing contextual biasing at the phoneme level inside a joint wordpiece-phoneme E2E model produces a 16 percent relative improvement over a grapheme-only biasing baseline and an 8 percent improvement over a wordpiece-only baseline on a foreign place-name recognition task, accompanied by only slight degradation on standard English test sets.
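Read as relative WER reductions (an assumption consistent with the wording above), the two headline figures follow the usual definition. The baseline value of 10.0% in the second line is purely hypothetical, chosen only to make the arithmetic visible; it is not a number taken from the paper.

```latex
\[
\Delta_{\text{rel}}
  = \frac{\mathrm{WER}_{\text{base}} - \mathrm{WER}_{\text{new}}}{\mathrm{WER}_{\text{base}}}
\]
% Hypothetical illustration only: a 16% relative gain on an assumed 10.0% baseline.
\[
\mathrm{WER}_{\text{base}} = 10.0\%,\quad \Delta_{\text{rel}} = 0.16
  \;\Rightarrow\;
  \mathrm{WER}_{\text{new}} = 10.0\% \times (1 - 0.16) = 8.4\%
\]
```

Under this reading, a 16% relative gain is a 1.6-point absolute reduction only if the baseline happens to be 10%; on a different baseline the absolute change would differ.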

What carries the argument

Phoneme-level contextual biasing inside a mixed wordpiece-phoneme output space, achieved by mapping foreign-word pronunciations to the closest English phoneme sequences.
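A minimal sketch of the mapping idea, using the “Créteil” example that appears in Figure 1. The table `FR_TO_EN` and the French X-SAMPA input sequence are assumptions made for illustration; the excerpted Section 2.2 describes a mapping learned by aligning pronunciations, not a hand-written lookup like this one.

```python
# Hypothetical French -> English phoneme table in X-SAMPA (illustrative only;
# not the mapping used in the paper).
FR_TO_EN = {
    "R": "r\\",  # French uvular r -> English approximant r\
    "e": "E",    # close-mid e -> nearest English vowel, assumed to be E
}


def map_to_english(phonemes, table=FR_TO_EN):
    """Replace each source phoneme by its English stand-in.

    Phonemes missing from the table are assumed to exist in English
    already and are passed through unchanged.
    """
    return [table.get(p, p) for p in phonemes]


# Assumed French pronunciation of "Créteil" in X-SAMPA.
french = "k R e t E j".split()
english = map_to_english(french)
print(" ".join(english))
```

With these assumed entries, the result matches the English phoneme sequence quoted in the Figure 1 caption. A lookup table like this cannot express one-to-many or context-dependent substitutions, which is one place where the load-bearing premise below could fail.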

If this is right

Contextual lists containing foreign names become usable without expanding the training data.
Rare named entities remain recognizable even when the beam-search decoder keeps only a small number of candidates.
The same model can serve both standard English dictation and cross-lingual name biasing without separate systems.
Acoustic salience of phonemes helps spelling of out-of-vocabulary words that grapheme or wordpiece units miss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be tested on other language pairs where the target phoneme inventory is a superset of English phonemes.
Combining the approach with multilingual pre-training might reduce the slight English degradation further.
The same mapping step could be applied to other contextual lists such as song titles or contact names that contain foreign words.

Load-bearing premise

Pronunciations of foreign words can be mapped to English phoneme sequences without introducing errors that cancel out the benefit of the biasing step.

What would settle it

Measure recognition accuracy on a held-out set of foreign place names whose pronunciations have no close match in the English phoneme inventory; if the reported gains vanish, the mapping premise does not hold.

Figures

Figures reproduced from arXiv: 1906.09292 by Antoine Bruguier, Golan Pundak, Ke Hu, Rohit Prabhavalkar, Tara N. Sainath.

**Figure 1.** Contextual FST for the word “Créteil” using a sequence of English phonemes “k r\ E t E j”. Surrounding text (Section 3.3, Decoding Graph): “To generate words as outputs, we search through a decoding graph similar to [16] but accept both phonemes and wordpieces.”
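The contextual FST in Figure 1 can be pictured as a linear chain of states, one arc per phoneme. The sketch below is a plain-Python stand-in, not the paper's implementation: the `Arc` type, the helper names, and the per-arc bonus of 1.0 are all hypothetical, and no FST library is used.

```python
from typing import NamedTuple, Optional, Tuple


class Arc(NamedTuple):
    src: int
    dst: int
    label: str     # phoneme consumed on this arc
    weight: float  # score bonus for matching it (hypothetical value)


def build_linear_fst(phonemes, bonus=1.0):
    """State i means 'the first i phonemes matched'; arc i consumes phonemes[i]."""
    return [Arc(i, i + 1, p, bonus) for i, p in enumerate(phonemes)]


def follow(arcs, state, label) -> Optional[Tuple[int, float]]:
    """Return (next_state, bonus) if an arc leaves `state` on `label`, else None."""
    for arc in arcs:
        if arc.src == state and arc.label == label:
            return arc.dst, arc.weight
    return None


# English phoneme sequence for "Créteil", as quoted in the figure caption.
creteil = "k r\\ E t E j".split()
fst = build_linear_fst(creteil)

# Walk the full pronunciation and accumulate the biasing bonus.
state, total = 0, 0.0
for phoneme in creteil:
    state, weight = follow(fst, state, phoneme)
    total += weight
```

In this toy version a hypothesis that matches the whole pronunciation ends in the last state with the full bonus, while a phoneme that does not match simply has no arc to follow. How partial matches are penalised or rolled back is a design decision this sketch leaves out.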

**Figure 2.** Decoding graph for the words “crèche” (daycare) with English cross-lingual pronunciation “k r\ E S” and “créteil” (a city) with pronunciation “k r\ E t E j”. For clarity, we omitted most wordpieces for the state 0. Surrounding text: “Based on [16], we add two improvements to the decoding strategy. First, during decoding we consume as many input epsilon arcs as possible thus guaranteeing that all wordpieces in word are produ…”
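The two pronunciations in Figure 2 share the prefix “k r\ E”, so the phoneme side of the decoding graph behaves like a prefix tree. The following sketch shows only that sharing, under simplifying assumptions: it emits whole words at the end of a pronunciation, whereas the paper's graph emits wordpieces, and the dictionary-based node layout is hypothetical.

```python
def build_trie(lexicon):
    """Build a phoneme prefix tree; each node lists the words that end there."""
    root = {"children": {}, "words": []}
    for word, pronunciation in lexicon.items():
        node = root
        for phoneme in pronunciation.split():
            node = node["children"].setdefault(
                phoneme, {"children": {}, "words": []}
            )
        node["words"].append(word)
    return root


def walk(trie, phonemes):
    """Follow a phoneme sequence; return the node reached, or None if it leaves the trie."""
    node = trie
    for phoneme in phonemes:
        node = node["children"].get(phoneme)
        if node is None:
            return None
    return node


# Pronunciations as quoted in the figure caption.
LEXICON = {
    "crèche": "k r\\ E S",
    "créteil": "k r\\ E t E j",
}
trie = build_trie(LEXICON)

shared = walk(trie, "k r\\ E".split())
branches = sorted(shared["children"])        # phonemes that can follow the shared prefix
creche = walk(trie, "k r\\ E S".split())["words"]
```

After the shared prefix the tree branches on “S” and “t”, and a word is produced only once a complete pronunciation has been consumed.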

**Figure 3.** WER (%) as a function of the number of biasing words. Appears in Section 4.4 (Effect of Number of Biasing Words).

Original abstract

Contextual automatic speech recognition, i.e., biasing recognition towards a given context (e.g. user's playlists, or contacts), is challenging in end-to-end (E2E) models. Such models maintain a limited number of candidates during beam-search decoding, and have been found to recognize rare named entities poorly. The problem is exacerbated when biasing towards proper nouns in foreign languages, e.g., geographic location names, which are virtually unseen in training and are thus out-of-vocabulary (OOV). While grapheme or wordpiece E2E models might have a difficult time spelling OOV words, phonemes are more acoustically salient and past work has shown that E2E phoneme models can better predict such words. In this work, we propose an E2E model containing both English wordpieces and phonemes in the modeling space, and perform contextual biasing of foreign words at the phoneme level by mapping pronunciations of foreign words into similar English phonemes. In experimental evaluations, we find that the proposed approach performs 16% better than a grapheme-only biasing model, and 8% better than a wordpiece-only biasing model on a foreign place name recognition task, with only slight degradation on regular English tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper mixes phonemes into the E2E output space and maps foreign pronunciations to English phonemes for biasing, which yields reported gains on foreign names but leaves the mapping's contribution to those gains unmeasured.

The main point is that this work gives a workable way to improve contextual biasing for foreign proper nouns in E2E ASR. The model keeps both wordpieces and phonemes in the output vocabulary, then projects pronunciations of unseen foreign words onto the English phoneme set so that biasing can operate at the phoneme level rather than the grapheme or wordpiece level. That combination is the concrete addition over earlier phoneme or biasing papers. The abstract reports 16% relative improvement over a grapheme-only biasing baseline and 8% over a wordpiece-only one on a foreign place-name task, with only small loss on ordinary English. Those numbers suggest the approach can help in the settings they tested.

The soft spot is the mapping step itself. The abstract describes the projection but gives no separate accuracy figure for it, no ablation that removes the mapping, and no direct comparison of the same foreign names under grapheme biasing versus mapped-phoneme biasing. If the mapping introduces substitutions or insertions that the acoustic model cannot fully recover, the net gain could be smaller than claimed. The stress-test note is right on this point. The abstract also omits dataset sizes, error bars, and any statistical tests, so the robustness of the result is hard to judge from the text alone.

This is the sort of incremental modeling tweak that matters for people shipping multilingual voice products. A reader who works on production E2E systems or on OOV handling would find the full experiments useful to examine. I would send it to peer review so the mapping details and any additional controls can be checked.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes an end-to-end ASR model whose output vocabulary contains both English wordpieces and phonemes. Contextual biasing for foreign proper nouns (e.g., place names) is performed at the phoneme level after mapping foreign pronunciations onto the closest English phonemes. The abstract states that this yields a 16% relative improvement over grapheme-only biasing and an 8% relative improvement over wordpiece-only biasing on a foreign place-name recognition task, with only slight degradation on standard English tasks.

Significance. If the reported gains prove robust once the mapping step is isolated and validated, the hybrid wordpiece-plus-phoneme modeling space could provide a practical route to better OOV handling for cross-lingual named entities inside E2E systems. The design directly exploits the acoustic salience of phonemes for rare words while retaining subword units for in-vocabulary English.

major comments (1)

[Abstract] The headline 16% and 8% relative gains rest on the foreign-to-English phoneme mapping step. No error rate or substitution statistics for the mapping are supplied, no ablation isolates the mapping from the rest of the biasing pipeline, and no head-to-head comparison of the identical foreign names under grapheme versus mapped-phoneme biasing is reported. Without these quantities the net improvement cannot be separated from possible mapping-induced insertion or substitution errors.

minor comments (1)

[Abstract] The abstract would be strengthened by stating the size of the foreign-name test set, whether error bars or statistical tests accompany the relative gains, and the exact foreign-language source of the place names.
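One way the requested error bars could be produced is a bootstrap over test utterances. The sketch below is an assumption about how such a check might look, not something the paper reports: the per-utterance error counts and reference lengths are made-up toy values, and the function names are hypothetical.

```python
import random


def wer(errors, ref_lengths):
    """Corpus-level WER: total word errors over total reference words."""
    return sum(errors) / sum(ref_lengths)


def relative_gain(base_err, new_err, ref_len):
    """Relative WER reduction of the new system over the baseline."""
    base = wer(base_err, ref_len)
    return (base - wer(new_err, ref_len)) / base


def bootstrap_ci(base_err, new_err, ref_len, n_resamples=1000, seed=0):
    """Percentile 95% interval for the relative gain, resampling utterances."""
    rng = random.Random(seed)
    n = len(ref_len)
    gains = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        base_sample = [base_err[i] for i in idx]
        if sum(base_sample) == 0:
            continue  # relative gain is undefined when the baseline makes no errors
        gains.append(
            relative_gain(
                base_sample,
                [new_err[i] for i in idx],
                [ref_len[i] for i in idx],
            )
        )
    gains.sort()
    low = gains[int(0.025 * len(gains))]
    high = gains[int(0.975 * len(gains)) - 1]
    return low, high


# Toy data (hypothetical): word errors per utterance for two systems.
BASE_ERR = [1, 0, 2, 1, 0, 1, 3, 0, 1, 1]
NEW_ERR = [1, 0, 1, 1, 0, 1, 2, 0, 1, 1]
REF_LEN = [4] * 10

point = relative_gain(BASE_ERR, NEW_ERR, REF_LEN)
ci_low, ci_high = bootstrap_ci(BASE_ERR, NEW_ERR, REF_LEN)
print(f"relative gain {point:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

Resampling whole utterances keeps the two systems paired on the same audio, which is the point of the comparison; with a test set as small as this toy one, the interval would be too wide to support any claim.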

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and for highlighting the need to better isolate the contribution of the phoneme mapping. We address the single major comment below and will revise the manuscript to incorporate the requested analyses.

Referee: [Abstract] The headline 16% and 8% relative gains rest on the foreign-to-English phoneme mapping step. No error rate or substitution statistics for the mapping are supplied, no ablation isolates the mapping from the rest of the biasing pipeline, and no head-to-head comparison of the identical foreign names under grapheme versus mapped-phoneme biasing is reported. Without these quantities the net improvement cannot be separated from possible mapping-induced insertion or substitution errors.

Authors: We agree that the reported gains cannot be fully attributed to the hybrid modeling approach without quantifying the mapping step. The original submission does not contain error statistics for the foreign-to-English phoneme mapping, an ablation that removes the mapping, or a direct grapheme-versus-mapped-phoneme comparison on the identical foreign names. In the revision we will add these three elements: (1) word error and substitution rates for the mapping on the foreign-place-name test set, (2) an ablation that runs the full biasing pipeline with and without the mapping, and (3) a side-by-side evaluation of the same foreign names under grapheme-only biasing versus mapped-phoneme biasing. These additions will allow readers to separate mapping-induced errors from the gains of the hybrid wordpiece-plus-phoneme space.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results stand on experimental outcomes

The paper reports relative gains (16% over grapheme biasing, 8% over wordpiece) from end-to-end model experiments on foreign place-name recognition. No equations, fitted parameters, or derivation steps appear in the abstract or described methodology. The phoneme-mapping step is presented as an implementation choice whose net benefit is measured empirically rather than derived by construction from prior self-citations or definitions. No load-bearing self-citation chain or renaming of known results is invoked to justify the central claim.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no free parameters, axioms, or invented entities; all modeling choices are described at a high level without explicit assumptions listed.

pith-pipeline@v0.9.0 · 5772 in / 1039 out tokens · 17041 ms · 2026-05-25T18:40:55.028476+00:00 · methodology

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 6 internal anchors

[1]

Phoneme-Based Contextualization for Cross-Lingual Speech Recognition in End-to-End Models

Introduction End-to-end (E2E) models have attracted increasing attention recently. Instead of building an automatic speech recognition (ASR) system from different components such as the acoustic model (AM), language model (LM), and pronunciation model (PM), E2E models rely on a single neural network to directly learn speech-to-text mapping. Representative...

work page internal anchor Pith review Pith/arXiv arXiv 1906
[2]

Shallow Fusion E2E Biasing Shallow fusion has been used in E2E models for decoding [10] and contextual biasing [6]

Prior Work 2.1. Shallow Fusion E2E Biasing Shallow fusion has been used in E2E models for decoding [10] and contextual biasing [6]. Biasing phrases are first represented as n-gram WFST in the word level (G), and then left composed with a “speller” FST (S) to produce a contextual LM: C = min(det(S ∘ G)). The speller transduces a sequence of subword units ...

work page
[3]

also proposed to only activate biasing phrases when they are preceded by a set of prefixes. 2.2. Phoneme Mapping Cross-lingual phoneme mapping has been used in conventional systems for recognizing foreign words [15]. First, a phoneme mapping is learned by aligning the pronunciations between foreign and target languages using TTS-synthesized audio and a ...

work page
[4]

Créteil", we tokenize it into phonemes using the French pronunciation lexicon, i.e. “Créteil

Phoneme-Based Biasing The focus of this work is to bias toward rare cross-lingual words which are typically missing from the training set. We propose to do that by utilizing phonemes, which are not affected by orthography. Specifically, we augment the wordpiece modeling space of an E2E model with phonemes to train a wordpiece-phoneme model. 3.1. Wordpiec...

work page
[5]

directions to Créteil

Experiments 4.1. Data Sets Our training set contains 35 million English utterances with a total of around 27,500 hours. These utterances are sampled from Google’s general English traffic, and are anonymized and hand-transcribed for training. To increase training diversity, clean utterances are artificially corrupted by using a room simulator, varying degr...

work page
[6]

Biasing at the phoneme level enables us to avoid the OOV problem in the wordpiece model

Conclusion In this work we proposed a wordpiece-phoneme RNN-T model and phoneme-level contextual biasing to recognize foreign words. Biasing at the phoneme level enables us to avoid the OOV problem in the wordpiece model. Evaluating on a test set containing navigation queries to French place names, we show the proposed approach performs significantly bette...

work page
[7]

Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition

H. Soltau, H. Liao, and H. Sak, “Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech recognition,” arXiv preprint arXiv:1610.09975, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[8]

Sequence Transduction with Recurrent Neural Networks

A. Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012

work page internal anchor Pith review Pith/arXiv arXiv 2012
[9]

Exploring architectures, data and units for streaming end-to-end speech recognition with rnn-transducer,

K. Rao, H. Sak, and R. Prabhavalkar, “Exploring architectures, data and units for streaming end-to-end speech recognition with rnn-transducer,” in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017, pp. 193–199

work page 2017
[10]

Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,

W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 4960–4964

work page 2016
[11]

State-of-the-art speech recognition with sequence-to-sequence models,

C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4774–4778

work page 2018
[12]

Streaming end-to-end speech recognition for mobile devices,

Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang et al., “Streaming end-to-end speech recognition for mobile devices,” ICASSP,

work page
[13]

Streaming End-to-end Speech Recognition For Mobile Devices

[Online]. Available: https://arxiv.org/pdf/1811.06621.pdf

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Bringing contextual information to Google speech recognition,

P. Aleksic, M. Ghodsi, A. Michaely, C. Allauzen, K. Hall, B. Roark, D. Rybach, and P. Moreno, “Bringing contextual information to Google speech recognition,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015

work page 2015
[15]

Composition-based on-the-fly rescoring for salient n-gram biasing,

K. Hall, E. Cho, C. Allauzen, F. Beaufays, N. Coccaro, K. Nakajima, M. Riley, B. Roark, D. Rybach, and L. Zhang, “Composition-based on-the-fly rescoring for salient n-gram biasing,” 2015

work page 2015
[16]

Contextual speech recognition in end-to-end neural network systems using beam search,

I. Williams, A. Kannan, P. Aleksic, D. Rybach, and T. N. Sainath, “Contextual speech recognition in end-to-end neural network systems using beam search,” Proc. Interspeech 2018, pp. 2227–2231, 2018

work page 2018
[17]

An analysis of incorporating an external language model into a sequence-to-sequence model,

A. Kannan, Y. Wu, P. Nguyen, T. N. Sainath, Z. Chen, and R. Prabhavalkar, “An analysis of incorporating an external language model into a sequence-to-sequence model,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 1–5828

work page 2018
[18]

Deep context: end-to-end contextual speech recognition

G. Pundak, T. N. Sainath, R. Prabhavalkar, A. Kannan, and D. Zhao, “Deep context: end-to-end contextual speech recognition,” arXiv preprint arXiv:1808.02480, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[19]

Japanese and korean voice search,

M. Schuster and K. Nakajima, “Japanese and korean voice search,” in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2012, pp. 5149–5152

work page 2012
[20]

Shallow-fusion end-to-end contextual biasing,

D. Zhao, T. N. Sainath, D. Rybach, D. Bhatia, B. Li, and R. Pang, “Shallow-fusion end-to-end contextual biasing,” To appear in Interspeech 2019, 2019

work page 2019
[21]

Phoebe: Pronunciation-aware contextualization for end-to-end speech recognition,

A. Bruguier, R. Prabhavalkar, G. Pundak, and T. N. Sainath, “Phoebe: Pronunciation-aware contextualization for end-to-end speech recognition,” in to appear in Proc. ICASSP , 2019. IEEE

work page 2019
[22]

Cross-lingual phoneme mapping for language robust contextual speech recognition,

A. Patel, D. Li, E. Cho, and P. Aleksic, “Cross-lingual phoneme mapping for language robust contextual speech recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5924–5928

work page 2018
[23]

No need for a lexicon? Evaluating the value of the pronunciation lexica in end-to-end models,

T. N. Sainath, R. Prabhavalkar, S. Kumar, S. Lee, A. Kannan, D. Rybach, V. Schogol, P. Nguyen, B. Li, Y. Wu et al., “No need for a lexicon? Evaluating the value of the pronunciation lexica in end-to-end models,” ICASSP, 2017

work page 2017
[24]

Model unit exploration for sequence-to-sequence speech recognition,

K. Irie, R. Prabhavalkar, A. Kannan, A. Bruguier, D. Rybach, and P. Nguyen, “Model unit exploration for sequence-to-sequence speech recognition,” arXiv preprint arXiv:1902.01955, 2019

work page arXiv 1902
[25]

Pronunciation learning with RNN-transducers,

A. Bruguier, D. Gnanapragasam, L. Johnson, K. Rao, and F. Beaufays, “Pronunciation learning with RNN-transducers,” Proc. Interspeech 2017, pp. 2556–2560, 2017

work page 2017
[26]

Generation of large-scale simulated utterances in virtual rooms to train deep-neural networks for far-field speech recognition in google home,

C. Kim, A. Misra, K. Chin, T. Hughes, A. Narayanan, T. Sainath, and M. Bacchiani, “Generation of large-scale simulated utterances in virtual rooms to train deep-neural networks for far-field speech recognition in google home,” 2017

work page 2017
[27]

Parallel WaveNet: Fast High-Fidelity Speech Synthesis

A. v. d. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. v. d. Driessche, E. Lockhart, L. C. Cobo, F. Stimberg et al., “Parallel wavenet: Fast high-fidelity speech synthesis,” arXiv preprint arXiv:1711.10433, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[28]

Long short-term memory,

S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997

work page 1997
[29]

Tensorflow: A system for large-scale machine learning,

M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283

work page 2016
