Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning

Andrew Rosenberg; Bhuvana Ramabhadran; Heiga Zen; RJ Skerry-Ryan; Ron J. Weiss; Ye Jia; Yonghui Wu; Yu Zhang; Zhifeng Chen

arXiv: 1907.04448 · v2 · pith:E3HEUD7Gnew · submitted 2019-07-09 · 💻 cs.CL · cs.SD · eess.AS

Pith reviewed 2026-05-25 00:07 UTC · model grok-4.3

classification 💻 cs.CL · cs.SD · eess.AS

keywords multilingual TTS · cross-language voice cloning · Tacotron · adversarial disentanglement · phonemic input · speech synthesis · voice transfer

The pith

A Tacotron model transfers an English speaker's voice to fluent Spanish or Mandarin speech without any bilingual or parallel training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a multispeaker multilingual text-to-speech model can produce high-quality speech in several languages while also cloning voices across those languages. The transfer works even between unrelated languages such as English and Mandarin and requires no paired bilingual recordings. Success depends on feeding the model phoneme sequences rather than language-specific text and adding an adversarial term that forces the network to separate speaker identity from linguistic content. Once trained on multiple speakers per language plus an autoencoding path, the same model yields intelligible output for every training speaker in every language, either in a native accent or a foreign one.

Core claim

The model is able to transfer voices across languages, e.g. synthesize fluent Spanish speech using an English speaker's voice, without training on any bilingual or parallel examples. Such transfer works across distantly related languages, e.g. English and Mandarin. Critical to achieving this result are using a phonemic input representation to encourage sharing of model capacity across languages and incorporating an adversarial loss term to encourage the model to disentangle its representation of speaker identity from the speech content. Further scaling up the model by training on multiple speakers of each language and incorporating an autoencoding input results in a model which can be used,

What carries the argument

Phonemic input representation combined with an adversarial loss that disentangles speaker identity from language content inside a Tacotron architecture.
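The adversarial half of this mechanism can be illustrated with a gradient-reversal layer in the style of domain-adversarial training. The sketch below is a minimal NumPy illustration, not the paper's implementation: the linear speaker classifier, the function names, and the `scale=0.5` default are all assumptions made for the example. The classifier receives the ordinary gradient of its cross-entropy loss, while the encoding it reads receives that gradient with the sign flipped and scaled.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def classifier_loss(encoding, weights, speaker_ids):
    """Mean cross-entropy of a (hypothetical) linear speaker classifier."""
    probs = softmax(encoding @ weights)
    rows = np.arange(encoding.shape[0])
    return float(-np.log(probs[rows, speaker_ids]).mean())


def adversarial_grads(encoding, weights, speaker_ids, scale=0.5):
    """Gradients for a classifier placed behind a gradient-reversal layer.

    Returns (grad_weights, grad_encoding). grad_weights is the true gradient,
    so the classifier keeps learning to predict the speaker. grad_encoding is
    the true gradient multiplied by -scale, so a descent step on the encoding
    makes the speaker harder to predict. `scale` is an illustrative value.
    """
    n = encoding.shape[0]
    rows = np.arange(n)
    dlogits = softmax(encoding @ weights)
    dlogits[rows, speaker_ids] -= 1.0
    dlogits /= n
    grad_weights = encoding.T @ dlogits
    grad_encoding = -scale * (dlogits @ weights.T)  # reversed and scaled
    return grad_weights, grad_encoding
```

A descent step on the classifier weights lowers the classification loss, while the same kind of step on the encoding raises it. That opposition is what pushes speaker information out of the encoding.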

If this is right

The model produces intelligible speech for every training speaker in every language seen during training.
Output can be generated in either a native accent or a foreign accent for the same speaker.
No parallel or bilingual recordings are required for cross-language voice cloning.
Capacity is shared across languages through phoneme-level inputs rather than language-specific text.
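The capacity-sharing point can be made concrete with a toy shared phoneme vocabulary. This is a sketch under stated assumptions: the two inventories below are tiny, hypothetical examples rather than the paper's phoneme set, and `build_shared_vocab` and `encode` are illustrative names. A phoneme that occurs in several languages maps to a single ID, so any embedding learned for that ID is reused by every language that contains it.

```python
def build_shared_vocab(inventories):
    """Map every phoneme seen in any language to one shared integer ID."""
    symbols = sorted({p for phones in inventories.values() for p in phones})
    return {symbol: index for index, symbol in enumerate(symbols)}


def encode(phonemes, vocab):
    """Turn a phoneme sequence into shared IDs, independent of language."""
    return [vocab[p] for p in phonemes]


# Toy, hypothetical inventories; real phoneme sets are much larger.
INVENTORIES = {
    "en": ["k", "a", "s", "t", "ð"],
    "es": ["k", "a", "s", "ɲ"],
}
VOCAB = build_shared_vocab(INVENTORIES)
```

With language-specific text input, the English and Spanish spellings of similar sounds would occupy unrelated input symbols. Here `k`, `a`, and `s` occupy the same three IDs in both languages.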

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation technique could be tested on languages with very small speaker counts to check whether the disentanglement still holds when data is scarce.
If the adversarial term is removed, the model would be expected to collapse speaker and language into a single representation and lose the ability to clone voices across languages.
Extending the phoneme inventory to cover additional languages should allow the same model to add new languages without retraining the entire network from scratch.

Load-bearing premise

The adversarial loss can separate speaker identity from language even though every speaker in the training data speaks only one language.

What would settle it

Measure whether listeners can still identify the original speaker when the model produces the same text in a second language; if identification accuracy drops to chance, the disentanglement failed.
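One way to operationalize "drops to chance" is an exact one-sided binomial test on the listeners' answers. The sketch below uses only the standard library. The function name and the idea of scoring each trial as correct or incorrect are assumptions for illustration, and the `chance` level depends on how many candidate speakers each listener chooses between.

```python
from math import comb


def p_value_above_chance(correct, trials, chance):
    """One-sided exact binomial p-value: P(X >= correct) under pure guessing."""
    return sum(
        comb(trials, k) * chance**k * (1.0 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )
```

A small p-value indicates identification well above chance, which would count as evidence that the voice survived the language switch. A large one would be consistent with the failure case described above.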

Figures

Figures reproduced from arXiv: 1907.04448 by Andrew Rosenberg, Bhuvana Ramabhadran, Heiga Zen, RJ Skerry-Ryan, Ron J. Weiss, Ye Jia, Yonghui Wu, Yu Zhang, Zhifeng Chen.

**Figure 1.** Overview of the components of the proposed model. Dashed lines denote sampling via reparameterization [21] during training. The prior mean is always used during inference.
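The sampling behaviour described in the Figure 1 caption can be sketched as follows. This is a minimal illustration that assumes a diagonal Gaussian posterior parameterized by `mean` and `log_var` and a zero-mean prior (the paper excerpt quoted in the reference graph describes the prior mean as all zeros). The function name and signature are hypothetical.

```python
import numpy as np


def latent_for_decoder(mean, log_var, rng, training):
    """Pick the latent vector that is passed on to the decoder.

    Training: reparameterized sample, mean + std * noise, which keeps the
    sample differentiable with respect to mean and log_var.
    Inference: the prior mean, assumed here to be a vector of zeros.
    """
    if training:
        noise = rng.standard_normal(mean.shape)
        return mean + np.exp(0.5 * log_var) * noise
    return np.zeros_like(mean)
```

Using the prior mean at inference removes any dependence on a reference utterance, so the same latent is fed to the decoder for every speaker and language.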

**Figure 2.** ... and the demo for accent transfer audio examples. We see that cloning the CN voice to other languages (bottom row) has the lowest similarity MOS, although the scores are still much higher than different-speaker similarity MOS in the off-diagonals of ...

Original abstract

We present a multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that is able to produce high quality speech in multiple languages. Moreover, the model is able to transfer voices across languages, e.g. synthesize fluent Spanish speech using an English speaker's voice, without training on any bilingual or parallel examples. Such transfer works across distantly related languages, e.g. English and Mandarin. Critical to achieving this result are: 1. using a phonemic input representation to encourage sharing of model capacity across languages, and 2. incorporating an adversarial loss term to encourage the model to disentangle its representation of speaker identity (which is perfectly correlated with language in the training data) from the speech content. Further scaling up the model by training on multiple speakers of each language, and incorporating an autoencoding input to help stabilize attention during training, results in a model which can be used to consistently synthesize intelligible speech for training speakers in all languages seen during training, and in native or foreign accents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This Tacotron extension gets cross-lingual voice cloning without parallel data by combining phonemic inputs with an adversarial speaker loss, and the results look usable even if the disentanglement is not perfect.

The main point is that the model can take an English speaker and produce fluent Spanish or Mandarin output in that voice, trained only on monolingual data. They do this by switching to phoneme inputs so the acoustic model can share capacity across languages, then adding an adversarial loss that tries to strip language information out of the speaker embedding even though every speaker in the data is tied to exactly one language. They also add an autoencoding path to keep attention stable and train on multiple speakers per language. The outcome is intelligible speech in native and foreign accents for the languages they cover.

That combination is the concrete advance over plain multilingual Tacotron. The experiments show the transfer works across distant language pairs, which is the practical payoff. The setup is described clearly enough that the key pieces could be reproduced.

The soft spot is the adversarial term itself. Speaker identity and language are perfectly correlated in the training data, so the only thing preventing the embedding from carrying language cues is whether the min-max game actually reaches a useful equilibrium. If the discriminator is under-powered or the gradients are unstable, the model could still be leaking language information and the cross-lingual results would be less general than claimed. Their reported outputs suggest the trick holds up for the tested cases, but an ablation on the loss weight and some embedding analysis would have made the claim tighter.

This is for speech synthesis groups that need multilingual systems without bilingual recordings. The method is grounded enough and the claim is specific enough that it should go to referees rather than get desk-rejected.

Referee Report

1 major / 0 minor

Summary. The manuscript presents a Tacotron-based multispeaker multilingual TTS model that synthesizes high-quality speech in multiple languages and performs cross-lingual voice cloning (e.g., English speaker voice in Spanish or Mandarin) without any bilingual or parallel training data. The approach relies on phonemic input representations to share model capacity across languages and an adversarial loss to disentangle speaker identity from language (despite perfect speaker-language correlation in the monolingual data); scaling to multiple speakers per language plus an autoencoding input further stabilizes training and enables intelligible synthesis in native and foreign accents.

Significance. If the empirical results hold with rigorous validation, the work would be significant for multilingual TTS and zero-shot cross-lingual voice cloning. It directly tackles the speaker-language correlation problem via adversarial training and phonemic inputs, offering a practical path to voice transfer across distantly related languages without parallel corpora.

major comments (1)

[Abstract] The central claim of successful cross-lingual voice transfer without bilingual data rests on the adversarial loss fully disentangling speaker identity from language. Given that each speaker appears in only one language (perfect correlation), it is unclear whether the min-max equilibrium removes language cues from the speaker embedding or whether residual language-specific artifacts remain; this is load-bearing for the transfer result and requires explicit analysis (e.g., probing the embedding for language predictability or ablation of the adversarial term).

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The concern about verifying the adversarial loss's effectiveness in disentangling speaker and language representations, given the perfect correlation in the data, is well-taken and directly relevant to the central claim. We address this point below.

Referee: [Abstract] The central claim of successful cross-lingual voice transfer without bilingual data rests on the adversarial loss fully disentangling speaker identity from language. Given that each speaker appears in only one language (perfect correlation), it is unclear whether the min-max equilibrium removes language cues from the speaker embedding or whether residual language-specific artifacts remain; this is load-bearing for the transfer result and requires explicit analysis (e.g., probing the embedding for language predictability or ablation of the adversarial term).

Authors: We agree that the perfect speaker-language correlation makes explicit verification of disentanglement essential, and that indirect evidence from synthesis quality alone is insufficient to fully substantiate the claim. In the revised manuscript we will add (1) an ablation comparing performance with and without the adversarial term and (2) a language-prediction probe trained on the speaker embeddings to quantify residual language information before versus after adversarial training. These additions will directly test whether the min-max equilibrium removes language cues.

Revision: yes
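A language-prediction probe of the kind proposed in the rebuttal could be as simple as a nearest-centroid classifier over speaker embeddings, evaluated on speakers held out from fitting. The sketch below is one hypothetical form of such a probe, not something taken from the paper. The function name, the Euclidean distance, and the centroid classifier are all assumptions.

```python
import numpy as np


def probe_accuracy(train_emb, train_lang, test_emb, test_lang):
    """Accuracy of a nearest-centroid language probe on speaker embeddings."""
    languages = sorted(set(train_lang))
    train_lang = np.asarray(train_lang)
    # One centroid per language, computed from the fitting speakers only.
    centroids = np.stack(
        [train_emb[train_lang == lang].mean(axis=0) for lang in languages]
    )
    # Distance from every held-out embedding to every language centroid.
    dists = np.linalg.norm(test_emb[:, None, :] - centroids[None, :, :], axis=-1)
    predicted = [languages[i] for i in dists.argmin(axis=1)]
    return float(np.mean([p == t for p, t in zip(predicted, test_lang)]))
```

Accuracy near the majority-class rate would suggest little recoverable language information, while accuracy near 1.0 would suggest the embeddings still encode language. A weak probe can miss information that a stronger classifier would find, so a low score is suggestive rather than conclusive.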

Circularity Check

0 steps flagged

No significant circularity; empirical ML demonstration

The paper presents an empirical TTS model (Tacotron-based) trained on monolingual speaker data across languages, using phonemic representations and an adversarial loss for disentanglement. Cross-lingual synthesis results are shown via training and evaluation on held-out data, not by mathematical derivation that reduces to fitted inputs or self-referential definitions. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are used; the central claims rest on optimization outcomes and external benchmarks rather than construction from the model's own parameters.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on standard neural network training assumptions plus two domain-specific premises about phoneme sharing and adversarial disentanglement; no new entities are postulated.

free parameters (1)

adversarial loss weight
The balance between the main reconstruction loss and the adversarial term is a tunable hyperparameter required for the disentanglement to succeed.
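The role of this weight can be written in the generic saddle-point form used for domain-adversarial training. The notation is an assumption introduced here for illustration and is not taken from the paper: $\theta$ denotes the synthesis model parameters, $\psi$ the adversary's parameters, $\mathcal{L}_{\mathrm{syn}}$ the main synthesis loss, $\mathcal{L}_{\mathrm{adv}}$ the adversary's classification loss, and $\lambda$ the adversarial loss weight.

```latex
\begin{aligned}
\hat{\psi}   &= \arg\min_{\psi}\; \mathcal{L}_{\mathrm{adv}}(\theta, \psi), \\
\hat{\theta} &= \arg\min_{\theta}\; \mathcal{L}_{\mathrm{syn}}(\theta)
                - \lambda\, \mathcal{L}_{\mathrm{adv}}(\theta, \hat{\psi}).
\end{aligned}
```

Setting $\lambda = 0$ removes the adversarial pressure on $\theta$ entirely, while a very large $\lambda$ lets the adversarial term dominate the synthesis loss. This is why the weight appears in the ledger as a parameter that has to be tuned.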

axioms (2)

domain assumption: Phonemic input representations encourage sharing of model capacity across languages
Invoked to justify using a single model for multiple languages without language-specific adaptations.
domain assumption: Adversarial training can separate speaker identity from language content despite perfect correlation in training data
Core premise for enabling voice transfer without parallel examples.

pith-pipeline@v0.9.0 · 5747 in / 1450 out tokens · 33609 ms · 2026-05-25T00:07:58.423511+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "incorporating an adversarial loss term to encourage the model to disentangle its representation of speaker identity (which is perfectly correlated with language in the training data) from the speech content"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Paper passage: "using a phonemic input representation to encourage sharing of model capacity across languages"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 5 internal anchors

[1] Introduction. Recent end-to-end neural TTS models [1–3] have been extended to enable control of speaker identity [4–7] as well as unlabelled speech attributes, e.g. prosody, by conditioning synthesis on latent representations [8–12] in addition to text. Extending such models to support multiple, unrelated languages is nontrivial when using language-dependent inp...
[2] Model Structure (arXiv:1907.04448v2 [cs.CL], 24 Jul 2019). We base our multilingual TTS model on Tacotron 2 [20], which uses an attention-based sequence-to-sequence model to generate a sequence of log-mel spectrogram frames based on an input text sequence. The architecture is illustrated in Figure 1. It augments the base Tacotron 2 model with additional speaker and, optionally...
[3] In our experiments, we observe that feeding in the prior mean (all zeros) during inference significantly improves stability of cross-lingual speaker transfer and leads to improved naturalness as shown by MOS evaluations in Section 3.4. 2.3. Adversarial training. One of the challenges for multilingual TTS is data sparsity, where some languages may only have training data for a few sp...
[4] Experiments. We train models using a proprietary dataset composed of high quality speech in three languages: (1) 385 hours of English (EN) from 84 professional voice actors with accents from the United States, Great Britain, Australia, and Singapore; (2) 97 hours of Spanish (ES) from 3 female speakers, including Castilian and US Spanish; (3) 68 hours of Mandarin (CN) ...
[5] Conclusions. We describe extensions to the Tacotron 2 neural TTS model which allow training of a multilingual model trained only on monolingual speakers, which is able to synthesize high quality speech in three languages, and transfer training voices across languages. Furthermore, the model learns to speak foreign languages with moderate control of accen...
[6] Acknowledgements. We thank Ami Patel, Amanda Ritchart-Scott, Ryan Li, Siamak Tazari, Yutian Chen, Paul McCartney, Eric Battenberg, Toby Hawker, and Rob Clark for discussions and helpful feedback
[7] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” CoRR abs/1609.03499, 2016
[8] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al., “Tacotron: A fully end-to-end text-to-speech synthesis model,” arXiv preprint, 2017
[9] S. Arik, G. Diamos, A. Gibiansky, J. Miller, K. Peng, W. Ping, J. Raiman, and Y. Zhou, “Deep Voice 2: Multi-speaker neural text-to-speech,” in Advances in Neural Information Processing Systems (NIPS), 2017
[10] S. O. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou, “Neural voice cloning with a few samples,” in Advances in Neural Information Processing Systems, 2018
[11] Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, P. Nguyen, R. Pang, I. L. Moreno, and Y. Wu, “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” in Advances in Neural Information Processing Systems, 2018
[12] E. Nachmani, A. Polyak, Y. Taigman, and L. Wolf, “Fitting new speakers based on a short untranscribed sample,” in International Conference on Machine Learning (ICML), 2018
[13] Y. Chen, Y. Assael, B. Shillingford, D. Budden, S. Reed, H. Zen, Q. Wang, L. C. Cobo, A. Trask, B. Laurie et al., “Sample efficient adaptive text-to-speech,” arXiv preprint arXiv:1809.10460, 2018
[14] Y. Wang, D. Stanton, Y. Zhang, R. Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, F. Ren, Y. Jia, and R. A. Saurous, “Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis,” in International Conference on Machine Learning (ICML), 2018
[15] R. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton, J. Shor, R. J. Weiss, R. Clark, and R. A. Saurous, “Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron,” in International Conference on Machine Learning (ICML), 2018
[16] K. Akuzawa, Y. Iwasawa, and Y. Matsuo, “Expressive speech synthesis via modeling expressions with variational autoencoder,” in Interspeech, 2018
[17] G. E. Henter, J. Lorenzo-Trueba, X. Wang, and J. Yamagishi, “Deep encoder-decoder models for unsupervised learning of controllable speech synthesis,” arXiv preprint arXiv:1807.11470, 2018
[18] W.-N. Hsu, Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Y. Wang, Y. Cao, Y. Jia, Z. Chen, J. Shen, P. Nguyen, and R. Pang, “Hierarchical generative modeling for controllable speech synthesis,” in ICLR, 2019
[19] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulović, and J. Latorre, “Statistical parametric speech synthesis based on speaker and language factorization,” IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 6, pp. 1713–1724, 2012
[20] B. Li and H. Zen, “Multi-language multi-speaker acoustic modeling for LSTM-RNN based statistical parametric speech synthesis,” in Proc. Interspeech, 2016, pp. 2468–2472
[21] H. Ming, Y. Lu, Z. Zhang, and M. Dong, “A light-weight method of building an LSTM-RNN-based bilingual TTS system,” in International Conference on Asian Language Processing, 2017, pp. 201–205
[22] Y. Lee and T. Kim, “Learning pronunciation from a foreign language in speech synthesis networks,” arXiv preprint arXiv:1811.09364, 2018
[23] E. Nachmani and L. Wolf, “Unsupervised polyglot text to speech,” in ICASSP, 2019
[24] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: a vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016
[25] B. Li, Y. Zhang, T. Sainath, Y. Wu, and W. Chan, “Bytes are all you need: End-to-end multilingual speech recognition and synthesis with bytes,” in ICASSP, 2018
[26] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in ICASSP, 2018
[27] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in International Conference on Learning Representations (ICLR), 2014
[28] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu, “Efficient neural audio synthesis,” in ICML, 2018
[29] J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, and Y. Bengio, “Char2wav: End-to-end speech synthesis,” in ICLR: Workshop, 2017
[30] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, “Deep Voice 3: Scaling text-to-speech with convolutional sequence learning,” in International Conference on Learning Representations (ICLR), 2018
[31] K. Kastner, J. F. Santos, Y. Bengio, and A. C. Courville, “Representation mixing for TTS synthesis,” arXiv:1811.07240, 2018
[32] A. Van Den Bosch and W. Daelemans, “Data-oriented methods for grapheme-to-phoneme conversion,” in Proc. Association for Computational Linguistics, 1993, pp. 45–53
[33] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096–2030, 2016
[34] W.-N. Hsu, Y. Zhang, R. J. Weiss, Y.-A. Chung, Y. Wang, Y. Wu, and J. Glass, “Disentangling correlated speaker and noise for speech synthesis via data augmentation and adversarial factorization,” in ICASSP, 2019
[35] M. Wester and H. Liang, “Cross-lingual speaker discrimination using natural and synthetic speech,” in Twelfth Annual Conference of the International Speech Communication Association, 2011
[36] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, “Generalized end-to-end loss for speaker verification,” in Proc. ICASSP, 2018

[1] [1]

prosody, by conditioning synthesis on la- tent representations [8–12] in addition to text

Introduction Recentend-to-endneuralTTSmodels[1–3]havebeenextended to enable control of speaker identity [4–7] as well as unlabelled speech attributes, e.g. prosody, by conditioning synthesis on la- tent representations [8–12] in addition to text. Extending such models to support multiple, unrelated languages is nontrivial when using language-dependent inp...

work page

[2] [2]

Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning

Model Structure WebaseourmultilingualTTSmodelonTacotron2[20],which uses an attention-based sequence-to-sequence model to gener- ateasequenceoflog-melspectrogramframesbasedonaninput text sequence. The architecture is illustrated in Figure 1. It arXiv:1907.04448v2 [cs.CL] 24 Jul 2019 augments the base Tacotron 2 model with additional speaker and, optionally...

work page internal anchor Pith review Pith/arXiv arXiv 1907

[3] [3]

Inourexperiments,weobservethatfeedinginthepriormean (all zeros) during inference, signiﬁcantly improves stability of cross-lingualspeakertransferandleadstoimprovednaturalness as shown by MOS evaluations in Section 3.4. 2.3. Adversarial training OneofthechallengesformultilingualTTSisdatasparsity,where some languages may only have training data for a few sp...

work page

[4] [4]

heavyaccented

Experiments We train models using a proprietary dataset composed of high qualityspeechinthreelanguages: (1)385hoursofEnglish(EN) from 84 professional voice actors with accents from the United States, Great Britain, Australia, and Singapore; (2) 97 hours of Spanish (ES) from 3 female speakers include Castilian and US Spanish; (3) 68 hours of Mandarin (CN) ...

work page

[5] [5]

Conclusions We describe extensions to the Tacotron 2 neural TTS model which allow training of a multilingual model trained only on monolingual speakers, which is able to synthesize high quality speech in three languages, and transfer training voices across languages. Furthermore, the model learns to speak foreign lan- guages with moderate control of accen...

work page

[6] [6]

Acknowledgements We thank Ami Patel, Amanda Ritchart-Scott, Ryan Li, Siamak Tazari, Yutian Chen, Paul McCartney, Eric Battenberg, Toby Hawker, and Rob Clark for discussions and helpful feedback

work page

[7] [7]

WaveNet: A Generative Model for Raw Audio

A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” CoRR abs/1609.03499, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[8] [8]

Tacotron: A fully end-to-end text-to-speech synthesis model,

Y.Wang,R.Skerry-Ryan,D.Stanton,Y.Wu,R.J.Weiss,N.Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengioet al., “Tacotron: A fully end-to-end text-to-speech synthesis model,”arXiv preprint, 2017

work page 2017

[9] [9]

DeepVoice2: Multi-speakerneuraltext- to-speech,

S. Arik, G. Diamos, A. Gibiansky, J. Miller, K. Peng, W. Ping, J.Raiman,andY.Zhou,“DeepVoice2: Multi-speakerneuraltext- to-speech,”in AdvancesinNeuralInformationProcessingSystems (NIPS), 2017

work page 2017

[10] [10]

Neuralvoice cloning with a few samples,

S.O.Arik,J.Chen,K.Peng,W.Ping,andY.Zhou,“Neuralvoice cloning with a few samples,” inAdvances in Neural Information Processing Systems, 2018

work page 2018

[11] [11]

Transfer learn- ing from speaker veriﬁcation to multispeaker text-to-speech syn- thesis,

Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, P. Nguyen, R. Pang, I. L. Moreno, and Y. Wu, “Transfer learn- ing from speaker veriﬁcation to multispeaker text-to-speech syn- thesis,” inAdvances in Neural Information Processing Systems, 2018

work page 2018

[12] [12]

Fitting new speakers based on a short untranscribed sample,

E. Nachmani, A. Polyak, Y. Taigman, and L. Wolf, “Fitting new speakers based on a short untranscribed sample,” inInternational Conference on Machine Learning (ICML), 2018

work page 2018

[13] [13]

Sample Efficient Adaptive Text-to-Speech

Y. Chen, Y. Assael, B. Shillingford, D. Budden, S. Reed, H. Zen, Q.Wang,L.C.Cobo,A.Trask,B.Laurie etal.,“Sampleeﬃcient adaptive text-to-speech,”arXiv preprint arXiv:1809.10460, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[14] [14]

Style tokens: Unsupervised style modeling, control and transfer in end-to-end speechsynthesis,

Y. Wang, D. Stanton, Y. Zhang, R. Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, F. Ren, Y. Jia, and R. A. Saurous, “Style tokens: Unsupervised style modeling, control and transfer in end-to-end speechsynthesis,”in InternationalConferenceonMachineLearn- ing (ICML), 2018

work page 2018

[15] [15]

Towards end- to-endprosodytransferforexpressivespeechsynthesiswithTaco- tron,

R. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton, J. Shor, R. J. Weiss, R. Clark, and R. A. Saurous, “Towards end- to-endprosodytransferforexpressivespeechsynthesiswithTaco- tron,” inInternational Conference on Machine Learning (ICML), 2018

work page 2018

[16] [16]

Expressive speech synthesisviamodelingexpressionswithvariationalautoencoder,

K. Akuzawa, Y. Iwasawa, and Y. Matsuo, “Expressive speech synthesisviamodelingexpressionswithvariationalautoencoder,” inInterspeech, 2018

work page 2018

[17] [17]

Deep Encoder-Decoder Models for Unsupervised Learning of Controllable Speech Synthesis

G. E. Henter, J. Lorenzo-Trueba, X. Wang, and J. Yamagishi, “Deep encoder-decoder models for unsupervised learning of con- trollable speech synthesis,” arXiv preprint arXiv:1807.11470, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[18] [18]

Hierarchical generative modeling for controllable speech synthesis,

W.-N.Hsu,Y.Zhang,R.J.Weiss,H.Zen,Y.Wu,Y.Wang,Y.Cao, Y. Jia, Z. Chen, J. Shen, P. Nguyen, and R. Pang, “Hierarchical generative modeling for controllable speech synthesis,” inICLR, 2019

work page 2019

[19] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulović, and J. Latorre, “Statistical parametric speech synthesis based on speaker and language factorization,” IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 6, pp. 1713–1724, 2012

[20] B. Li and H. Zen, “Multi-language multi-speaker acoustic modeling for LSTM-RNN based statistical parametric speech synthesis,” in Proc. Interspeech, 2016, pp. 2468–2472

[21] H. Ming, Y. Lu, Z. Zhang, and M. Dong, “A light-weight method of building an LSTM-RNN-based bilingual TTS system,” in International Conference on Asian Language Processing, 2017, pp. 201–205

[22] Y. Lee and T. Kim, “Learning pronunciation from a foreign language in speech synthesis networks,” arXiv preprint arXiv:1811.09364, 2018

[23] E. Nachmani and L. Wolf, “Unsupervised polyglot text to speech,” in ICASSP, 2019

[24] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: a vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016

[25] B. Li, Y. Zhang, T. Sainath, Y. Wu, and W. Chan, “Bytes are all you need: End-to-end multilingual speech recognition and synthesis with bytes,” in ICASSP, 2018

[26] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in ICASSP, 2018

[27] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in International Conference on Learning Representations (ICLR), 2014

[28] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu, “Efficient neural audio synthesis,” in ICML, 2018

[29] J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, and Y. Bengio, “Char2wav: End-to-end speech synthesis,” in ICLR: Workshop, 2017

[30] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, “Deep Voice 3: Scaling text-to-speech with convolutional sequence learning,” in International Conference on Learning Representations (ICLR), 2018

[31] K. Kastner, J. F. Santos, Y. Bengio, and A. C. Courville, “Representation mixing for TTS synthesis,” arXiv:1811.07240, 2018

[32] A. Van Den Bosch and W. Daelemans, “Data-oriented methods for grapheme-to-phoneme conversion,” in Proc. Association for Computational Linguistics, 1993, pp. 45–53

[33] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096–2030, 2016

[34] W.-N. Hsu, Y. Zhang, R. J. Weiss, Y. an Chung, Y. Wang, Y. Wu, and J. Glass, “Disentangling correlated speaker and noise for speech synthesis via data augmentation and adversarial factorization,” in ICASSP, 2019

[35] M. Wester and H. Liang, “Cross-lingual speaker discrimination using natural and synthetic speech,” in Twelfth Annual Conference of the International Speech Communication Association, 2011

[36] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, “Generalized end-to-end loss for speaker verification,” in Proc. ICASSP, 2018