Representation Mixing for TTS Synthesis

Aaron Courville; João Felipe Santos; Kyle Kastner; Yoshua Bengio

arXiv: 1811.07240 · v2 · pith:KVD6AQJFnew · submitted 2018-11-17 · cs.LG · cs.CL · cs.SD · eess.AS · stat.ML

Kyle Kastner, João Felipe Santos, Yoshua Bengio, Aaron Courville

keywords: character, choice, mixing, phoneme, representation, approach, audiobook, cases

Recent character and phoneme-based parametric TTS systems using deep learning have shown strong performance in natural speech generation. However, the choice between character or phoneme input can create serious limitations for practical deployment, as direct control of pronunciation is crucial in certain cases. We demonstrate a simple method for combining multiple types of linguistic information in a single encoder, named representation mixing, enabling flexible choice between character, phoneme, or mixed representations during inference. Experiments and user studies on a public audiobook corpus show the efficacy of our approach.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning
cs.CL · 2019-07 · unverdicted · novelty 7.0

A Tacotron model with phonemic inputs and adversarial disentanglement enables cross-lingual voice cloning without parallel data, producing intelligible speech in native and foreign accents.