Deep Encoder-Decoder Models for Unsupervised Learning of Controllable Speech Synthesis

Gustav Eje Henter; Jaime Lorenzo-Trueba; Junichi Yamagishi; Xin Wang

arxiv: 1807.11470 · v3 · pith:LO64NP6Nnew · submitted 2018-07-30 · eess.AS · cs.LG · cs.SD · stat.ML

classification: eess.AS · cs.LG · cs.SD · stat.ML

keywords: speech · unsupervised · control · learning · models · synthesis · deep · emotional

Generating versatile and appropriate synthetic speech requires control over the output expression separate from the spoken text. Important non-textual speech variation is seldom annotated, in which case output control must be learned in an unsupervised fashion. In this paper, we perform an in-depth study of methods for unsupervised learning of control in statistical speech synthesis. For example, we show that popular unsupervised training heuristics can be interpreted as variational inference in certain autoencoder models. We additionally connect these models to VQ-VAEs, another, recently-proposed class of deep variational autoencoders, which we show can be derived from a very similar mathematical argument. The implications of these new probabilistic interpretations are discussed. We illustrate the utility of the various approaches with an application to acoustic modelling for emotional speech synthesis, where the unsupervised methods for learning expression control (without access to emotional labels) are found to give results that in many aspects match or surpass the previous best supervised approach.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning
cs.CL 2019-07 unverdicted novelty 7.0

A Tacotron model with phonemic inputs and adversarial disentanglement enables cross-lingual voice cloning without parallel data, producing intelligible speech in native and foreign accents.

A Methodology for Controlling the Emotional Expressiveness in Synthetic Speech -- a Deep Learning approach
eess.AS 2019-07 unverdicted novelty 3.0

A methodology is proposed for emotional text-to-speech using emotional data collection, transfer-learning-based annotation of expressiveness features, and fine-tuning of a neutral TTS model.