arxiv: 1803.09017 · v1 · pith:HCA3UV3Inew · submitted 2018-03-23 · 💻 cs.CL · cs.LG · cs.SD · eess.AS

Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

classification 💻 cs.CL cs.LG cs.SD eess.AS
keywords style, synthesis, gsts, speech, trained, control, embeddings, end-to-end
In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft interpretable "labels" they generate can be used to control synthesis in novel ways, such as varying speed and speaking style - independently of the text content. They can also be used for style transfer, replicating the speaking style of a single audio clip across an entire long-form text corpus. When trained on noisy, unlabeled found data, GSTs learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.
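
The idea of a token bank whose soft "labels" steer synthesis can be illustrated with a minimal sketch. It assumes the style embedding is a softmax-weighted sum of the tokens, with the weights either set by hand (control) or computed by attention from a reference query (transfer). Every name here (`make_token_bank`, `weights_from_reference`, `style_embedding`) is hypothetical, the random vectors stand in for embeddings that would really be learned jointly with Tacotron, and the scaled dot-product scoring is an illustrative choice rather than the paper's exact attention module.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_token_bank(num_tokens, dim, seed=0):
    """Hypothetical stand-in for a trained token bank: random vectors."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.5) for _ in range(dim)] for _ in range(num_tokens)]


def weights_from_reference(query, token_bank):
    """Soft 'labels': attention weights of a reference query over the tokens.

    Assumption: scaled dot-product scoring, chosen only for illustration.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in token_bank]
    return softmax(scores)


def style_embedding(weights, token_bank):
    """Weighted sum of the tokens: the vector that would condition synthesis."""
    dim = len(token_bank[0])
    return [sum(w * tok[d] for w, tok in zip(weights, token_bank)) for d in range(dim)]


bank = make_token_bank(num_tokens=4, dim=8)

# Control: choose the weights by hand, here selecting only token 1.
manual = style_embedding([0.0, 1.0, 0.0, 0.0], bank)

# Transfer: derive the weights from a (made-up) reference-audio summary vector.
reference_query = [0.1] * 8
weights = weights_from_reference(reference_query, bank)
transferred = style_embedding(weights, bank)
```

Because the weights do not depend on the input text, the same weight vector can in principle be reused across any number of sentences, which is the mechanism behind replicating one clip's style over a long-form corpus.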

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Forward-Backward Decoding for Regularizing End-to-End TTS

    eess.AS · 2019-07 · unverdicted · novelty 6.0

    Forward-backward decoding with divergence regularization and bidirectional decoder improves end-to-end TTS robustness and naturalness by addressing exposure bias via joint L2R/R2L training.

  2. A Methodology for Controlling the Emotional Expressiveness in Synthetic Speech -- a Deep Learning approach

    eess.AS · 2019-07 · unverdicted · novelty 3.0

    A methodology is proposed for emotional text-to-speech using emotional data collection, transfer-learning-based annotation of expressiveness features, and fine-tuning of a neutral TTS model.