Expressive Speech Synthesis via Modeling Expressions with Variational Autoencoder

Kei Akuzawa; Yusuke Iwasawa; Yutaka Matsuo

arxiv 1804.02135 v3 pith:UVJIO3A5 submitted 2018-04-06 cs.CL cs.SD eess.AS

classification cs.CL cs.SD eess.AS

keywords speech autoregressive characteristics global model autoencoder expressions expressive

Recent advances in neural autoregressive models have improved the performance of speech synthesis (SS). However, because these models lack the ability to model global characteristics of speech (such as speaker individualities or speaking styles), particularly when these characteristics have not been labeled, making neural autoregressive SS systems more expressive is still an open issue. In this paper, we propose to combine VoiceLoop, an autoregressive SS model, with a Variational Autoencoder (VAE). This approach, unlike traditional autoregressive SS systems, uses the VAE to model the global characteristics explicitly, enabling the expressiveness of the synthesized speech to be controlled in an unsupervised manner. Experiments using the VCTK and Blizzard2012 datasets show that the VAE helps VoiceLoop to generate higher-quality speech and to control the expressions in its synthesized speech by incorporating global characteristics into the speech generation process.
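
The abstract describes using a VAE to capture utterance-level (global) characteristics as a latent variable. As a rough illustration of the generic VAE machinery involved, and not the authors' implementation, the sketch below shows the standard reparameterization step and the KL term against a standard normal prior. The function names, the latent size of 4, and the batch of 2 are assumptions made for illustration; how the latent is fed into VoiceLoop is not shown because the abstract does not specify it.

```python
import numpy as np


def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)


# Hypothetical encoder outputs for a batch of 2 utterances, latent size 4.
rng = np.random.default_rng(0)
mu = np.zeros((2, 4))
log_var = np.zeros((2, 4))

z = sample_latent(mu, log_var, rng)       # one global latent vector per utterance
kl = kl_to_standard_normal(mu, log_var)   # zero when the posterior equals the prior
```

In a setup like the one the abstract outlines, a vector such as `z` would condition the autoregressive synthesizer, and the KL term would be added to the reconstruction loss during training.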

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Fine-grained robust prosody transfer for single-speaker neural text-to-speech
eess.AS 2019-07 unverdicted novelty 6.0

Decouples prosody alignment via pre-computed phoneme timestamps and adds VAE to achieve robust fine-grained prosody transfer in single-speaker neural TTS from unseen speakers.

A Methodology for Controlling the Emotional Expressiveness in Synthetic Speech -- a Deep Learning approach
eess.AS 2019-07 unverdicted novelty 3.0

A methodology is proposed for emotional text-to-speech using emotional data collection, transfer-learning-based annotation of expressiveness features, and fine-tuning of a neutral TTS model.