A comparison of recent waveform generation and acoustic modeling methods for neural-network-based speech synthesis

Jaime Lorenzo-Trueba; Junichi Yamagishi; Lauri Juvela; Shinji Takaki; Xin Wang

arxiv: 1804.02549 · v1 · pith:X3HLBY3Dnew · submitted 2018-04-07 · eess.AS · cs.CL · cs.SD · stat.ML


Xin Wang, Jaime Lorenzo-Trueba, Shinji Takaki, Lauri Juvela, Junichi Yamagishi


keywords: acoustic, speech, model, modeling, approaches, evaluation, performed, recent


Recent advances in speech synthesis suggest that limitations such as the lossy nature of the amplitude spectrum with minimum-phase approximation and the over-smoothing effect in acoustic modeling can be overcome by using advanced machine learning approaches. In this paper, we build a framework in which new vocoding and acoustic modeling techniques can be fairly compared with conventional approaches by means of a large-scale crowdsourced evaluation. Results on acoustic models showed that generative adversarial networks and an autoregressive (AR) model both performed better than a normal recurrent network, with the AR model performing best. Evaluation of vocoders using the same AR acoustic model demonstrated that a WaveNet vocoder outperformed classical source-filter-based vocoders. In particular, speech waveforms generated by the combination of the AR acoustic model and the WaveNet vocoder achieved a speech quality score similar to that of vocoded speech.
