arxiv: 1804.02549 · v1 · pith:X3HLBY3Dnew · submitted 2018-04-07 · 📡 eess.AS · cs.CL · cs.SD · stat.ML

A comparison of recent waveform generation and acoustic modeling methods for neural-network-based speech synthesis

classification: eess.AS, cs.CL, cs.SD, stat.ML
keywords: acoustic, speech, model, modeling, approaches, evaluation, performed, recent

Recent advances in speech synthesis suggest that limitations such as the lossy nature of the amplitude spectrum with minimum phase approximation and the over-smoothing effect in acoustic modeling can be overcome by using advanced machine learning approaches. In this paper, we build a framework in which we can fairly compare new vocoding and acoustic modeling techniques with conventional approaches by means of a large-scale crowdsourced evaluation. Results on acoustic models showed that generative adversarial networks and an autoregressive (AR) model both performed better than a normal recurrent network, with the AR model performing best. Evaluation of vocoders using the same AR acoustic model demonstrated that a WaveNet vocoder outperformed classical source-filter-based vocoders. In particular, speech waveforms generated by the combination of the AR acoustic model and the WaveNet vocoder achieved a speech quality score similar to that of vocoded speech.