Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning

Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O. Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, John Miller

classification cs.SD cs.AI cs.CL cs.LG eess.AS

keywords deep, voice, synthesis, attention-based, neural, scale, speech, text-to-speech

We present Deep Voice 3, a fully-convolutional attention-based neural text-to-speech (TTS) system. Deep Voice 3 matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster. We scale Deep Voice 3 to data set sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers. In addition, we identify common error modes of attention-based speech synthesis networks, demonstrate how to mitigate them, and compare several different waveform synthesis methods. We also describe how to scale inference to ten million queries per day on one single-GPU server.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
cs.SD 2024-12 unverdicted novelty 5.0

CosyVoice 2 delivers human-parity naturalness and near-lossless streaming speech synthesis by combining finite-scalar quantization, a streamlined pre-trained LLM, and chunk-aware causal flow matching on large multilin...