pith. machine review for the scientific record.

arXiv: 1710.07654 · v3 · submitted 2017-10-20 · 💻 cs.SD · cs.AI · cs.CL · cs.LG · eess.AS

Recognition: unknown

Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning

classification 💻 cs.SD · cs.AI · cs.CL · cs.LG · eess.AS
keywords deep · voice · synthesis · attention-based · neural · scale · speech · text-to-speech

We present Deep Voice 3, a fully-convolutional attention-based neural text-to-speech (TTS) system. Deep Voice 3 matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster. We scale Deep Voice 3 to data set sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers. In addition, we identify common error modes of attention-based speech synthesis networks, demonstrate how to mitigate them, and compare several different waveform synthesis methods. We also describe how to scale inference to ten million queries per day on one single-GPU server.

This paper has not been read by Pith yet.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

    cs.SD · 2024-12 · unverdicted · novelty 5.0

    CosyVoice 2 delivers human-parity naturalness and near-lossless streaming speech synthesis by combining finite-scalar quantization, a streamlined pre-trained LLM, and chunk-aware causal flow matching on large multilin...