Efficient Neural Audio Synthesis

Aaron van den Oord; Edward Lockhart; Erich Elsen; Florian Stimberg; Karen Simonyan; Koray Kavukcuoglu; Nal Kalchbrenner; Norman Casagrande; Sander Dieleman; Seb Noury

arxiv: 1802.08435 · v2 · pith:SHC65RYFnew · submitted 2018-02-23 · cs.SD · cs.LG · eess.AS

keywords audio · wavernn · number · quality · samples · sampling · time · describe

Abstract

Sequential models achieve state-of-the-art results in audio, visual and textual domains with respect to both estimating the data distribution and generating high-quality samples. Efficient sampling for this class of models has however remained an elusive problem. With a focus on text-to-speech synthesis, we describe a set of general techniques for reducing sampling time while maintaining high output quality. We first describe a single-layer recurrent neural network, the WaveRNN, with a dual softmax layer that matches the quality of the state-of-the-art WaveNet model. The compact form of the network makes it possible to generate 24kHz 16-bit audio 4x faster than real time on a GPU. Second, we apply a weight pruning technique to reduce the number of weights in the WaveRNN. We find that, for a constant number of parameters, large sparse networks perform better than small dense networks and this relationship holds for sparsity levels beyond 96%. The small number of weights in a Sparse WaveRNN makes it possible to sample high-fidelity audio on a mobile CPU in real time. Finally, we propose a new generation scheme based on subscaling that folds a long sequence into a batch of shorter sequences and allows one to generate multiple samples at once. The Subscale WaveRNN produces 16 samples per step without loss of quality and offers an orthogonal method for increasing sampling efficiency.
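Two of the ideas in the abstract can be made concrete with a minimal sketch. The first function assumes the dual softmax predicts a 16-bit sample as two 8-bit halves (coarse high bits, fine low bits), as the paper body describes; the abstract itself does not spell this out. The second shows only the data layout of subscaling, folding a sequence into a batch of interleaved shorter sequences; it does not model the conditioning scheme that makes parallel generation possible. All function names are hypothetical and chosen for illustration.

```python
from typing import List, Tuple


def split_sample(sample: int) -> Tuple[int, int]:
    """Split an unsigned 16-bit sample into (coarse, fine) 8-bit halves.

    Assumption: each half is the target of one 256-way softmax.
    """
    if not 0 <= sample < 65536:
        raise ValueError("expected an unsigned 16-bit sample")
    return sample >> 8, sample & 0xFF


def join_sample(coarse: int, fine: int) -> int:
    """Recombine the two 8-bit halves into the original 16-bit value."""
    return (coarse << 8) | fine


def fold(sequence: List[int], batch: int) -> List[List[int]]:
    """Fold a sequence into `batch` interleaved sub-sequences.

    Sub-sequence i holds samples i, i + batch, i + 2 * batch, ...
    """
    if len(sequence) % batch != 0:
        raise ValueError("sequence length must be a multiple of batch")
    return [sequence[i::batch] for i in range(batch)]


def unfold(subsequences: List[List[int]]) -> List[int]:
    """Invert fold() by re-interleaving the sub-sequences."""
    return [s for column in zip(*subsequences) for s in column]


if __name__ == "__main__":
    coarse, fine = split_sample(0xABCD)
    print(coarse, fine, join_sample(coarse, fine))

    audio = list(range(32))          # stand-in for 32 audio samples
    folded = fold(audio, batch=16)   # 16 sub-sequences of 2 samples each
    print(len(folded), folded[3])
    print(unfold(folded) == audio)
```

With a batch of 16, one generation step over the folded layout can in principle emit one sample per sub-sequence, which matches the "16 samples per step" figure quoted in the abstract.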

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ST-MoE: Designing Stable and Transferable Sparse Expert Models
cs.CL · 2022-02 · unverdicted · novelty 6.0

ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...

Generalization of Spectrum Differential based Direct Waveform Modification for Voice Conversion
eess.AS · 2019-07 · unverdicted · novelty 6.0

Residual-domain F0 transformation generalizes spectrum-differential direct waveform modification to arbitrary spectral conversion models in voice conversion.

Improving Performance of End-to-End ASR on Numeric Sequences
eess.AS · 2019-07 · unverdicted · novelty 4.0

TTS-generated numeric training data plus a compact neural denormalizer improve E2E ASR word error rates on numeric sequences by up to a factor of 8 for the longest cases.

A Methodology for Controlling the Emotional Expressiveness in Synthetic Speech -- a Deep Learning approach
eess.AS · 2019-07 · unverdicted · novelty 3.0

A methodology is proposed for emotional text-to-speech using emotional data collection, transfer-learning-based annotation of expressiveness features, and fine-tuning of a neutral TTS model.