WaveNet: A Generative Model for Raw Audio
Pith reviewed 2026-05-12 20:23 UTC · model grok-4.3
The pith
WaveNet generates raw audio waveforms by predicting each sample from all previous ones and yields more natural text-to-speech than prior systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WaveNet is a fully probabilistic autoregressive deep neural network for raw audio waveforms, with the predictive distribution for each audio sample conditioned on all previous ones. It can be efficiently trained on data with tens of thousands of samples per second. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity by conditioning on speaker identity, and when trained to model music it generates novel and often highly realistic musical fragments.
What carries the argument
The autoregressive predictive distribution over each raw audio sample conditioned on all prior samples, realized in a deep neural network architecture.
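The factorization described above can be illustrated with a small sketch. It is not the paper's network: `predict` is a hypothetical toy stand-in that accepts the full history but, for brevity, only looks at the last sample, and `LEVELS = 4` is an arbitrary illustrative number of quantized amplitude levels.

```python
import math
import random

LEVELS = 4  # hypothetical number of quantized amplitude levels


def predict(history):
    """Toy stand-in for the network: a distribution over the next sample.

    Favors repeating the previous level; uniform when there is no history.
    """
    if not history:
        return [1.0 / LEVELS] * LEVELS
    probs = [0.1] * LEVELS
    probs[history[-1]] = 0.7  # 0.7 + 3 * 0.1 sums to 1.0
    return probs


def sample_sequence(length, rng):
    """Generate one sample at a time, feeding each draw back in as context."""
    seq = []
    for _ in range(length):
        probs = predict(seq)
        seq.append(rng.choices(range(LEVELS), weights=probs)[0])
    return seq


def log_likelihood(seq):
    """Chain rule: log p(x) = sum over t of log p(x_t | x_1, ..., x_{t-1})."""
    return sum(math.log(predict(seq[:t])[x]) for t, x in enumerate(seq))
```

Calling `sample_sequence(8, random.Random(0))` draws eight levels sequentially, and `log_likelihood` scores any sequence under the same conditionals. The sequential loop is also why sample-by-sample generation is slow relative to training, where all conditionals can be evaluated at once on known data.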
If this is right
- Text-to-speech systems can achieve higher naturalness as judged by human listeners.
- A single model can represent the voices of many different speakers through conditioning on speaker identity.
- The architecture can generate novel and realistic musical fragments when trained on music data.
- The same network can be repurposed for discriminative tasks such as phoneme recognition with promising results.
Where Pith is reading between the lines
- The autoregressive sample-by-sample approach might extend to other high-rate sequential signals such as video or sensor streams.
- Further conditioning inputs could allow finer control over generated audio content beyond speaker identity.
- Efficiency improvements could support real-time interactive audio generation applications.
Load-bearing premise
Human listener ratings of naturalness provide a reliable and unbiased measure of generated audio quality.
What would settle it
A controlled blind listening test in which average naturalness ratings for WaveNet audio are not higher than those for the best parametric or concatenative text-to-speech systems.
Original abstract
This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces WaveNet, a deep autoregressive neural network for raw audio waveform generation based on dilated causal convolutions. It demonstrates that the model can be trained efficiently on high-sample-rate audio data. When applied to text-to-speech, it claims state-of-the-art performance with human listeners rating the synthesized speech as significantly more natural than the best parametric and concatenative systems for both English and Mandarin. A single model captures multiple speakers via conditioning on speaker identity. Additional results include realistic music generation and promising phoneme recognition performance when used discriminatively.
Significance. If the human evaluation results hold under scrutiny, this represents a significant advance in audio generation by showing that direct probabilistic modeling of raw waveforms can outperform traditional TTS pipelines. The dilated convolution architecture efficiently captures long-range temporal structure, which is a key technical contribution. Credit is given for the explicit demonstration of efficient training despite the autoregressive formulation and for the multi-speaker conditioning results.
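To make the dilated causal convolution mentioned in the report concrete, here is a minimal pure-Python sketch. The function names, the kernel size of 2, and the all-ones weights are illustrative assumptions, not details taken from the manuscript; a real layer would use learned weights, gating, and residual connections.

```python
def causal_dilated_conv(x, weights, dilation):
    """1-D causal convolution with a given dilation.

    y[t] = sum_k weights[k] * x[t - (K - 1 - k) * dilation], where samples
    before the start of the signal are treated as zero. The output at time t
    therefore never depends on any input after time t.
    """
    k_size = len(weights)
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - (k_size - 1 - k) * dilation
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y


def stack(x, dilations, weights=(1.0, 1.0)):
    """Apply several causal dilated layers in sequence (no nonlinearity)."""
    for d in dilations:
        x = causal_dilated_conv(x, weights, d)
    return x
```

Feeding a unit impulse through `stack(..., [1, 2, 4])` spreads it over the next eight time steps and never backward in time, which is the sense in which doubling dilations lets a few layers cover a long context.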
major comments (1)
- [TTS experiments (Section 4)] The central SOTA claim for TTS (abstract and experiments section) rests on human naturalness ratings being significantly higher than parametric/concatenative baselines. The manuscript provides no details on the number of raters, statistical significance testing, rating scale or protocol (e.g., blind presentation, sample selection), or objective corroborating metrics such as MCD or PESQ. This information is load-bearing for verifying that the preference reflects model quality rather than test artifacts.
minor comments (2)
- [Abstract] The abstract states performance improvements via human evaluations but does not reference the specific quantitative results or tables that support the 'significantly more natural' claim.
- [Model architecture (Section 2)] The description of speaker conditioning could be strengthened by an explicit equation or diagram showing how the speaker embedding is injected into the dilated layers.
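As a sketch of the kind of equation the second minor comment asks for, one common way to write a gated activation with a global conditioning vector is shown below. The symbols are assumptions made for illustration: $\mathbf{x}$ is the layer input, $\mathbf{h}$ a speaker embedding broadcast over time, $*$ a dilated causal convolution, $W_f, W_g$ filter and gate convolution weights, and $V_f, V_g$ learned projections of $\mathbf{h}$.

```latex
\mathbf{z} = \tanh\!\left(W_{f} * \mathbf{x} + V_{f}^{\top}\mathbf{h}\right)
             \odot
             \sigma\!\left(W_{g} * \mathbf{x} + V_{g}^{\top}\mathbf{h}\right)
```

In this form the speaker term acts as a per-channel bias that is constant across time steps, so switching speakers only changes $\mathbf{h}$.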
Simulated Author's Rebuttal
We thank the referee for the careful review and for identifying the need for greater transparency in the TTS evaluation protocol. We address the single major comment below and will revise the manuscript accordingly.
Point-by-point responses
-
Referee: [TTS experiments (Section 4)] The central SOTA claim for TTS (abstract and experiments section) rests on human naturalness ratings being significantly higher than parametric/concatenative baselines. The manuscript provides no details on the number of raters, statistical significance testing, rating scale or protocol (e.g., blind presentation, sample selection), or objective corroborating metrics such as MCD or PESQ. This information is load-bearing for verifying that the preference reflects model quality rather than test artifacts.
Authors: We agree that the manuscript would benefit from additional details on the human evaluation to allow readers to fully assess the strength of the SOTA claims. The current version does not sufficiently describe the listening test protocol, including participant numbers, rating scale, blinding procedures, sample selection, statistical testing, or objective corroborating metrics. In the revised manuscript we will expand the relevant section to specify the mean opinion score (MOS) protocol, the number of raters and their selection criteria, confirmation that samples were presented blindly in randomized order, the statistical tests used to establish significance, and any objective metrics (such as MCD) that were computed alongside the perceptual ratings. These additions will directly address the concern that the reported preference could stem from test artifacts rather than model quality.
Revision: yes
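The kind of reporting the referee requests can be sketched as a mean opinion score with a confidence interval. This is only an illustration: the 5-point scale, the normal-approximation interval, and the ratings in the usage note are assumptions, not values or procedures taken from the manuscript.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, (low, high)) for a list of listener ratings.

    Uses the sample standard deviation and a normal approximation, so the
    interval is only rough for small rater counts.
    """
    mean = statistics.mean(ratings)
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, (mean - z * sem, mean + z * sem)
```

For the made-up ratings `[4, 5, 4, 3, 5, 4, 4, 5]` this gives a mean of 4.25 with an interval of roughly 3.76 to 4.74. Two systems whose intervals overlap heavily would need a paired significance test before one is called more natural than the other.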
Circularity Check
No circularity: WaveNet architecture and TTS claims rest on explicit model definition plus external human ratings
Full rationale
The paper defines the autoregressive dilated-convolution architecture, softmax output, and conditioning mechanisms directly from first principles (causal convolutions, residual/skip connections). Training maximizes the standard next-sample log-likelihood on external audio corpora. The central TTS claim (SOTA naturalness) is supported solely by separate human listening tests whose ratings are not algebraically or statistically forced by any fitted parameter inside the model equations. No self-citation chain, ansatz smuggling, or renaming of known results occurs for the performance assertions. The derivation chain is therefore self-contained and non-circular.
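The training objective named above, next-sample log-likelihood under a softmax output, can be sketched in a few lines. The helper names and the list-of-lists layout for logits are illustrative assumptions; the sketch shows the loss only, not the network that would produce the logits.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax via the log-sum-exp trick."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def next_sample_nll(logits_per_step, targets):
    """Average negative log-likelihood of the true next samples.

    logits_per_step[t] holds one score per quantized amplitude class, and
    targets[t] is the index of the class that actually occurred at step t.
    """
    total = 0.0
    for logits, target in zip(logits_per_step, targets):
        total -= log_softmax(logits)[target]
    return total / len(targets)
```

With uniform logits over four classes the loss equals `math.log(4)` regardless of the targets, which is a convenient sanity check: a trained model should score below the uniform baseline for its number of classes.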
Axiom & Free-Parameter Ledger
free parameters (2)
- dilation schedule and network depth
- conditioning mechanisms for text and speaker
axioms (2)
- domain assumption: Raw audio waveforms can be modeled as autoregressive sequences where each sample depends statistically on all previous samples.
- domain assumption: Dilated convolutions efficiently capture long-range temporal dependencies in audio data at high sample rates.
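How strongly the dilation schedule (listed above as a free parameter) determines the usable context can be checked with a short receptive-field calculation. The kernel size of 2 and the schedule of powers of two repeated three times are illustrative assumptions, as is the 16 kHz sample rate mentioned afterward.

```python
def receptive_field(dilations, kernel_size=2):
    """Number of input samples that can influence one output sample.

    Each stacked causal layer with dilation d and kernel size K extends the
    receptive field by (K - 1) * d samples.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical schedule: dilations 1, 2, 4, ..., 512, repeated three times.
schedule = [2 ** i for i in range(10)] * 3
```

Here `receptive_field([1, 2, 4, 8])` is 16 and `receptive_field(schedule)` is 3070 samples, about 0.19 seconds at an assumed 16 kHz. The same 30 layers with no dilation would cover only 31 samples, which is the efficiency argument behind the second axiom.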
Forward citations
Cited by 60 Pith papers
-
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
-
Efficiently Modeling Long Sequences with Structured State Spaces
S4 is an efficient state space sequence model that captures long-range dependencies via structured parameterization of the SSM, achieving state-of-the-art results on the Long Range Arena and other benchmarks while bei...
-
Denoising Diffusion Implicit Models
DDIMs construct non-Markovian diffusion processes that share DDPM training objectives but allow much faster reverse sampling, demonstrated empirically at 10-50x wall-clock speedup.
-
DiffWave: A Versatile Diffusion Model for Audio Synthesis
DiffWave is a non-autoregressive diffusion model that generates high-fidelity audio waveforms from noise in constant steps, matching WaveNet vocoder quality while being orders of magnitude faster and outperforming pri...
-
Denoising Diffusion Probabilistic Models
Denoising diffusion probabilistic models generate high-quality images by learning to reverse a fixed forward diffusion process, achieving FID 3.17 on CIFAR10.
-
Contrast to Detect: Dynamic Graph Contrastive Regularization for Unsupervised Anomaly Detection in Multivariate Time Series
ContrastAD achieves highest mean F1 on all five MTS benchmarks and highest AUC on three by building DTW-based sparse graph snapshots and contrasting divergent pairs with a stable anchor instead of enforcing invariance.
-
Scale-Equivariant Generative Forecasting: Weight-Tied Dilated Convolutions, Wavelet Scattering Inputs, and Spectral-Consistency Training for Self-Similar Time Series
Presents SE-WaveNet with weight-tied dilated convolutions plus wavelet and spectral components that reproduces empirical scaling collapse on financial returns while using L times fewer convolutional parameters.
-
Neural network modeling of many-body super- and sub-radiant dynamics
Neural quantum states simulate dissipative many-body emission dynamics for approximately 40 atoms in dense 1D and 2D arrays, revealing prominent subradiant behavior at late times.
-
MelShield: Robust Mel-Domain Audio Watermarking for Provenance Attribution of AI Generated Synthesized Speech
MelShield adds keyed low-energy spread-spectrum perturbations to Mel-spectrograms inside TTS pipelines before vocoding to enable robust extraction of user-specific attribution signals even after compression or noise.
-
DiffAnon: Diffusion-based Prosody Control for Voice Anonymization
DiffAnon introduces the first diffusion model for voice anonymization that supplies structured, continuous, inference-time control over prosody preservation via classifier-free guidance on RVQ semantic embeddings.
-
Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages
Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.
-
ReLU Networks for Exact Generation of Similar Graphs
Constant-depth ReLU networks of size O(n²d) exist that deterministically generate graphs within edit distance d from any given n-vertex input graph.
-
Moshi: a speech-text foundation model for real-time dialogue
Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
-
Deep Time Series Models: A Comprehensive Survey and Benchmark
This survey and benchmark of deep time series models using the released TSLib library finds that models with specific structures perform well only on distinct analysis tasks.
-
Chronos: Learning the Language of Time Series
Chronos pretrains transformer models on tokenized time series to deliver strong zero-shot forecasting across diverse domains.
-
Massive Activations in Large Language Models
Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.
-
A decoder-only foundation model for time-series forecasting
A pretrained decoder-only patched transformer achieves near state-of-the-art zero-shot forecasting performance across diverse time series datasets and settings.
-
High Fidelity Neural Audio Compression
EnCodec is an end-to-end trained streaming neural audio codec that uses a single multiscale spectrogram discriminator and a gradient-normalizing loss balancer to achieve higher fidelity than prior methods at the same ...
-
A Generalist Agent
Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.
-
Diffusion Models Beat GANs on Image Synthesis
Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.
-
Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning
A Tacotron model with phonemic inputs and adversarial disentanglement enables cross-lingual voice cloning without parallel data, producing intelligible speech in native and foreign accents.
-
Analysis by Adversarial Synthesis -- A Novel Approach for Speech Vocoding
A conditional GAN is used to synthesize speech waveforms from compressed glottal excitation, refined by LPC parameters, yielding higher quality reconstructions than traditional methods on a 30-speaker dataset.
-
RUSLAN: Russian Spoken Language Corpus for Speech Synthesis
RUSLAN is a 31-hour single-speaker Russian speech corpus for TTS containing 22200 annotated samples, with a baseline end-to-end model scoring 4.05 naturalness and 3.78 intelligibility on MOS tests.
-
Generating Long Sequences with Sparse Transformers
Sparse Transformers factorize attention to handle sequences tens of thousands long, achieving new SOTA density modeling on Enwik8, CIFAR-10, and ImageNet-64.
-
Progressive Growing of GANs for Improved Quality, Stability, and Variation
Progressive growing stabilizes GAN training to produce high-resolution images of unprecedented quality and achieves a record unsupervised inception score of 8.80 on CIFAR10.
-
WavFlow: Audio Generation in Waveform Space
WavFlow performs direct waveform audio generation via flow matching on 2D token grids from raw patches plus amplitude lifting, matching latent-based methods on VGGSound and AudioCaps without intermediate compression.
-
GenTS: A Comprehensive Benchmark Library for Generative Time Series Models
GenTS is a modular benchmark library providing unified data pipelines, generative models, and evaluation metrics for time series synthesis, forecasting, and imputation, with open-source code and initial benchmarking e...
-
Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation
Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.
-
AIBuildAI: An AI Agent for Automatically Building AI Models
AIBuildAI uses a manager agent and three LLM sub-agents to fully automate AI model development and achieves a 63.1% medal rate on MLE-Bench, matching experienced human engineers.
-
A Novel Automatic Framework for Speaker Drift Detection in Synthesized Speech
A framework detects speaker drift in TTS outputs by computing cosine similarities across speech segments and using LLMs for binary classification, supported by a human-validated synthetic benchmark.
-
Optimal-Transport-Guided Functional Flow Matching for Turbulent Field Generation in Hilbert Space
FOT-CFM generates turbulent fields in function space with superior high-order statistics and energy spectra on Navier-Stokes, Kolmogorov flow, and Hasegawa-Wakatani equations compared to baselines.
-
Borderless Long Speech Synthesis
Borderless Long Speech Synthesis unifies voice design, multi-speaker TTS, and long-form generation via Global-Sentence-Token annotations, CoT reasoning, and a Structured Semantic Interface for agent-centric control.
-
mGRADE: Minimal Recurrent Gating Meets Delay Convolutions for Lightweight Sequence Modeling
mGRADE uses learnable-spaced convolutions shown to be equivalent to delay embeddings plus a lightweight gated recurrent component to achieve low-memory multi-timescale sequence modeling.
-
SwitchCodec: A High-Fidelity Nerual Audio Codec With Sparse Quantization
SwitchCodec introduces Residual Experts Vector Quantization and a multi-tiered STFT discriminator to achieve PESQ 2.87 and ViSQOL 4.27 at 2.67 kbps while halving training time via post-training.
-
Is Conditional Generative Modeling all you need for Decision-Making?
Return-conditional diffusion models for policies outperform offline RL on benchmarks by circumventing dynamic programming and enable constraint or skill composition.
-
Simplified State Space Layers for Sequence Modeling
S5 uses a single MIMO state space model with S4-derived initialization to match S4 efficiency and reach 87.4% average accuracy on the Long Range Arena benchmark.
-
Text and Code Embeddings by Contrastive Pre-Training
Contrastive pre-training on unsupervised data at scale creates text and code embeddings that set new state-of-the-art results on classification and semantic search benchmarks.
-
Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges
Geometric deep learning provides a unified mathematical framework based on grids, groups, graphs, geodesics, and gauges to explain and extend neural network architectures by incorporating physical regularities.
-
VideoGPT: Video Generation using VQ-VAE and Transformers
VideoGPT generates competitive natural videos by learning discrete latents with VQ-VAE and modeling them autoregressively with a transformer.
-
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
GShard supplies automatic sharding and conditional computation support that enabled training a 600-billion-parameter multilingual translation model on thousands of TPUs with superior quality.
-
Jukebox: A Generative Model for Music
Jukebox generates high-fidelity and diverse songs with singing and coherence up to multiple minutes by compressing raw audio via multi-scale VQ-VAE and modeling the codes with large autoregressive Transformers conditi...
-
Compressive Transformers for Long-Range Sequence Modelling
Compressive Transformer sets new records on WikiText-103 (17.1 ppl) and Enwik8 (0.97 bpc) via memory compression and introduces the PG-19 long-range language benchmark.
-
Non-Parallel Voice Conversion with Cyclic Variational Autoencoder
CycleVAE optimizes non-parallel voice conversion indirectly via cyclic reconstructed spectra, yielding higher spectral accuracy, latent feature correlation, and improved converted speech quality.
-
DNN-based Speaker Embedding Using Subjective Inter-speaker Similarity for Multi-speaker Modeling in Speech Synthesis
Two new embedding algorithms (similarity vector prediction and Frobenius-norm matrix matching) trained on subjective inter-speaker scores yield d-vectors more correlated with human similarity judgments and improve TTS...
-
Forward-Backward Decoding for Regularizing End-to-End TTS
Forward-backward decoding with divergence regularization and bidirectional decoder improves end-to-end TTS robustness and naturalness by addressing exposure bias via joint L2R/R2L training.
-
A Dialogue between Causal and Traditional Representation Learning: Toward Mutual Benefits in a Unified Formulation
The paper introduces a unified formulation for representation learning with task and constraint components, arguing for mutual benefits between causal and traditional approaches and showing via experiments that causal...
-
Sessa: Selective State Space Attention
Sessa integrates attention within recurrent paths to achieve power-law memory tails and flexible non-decaying selective retrieval, outperforming baselines on long-context tasks.
-
Learning Invariant Modality Representation for Robust Multimodal Learning from a Causal Inference Perspective
CmIR uses causal inference to separate invariant causal representations from spurious ones in multimodal data, improving generalization under distribution shifts and noise via invariance, mutual information, and recon...
-
Applied AI-Enhanced RF Interference Rejection
Autoregressive transformer decoders suppress OFDM interference in FM radio signals to restore intelligible speech with low latency on GPUs like Jetson AGX Orin.
-
STAG-CN: Spatio-Temporal Apiary Graph Convolutional Network for Disease Onset Prediction in Beehive Sensor Networks
STAG-CN applies a spatio-temporal graph convolutional network to beehive sensor streams on a dual physical-climatic adjacency graph, achieving F1=0.607 at three-day disease onset prediction where climatic correlations...
-
XR-1: Towards Versatile Vision-Language-Action Models via Learning Unified Vision-Motion Representations
XR-1 introduces Unified Vision-Motion Codes learned by dual-branch VQ-VAE and applies them in a three-stage training pipeline to outperform prior VLA models on 120+ real-world manipulation tasks across six robot embodiments.
-
Learning Minimal Representations of Many-Body Physics from Snapshots of a Quantum Simulator
A VAE learns a minimal latent representation from noisy quantum simulator snapshots that correlates with the sine-Gordon equilibrium parameter and detects anomalous post-quench dynamics including frozen-in solitons.
-
End-to-End Emotional Speech Synthesis Using Style Tokens and Semi-Supervised Training
GST-Tacotron with cross-entropy loss on style tokens outperforms standard Tacotron for emotional speech synthesis with only 5% emotion-labeled data and approaches full-label performance.
-
Singing Voice Synthesis Using Deep Autoregressive Neural Networks for Acoustic Modeling
Deep autoregressive models with F0 discretization, post-processing, and self-attention prenet outperform RNNs in objective and subjective metrics for singing voice synthesis on a Chinese corpus.
-
Federated Parameter-Efficient Adaptation for Interference Mitigation at the Wireless Edge
Federated LoRA on TCNs for wireless interference suppression reduces per-round communication up to 20x while delivering 12.6% average BER improvement comparable to local adaptation.
-
Multimodal Large Language Model-Enabled Video Translation: A Role-Oriented Survey
The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.
-
ASVspoof 5: Evaluation of Spoofing, Deepfake, and Adversarial Attack Detection Using Crowdsourced Speech
The ASVspoof 5 challenge shows detection systems perform well on crowdsourced speech data but degrade under adversarial attacks and neural encoding or compression schemes.
-
Hierarchical Sequence to Sequence Voice Conversion with Limited Data
Hierarchical seq2seq model for parallel voice conversion pretrained as autoencoder on single-speaker data then adapted to limited multispeaker data, using mel spectrograms converted via wavenet vocoder.
-
Autoencoding sensory substitution
Deep recurrent autoencoders convert images to shortened audio signals that incorporate hearing models, enabling above-chance hand posture discrimination and object reaching after a few hours of training instead of months.
-
Multitasking with Alexa: How Using Intelligent Personal Assistants Impacts Language-based Primary Task Performance
IPA use disrupts content-generation writing tasks more than copying tasks because they share more cognitive resources.
Reference graph
Works this paper leans on
-
[1]
Vocaine the vocoder and applications is speech synthesis
Agiomyrgiannakis, Yannis. Vocaine the vocoder and applications is speech synthesis. In ICASSP, pp.\ 4230--4234, 2015
work page 2015
-
[2]
Bishop, Christopher M. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994
work page 1994
-
[3]
Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs
Chen, Liang-Chieh, Papandreou, George, Kokkinos, Iasonas, Murphy, Kevin, and Yuille, Alan L. Semantic image segmentation with deep convolutional nets and fully connected CRF s. In ICLR, 2015. URL http://arxiv.org/abs/1412.7062
work page Pith review arXiv 2015
-
[4]
The Vowel: I ts Nature and Structure
Chiba, Tsutomu and Kajiyama, Masato. The Vowel: I ts Nature and Structure . Tokyo-Kaiseikan, 1942
work page 1942
-
[5]
Dudley, Homer. Remaking speech. The Journal of the Acoustical Society of America, 11 0 (2): 0 169--177, 1939
work page 1939
-
[6]
An implementation of the ``algorithme \`a trous'' to compute the wavelet transform
Dutilleux, Pierre. An implementation of the ``algorithme \`a trous'' to compute the wavelet transform. In Combes, Jean-Michel, Grossmann, Alexander, and Tchamitchian, Philippe (eds.), Wavelets: Time-Frequency Methods and Phase Space, pp.\ 298--304. Springer Berlin Heidelberg, 1989
work page 1989
-
[7]
TTS synthesis with bidirectional LSTM based recurrent neural networks
Fan, Yuchen, Qian, Yao, and Xie, Feng-Long, Soong Frank K. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Interspeech, pp.\ 1964--1968, 2014
work page 1964
-
[8]
Acoustic Theory of Speech Production
Fant, Gunnar. Acoustic Theory of Speech Production. Mouton De Gruyter, 1970
work page 1970
-
[9]
DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM
Garofolo, John S., Lamel, Lori F., Fisher, William M., Fiscus, Jonathon G., and Pallett, David S. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM . NIST speech disc 1-1.1. NASA STI/Recon technical report, 93, 1993
work page 1993
-
[10]
Recent advances in G oogle real-time HMM -driven unit selection synthesizer
Gonzalvo, Xavi, Tazari, Siamak, Chan, Chun-an, Becker, Markus, Gutkin, Alexander, and Silen, Hanna. Recent advances in G oogle real-time HMM -driven unit selection synthesizer. In Interspeech, 2016. URL http://research.google.com/pubs/pub45564.html
work page 2016
-
[11]
Deep Residual Learning for Image Recognition
He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[12]
Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Comput., 9 0 (8): 0 1735--1780, 1997
work page 1997
-
[13]
A real-time algorithm for signal analysis with the help of the wavelet transform
Holschneider, Matthias, Kronland-Martinet, Richard, Morlet, Jean, and Tchamitchian, Philippe. A real-time algorithm for signal analysis with the help of the wavelet transform. In Combes, Jean-Michel, Grossmann, Alexander, and Tchamitchian, Philippe (eds.), Wavelets: Time-Frequency Methods and Phase Space, pp.\ 286--297. Springer Berlin Heidelberg, 1989
work page 1989
-
[14]
Speech acoustic modeling from raw multichannel waveforms
Hoshen, Yedid, Weiss, Ron J., and Wilson, Kevin W. Speech acoustic modeling from raw multichannel waveforms. In ICASSP, pp.\ 4624--4628. IEEE, 2015
work page 2015
-
[15]
Hunt, Andrew J. and Black, Alan W. Unit selection in a concatenative speech synthesis system using a large speech database. In ICASSP, pp.\ 373--376, 1996
work page 1996
-
[16]
Unbiased estimation of log spectrum
Imai, Satoshi and Furuichi, Chieko. Unbiased estimation of log spectrum. In EURASIP, pp.\ 203--206, 1988
work page 1988
-
[17]
Line spectrum representation of linear predictor coefficients of speech signals
Itakura, Fumitada. Line spectrum representation of linear predictor coefficients of speech signals. The Journal of the Acoust. Society of America, 57 0 (S1): 0 S35--S35, 1975
work page 1975
-
[18]
A statistical method for estimation of speech spectral density and formant frequencies
Itakura, Fumitada and Saito, Shuzo. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53–A: 0 35--42, 1970
work page 1970
-
[19]
ITU-T. Recommendation G . 711. Pulse Code Modulation (PCM) of voice frequencies, 1988
work page 1988
-
[20]
Exploring the Limits of Language Modeling
J \' o zefowicz, Rafal, Vinyals, Oriol, Schuster, Mike, Shazeer, Noam, and Wu, Yonghui. Exploring the limits of language modeling. CoRR, abs/1602.02410, 2016. URL http://arxiv.org/abs/1602.02410
work page Pith review arXiv 2016
-
[21]
Mixture autoregressive hidden M arkov models for speech signals
Juang, Biing-Hwang and Rabiner, Lawrence. Mixture autoregressive hidden M arkov models for speech signals. IEEE Trans. Acoust. Speech Signal Process., pp.\ 1404--1413, 1985
work page 1985
-
[22]
Speech analysis with multi-kernel linear prediction
Kameoka, Hirokazu, Ohishi, Yasunori, Mochihashi, Daichi, and Le Roux, Jonathan. Speech analysis with multi-kernel linear prediction. In Spring Conference of ASJ, pp.\ 499--502, 2010. (in Japanese)
work page 2010
-
[23]
Text-to-speech conversion with neural networks: A recurrent TDNN approach
Karaali, Orhan, Corrigan, Gerald, Gerson, Ira, and Massey, Noel. Text-to-speech conversion with neural networks: A recurrent TDNN approach. In Eurospeech, pp.\ 561--564, 1997
work page 1997
-
[24]
Kawahara, Hideki, Masuda-Katsuse, Ikuyo, and de Cheveign \'e , Alain. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based f_ 0 extraction: possible role of a repetitive structure in sounds. Speech Commn., 27: 0 187--207, 1999
work page 1999
-
[25]
Kawahara, Hideki, Estill, Jo, and Fujimura, Osamu. Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT . In MAVEBA, pp.\ 13--15, 2001
work page 2001
-
[26]
Input-agreement: a new mechanism for collecting data using human computation games
Law, Edith and Von Ahn, Luis. Input-agreement: a new mechanism for collecting data using human computation games. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp.\ 1197--1206. ACM, 2009
work page 2009
-
[27]
Maia, Ranniery, Zen, Heiga, and Gales, Mark J. F. Statistical parametric speech synthesis with joint estimation of acoustic and excitation model parameters. In ISCA SSW7, pp.\ 88--93, 2010
work page 2010
-
[28]
WORLD : A vocoder-based high-quality speech synthesis system for real-time applications
Morise, Masanori, Yokomori, Fumiya, and Ozawa, Kenji. WORLD : A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Trans. Inf. Syst., E99-D 0 (7): 0 1877--1884, 2016
work page 2016
-
[29]
Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones
Moulines, Eric and Charpentier, Francis. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commn., 9: 0 453--467, 1990
work page 1990
-
[30]
Muthukumar, P. and Black, Alan W. A deep learning approach to data-driven parameterizations for statistical parametric speech synthesis. arXiv:1409.8558, 2014
-
[31]
Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted Boltzmann machines. In ICML, pp. 807–814, 2010
-
[32]
Nakamura, Kazuhiro, Hashimoto, Kei, Nankaku, Yoshihiko, and Tokuda, Keiichi. Integration of spectral feature extraction and modeling for HMM-based speech synthesis. IEICE Trans. Inf. Syst., E97-D(6):1438–1448, 2014
-
[33]
Palaz, Dimitri, Collobert, Ronan, and Magimai-Doss, Mathew. Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks. In Interspeech, pp. 1766–1770, 2013
-
[34]
Peltonen, Sari, Gabbouj, Moncef, and Astola, Jaakko. Nonlinear filter design: methodologies and challenges. In IEEE ISPA, pp. 102–107, 2001
-
[35]
Poritz, Alan B. Linear predictive hidden Markov models and the speech signal. In ICASSP, pp. 1291–1294, 1982
-
[36]
Rabiner, Lawrence and Juang, Biing-Hwang. Fundamentals of Speech Recognition. Prentice Hall, 1993
-
[37]
Sagisaka, Yoshinori, Kaiki, Nobuyoshi, Iwahashi, Naoto, and Mimura, Katsuhiko. ATR ν-talk speech synthesis system. In ICSLP, pp. 483–486, 1992
-
[38]
Sainath, Tara N., Weiss, Ron J., Senior, Andrew, Wilson, Kevin W., and Vinyals, Oriol. Learning the speech front-end with raw waveform CLDNNs. In Interspeech, pp. 1–5, 2015
-
[39]
Takaki, Shinji and Yamagishi, Junichi. A deep auto-encoder based low-dimensional feature extraction from FFT spectral envelopes for statistical parametric speech synthesis. In ICASSP, pp. 5535–5539, 2016
-
[40]
Takamichi, Shinnosuke, Toda, Tomoki, Black, Alan W., Neubig, Graham, Sakriani, Sakti, and Nakamura, Satoshi. Postfilters to modify the modulation spectrum for statistical parametric speech synthesis. IEEE/ACM Trans. Audio Speech Lang. Process., 24(4):755–767, 2016
-
[41]
Theis, Lucas and Bethge, Matthias. Generative image modeling using spatial LSTMs. In NIPS, pp. 1927–1935, 2015
-
[42]
Toda, Tomoki and Tokuda, Keiichi. A speech parameter generation algorithm considering global variance for HMM-based speech synthesis. IEICE Trans. Inf. Syst., E90-D(5):816–824, 2007
-
[43]
Toda, Tomoki and Tokuda, Keiichi. Statistical approach to vocal tract transfer function estimation based on factor analyzed trajectory HMM. In ICASSP, pp. 3925–3928, 2008
-
[44]
Tokuda, Keiichi. Speech synthesis as a statistical machine learning problem. http://www.sp.nitech.ac.jp/~tokuda/tokuda_asru2011_for_pdf.pdf, 2011. Invited talk given at ASRU
-
[45]
Tokuda, Keiichi and Zen, Heiga. Directly modeling speech waveforms by neural networks for statistical parametric speech synthesis. In ICASSP, pp. 4215–4219, 2015
-
[46]
Tokuda, Keiichi and Zen, Heiga. Directly modeling voiced and unvoiced components in speech waveforms by neural networks. In ICASSP, pp. 5640–5644, 2016
-
[47]
Tuerk, Christine and Robinson, Tony. Speech synthesis using artificial neural networks trained on cepstral coefficients. In Proc. Eurospeech, pp. 1713–1716, 1993
-
[48]
Tüske, Zoltán, Golik, Pavel, Schlüter, Ralf, and Ney, Hermann. Acoustic modeling with deep neural networks using raw time signal for LVCSR. In Interspeech, pp. 890–894, 2014
-
[49]
Uria, Benigno, Murray, Iain, Renals, Steve, Valentini-Botinhao, Cassia, and Bridle, John. Modelling acoustic feature dependencies with artificial neural networks: Trajectory-RNADE. In ICASSP, pp. 4465–4469, 2015
-
[50]
van den Oord, Aäron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016a
-
[51]
van den Oord, Aäron, Kalchbrenner, Nal, Vinyals, Oriol, Espeholt, Lasse, Graves, Alex, and Kavukcuoglu, Koray. Conditional image generation with PixelCNN decoders. CoRR, abs/1606.05328, 2016b. URL http://arxiv.org/abs/1606.05328
-
[52]
Wu, Yi-Jian and Tokuda, Keiichi. Minimum generation error training with direct log spectral distortion on LSPs for HMM-based speech synthesis. In Interspeech, pp. 577–580, 2008
-
[53]
Yamagishi, Junichi. English multi-speaker corpus for CSTR voice cloning toolkit, 2012. URL http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html
-
[54]
Yoshimura, Takayoshi. Simultaneous modeling of phonetic and prosodic parameters, and characteristic conversion for HMM-based text-to-speech systems. PhD thesis, Nagoya Institute of Technology, 2002
-
[55]
Yu, Fisher and Koltun, Vladlen. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016. URL http://arxiv.org/abs/1511.07122
-
[56]
Zen, Heiga. An example of context-dependent label format for HMM-based speech synthesis in English, 2006. URL http://hts.sp.nitech.ac.jp/?Download
-
[57]
Zen, Heiga, Tokuda, Keiichi, and Kitamura, Tadashi. Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic features. Comput. Speech Lang., 21(1):153–173, 2007
-
[58]
Zen, Heiga, Tokuda, Keiichi, and Black, Alan W. Statistical parametric speech synthesis. Speech Commn., 51(11):1039–1064, 2009
-
[59]
Zen, Heiga, Senior, Andrew, and Schuster, Mike. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pp. 7962–7966, 2013
-
[60]
Zen, Heiga, Agiomyrgiannakis, Yannis, Egberts, Niels, Henderson, Fergus, and Szczepaniak, Przemysław. Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices. In Interspeech, 2016. URL https://arxiv.org/abs/1606.06061