Speech bandwidth extension with WaveNet

Archit Gupta; Brendan Shillingford; Thomas C. Walters; Yannis Assael

arXiv: 1907.04927 · v1 · pith:ANL6KNO2new · submitted 2019-07-05 · 📡 eess.AS · cs.LG · cs.SD · eess.SP


Pith reviewed 2026-05-25 01:29 UTC · model grok-4.3


keywords: speech bandwidth extension, WaveNet, GSM-FR, AMR-WB, MUSHRA, audio upsampling, neural vocoder


The pith

A WaveNet model upsamples 8 kHz GSM-FR speech to 24 kHz with quality nearly matching AMR-WB at 16 kHz.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a WaveNet neural network can extend the bandwidth of speech signals that legacy codecs such as GSM full-rate limit to an 8 kHz sample rate. Conditioned on the log-mel spectrogram of the narrowband input, the model generates a 24 kHz output signal. In listening tests, this reconstruction achieves quality only slightly below that of the AMR-WB wideband codec at 16 kHz, and closes about half of the perceptual-quality gap between the GSM-FR signal and the original 24 kHz speech. The approach offers a way to improve audio quality at receiving nodes without requiring upgrades to the entire transmission chain.

Core claim

A WaveNet model trained to upsample speech passed through the 8 kHz GSM-FR codec to 24 kHz reconstructs audio only slightly lower in quality than that of the AMR-WB codec at 16 kHz, and closes around half the gap in perceptual quality between the GSM-FR-encoded signal and the original speech sampled at 24 kHz. The same model applied to uncompressed 8 kHz audio reconstructs audio of slightly better quality than 16 kHz AMR-WB.

What carries the argument

A WaveNet model conditioned on the log-mel spectrogram of the bandwidth-constrained input signal generates the higher-sample-rate output.
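To make the conditioning signal concrete, here is a minimal NumPy sketch of a log-mel spectrogram computed from an 8 kHz waveform. The frame length, hop size, number of mel bands, and the mel-scale formula are illustrative assumptions; this page does not state the values the authors used.

```python
import numpy as np

def hz_to_mel(f):
    # Common HTK-style mel formula (an assumption, not taken from the paper).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters evenly spaced on the mel scale: (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_mels + 2))
    bank = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def log_mel_spectrogram(x, sample_rate=8000, n_fft=256, hop=80, n_mels=40):
    """Return log-mel features with shape (n_frames, n_mels)."""
    x = np.asarray(x, dtype=float)
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(mel + 1e-6)  # small floor avoids log(0)

# Usage: one second of a synthetic 500 Hz tone sampled at 8 kHz.
t = np.arange(8000) / 8000.0
features = log_mel_spectrogram(np.sin(2 * np.pi * 500 * t))
print(features.shape)  # (97, 40)
```

In a conditional WaveNet, features like these would be upsampled in time and fed to every layer as the conditioning input; that part is not shown here.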

If this is right

The model can be deployed at communication nodes to extend bandwidth of incoming legacy signals without changing transmission channels.
It achieves this for both GSM-FR compressed inputs and clean 8 kHz inputs.
The approach closes roughly half the perceptual quality gap to full 24 kHz originals.
Quality remains only slightly below AMR-WB at 16 kHz in the reported tests.
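The "half the gap" statement can be read as a simple ratio of score differences. The sketch below shows that arithmetic; the function name and the numeric scores are hypothetical and are not values reported in the paper.

```python
def fraction_of_gap_closed(baseline, model, reference):
    """Share of the baseline-to-reference score gap recovered by the model."""
    gap = reference - baseline
    if gap <= 0:
        raise ValueError("reference must score above baseline")
    return (model - baseline) / gap

# Hypothetical MUSHRA-style scores on a 0-100 scale (illustration only):
# narrowband codec output = 40, model output = 60, 24 kHz original = 80.
print(fraction_of_gap_closed(baseline=40.0, model=60.0, reference=80.0))  # 0.5
```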

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Nodes equipped with this model could deliver incremental quality gains in heterogeneous networks that mix legacy and modern codecs.
The log-mel conditioning technique might transfer to related audio tasks such as denoising or packet-loss concealment.
Larger training sets covering more acoustic conditions could narrow the remaining quality gap further.

Load-bearing premise

The MUSHRA scores from the authors' chosen test set and listener pool serve as a reliable proxy for real-world perceptual quality across diverse speakers, accents, and network conditions.

What would settle it

A new MUSHRA test using a larger and more diverse set of speakers, accents, and transmission conditions that yields model scores substantially below AMR-WB would falsify the central claim.

Figures

Figures reproduced from arXiv: 1907.04927 by Archit Gupta, Brendan Shillingford, Thomas C. Walters, Yannis Assael.

**Figure 3.** Our model (WAVENET 8KHZ and WAVENET GSMFR), trained on 8kHz GSM-FR audio signals and evaluated with 8kHz uncompressed and 8kHz GSM-FR audio, is evaluated using the MUSHRA listening test methodology. The model is compared against the initial audio at 24KHZ and 8KHZ, and the AMR-WB 16kHz and GSM-FR 8kHz codecs. MUSHRA tests show that the model trained to predict to 24kHz from 8kHz audio directly performs sl…

**Figure 2.** Illustration of the processing pipeline. The input audio,

Original abstract

Large-scale mobile communication systems tend to contain legacy transmission channels with narrowband bottlenecks, resulting in characteristic "telephone-quality" audio. While higher quality codecs exist, due to the scale and heterogeneity of the networks, transmitting higher sample rate audio with modern high-quality audio codecs can be difficult in practice. This paper proposes an approach where a communication node can instead extend the bandwidth of a band-limited incoming speech signal that may have been passed through a low-rate codec. To this end, we propose a WaveNet-based model conditioned on a log-mel spectrogram representation of a bandwidth-constrained speech audio signal of 8 kHz and audio with artifacts from GSM full-rate (FR) compression to reconstruct the higher-resolution signal. In our experimental MUSHRA evaluation, we show that a model trained to upsample to 24kHz speech signals from audio passed through the 8kHz GSM-FR codec is able to reconstruct audio only slightly lower in quality than that of the Adaptive Multi-Rate Wideband (AMR-WB) audio codec at 16kHz, and closes around half the gap in perceptual quality between the original encoded signal and the original speech sampled at 24kHz. We further show that when the same model is passed 8kHz audio that has not been compressed, it is again able to reconstruct audio of slightly better quality than 16kHz AMR-WB, in the same MUSHRA evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

WaveNet bandwidth extension from GSM-FR gets close to AMR-WB on MUSHRA but the results lack the experimental details needed to judge reliability.


The one thing to know is that this WaveNet model, conditioned on log-mel spectrograms from 8 kHz GSM-FR input, produces 24 kHz output that scores only slightly below AMR-WB in their MUSHRA test and closes about half the quality gap to the original 24 kHz speech. What is new is the specific setup for handling GSM full-rate compression artifacts with this architecture and the direct perceptual comparison to the wideband codec. The paper does well at showing a workable receiver-side upgrade for legacy narrowband channels that could be deployed without changing the transmission side.

The soft spots come down to the experimental evidence. The abstract gives no information on training data volume or diversity, listener count, randomization procedures, or any statistical analysis of the MUSHRA scores. Without those, it's difficult to know if the reported improvement is reliable or if it depends on particular test conditions. The stress-test note about the scores not generalizing as a proxy for real perceptual quality across diverse conditions holds up here because nothing in the provided description counters it. If the full paper has those details, that would change the picture, but based on the abstract they are absent.

This paper is for audio engineers and researchers working on speech enhancement and bandwidth extension in telecommunications. Someone looking for applied neural methods in codec improvement would find the approach and results relevant. I would recommend sending it for peer review to allow proper checking of the methods and to see if the listening test results can be replicated or extended.

Referee Report

2 major / 0 minor

Summary. The paper proposes conditioning a WaveNet on log-mel spectrograms extracted from 8 kHz GSM-FR compressed speech to generate 24 kHz output. MUSHRA listening tests are reported to show that the model produces audio only slightly lower in quality than AMR-WB at 16 kHz and closes roughly half the perceptual gap to the 24 kHz reference; a similar result holds for uncompressed 8 kHz input.

Significance. If the listening-test results prove robust, the approach offers a practical post-processing method to mitigate narrowband bottlenecks in legacy mobile networks without requiring codec or transmission changes. The direct comparison against AMR-WB and the use of a generative model for this task are positive aspects.

major comments (2)

[abstract and experimental evaluation] The central claim rests on MUSHRA scores that close half the quality gap to the 24 kHz reference and place the model only slightly below AMR-WB. However, the manuscript provides no details on test-set size, speaker/accent diversity, number of listeners, randomization, error bars, or statistical significance testing (see abstract and experimental evaluation sections). This absence prevents assessment of whether the reported numbers support the generalization implied by the claim.
[model and experimental sections] No ablation of the log-mel conditioning or of the GSM-FR artifact handling is presented, leaving unclear how much of the reported improvement is attributable to the WaveNet architecture versus the conditioning features.
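The first comment asks for error bars on the MUSHRA scores. As a rough illustration of what that could look like, the sketch below computes a mean rating with a normal-approximation 95% confidence interval per condition and checks whether two intervals overlap. The ratings are made-up numbers, the helper names are hypothetical, and an overlap check is only an informal screen, not a substitute for a proper significance test.

```python
import numpy as np

def mean_with_ci(scores, z=1.96):
    """Mean rating and a normal-approximation confidence interval."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    sem = scores.std(ddof=1) / np.sqrt(scores.size)  # standard error of the mean
    return mean, mean - z * sem, mean + z * sem

def intervals_overlap(a, b):
    """a and b are (mean, low, high) tuples as returned by mean_with_ci."""
    return a[1] <= b[2] and b[1] <= a[2]

# Hypothetical listener ratings on the 0-100 MUSHRA scale (not from the paper).
model_scores = [62, 58, 65, 60, 57, 63, 61, 59]
codec_scores = [66, 61, 64, 68, 60, 65, 63, 67]

model_ci = mean_with_ci(model_scores)
codec_ci = mean_with_ci(codec_scores)
print(model_ci, codec_ci, intervals_overlap(model_ci, codec_ci))
```

With intervals like these reported per condition, a reader could judge whether "only slightly below AMR-WB" reflects a real difference or falls within listener variability.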

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly where feasible.


Referee: [abstract and experimental evaluation] The central claim rests on MUSHRA scores that close half the quality gap to the 24 kHz reference and place the model only slightly below AMR-WB. However, the manuscript provides no details on test-set size, speaker/accent diversity, number of listeners, randomization, error bars, or statistical significance testing (see abstract and experimental evaluation sections). This absence prevents assessment of whether the reported numbers support the generalization implied by the claim.

Authors: We agree that these experimental details are necessary to properly assess the results. The revised manuscript will expand the experimental evaluation section to report the test-set size, speaker diversity, number of listeners, randomization procedure, and any available statistical information. We will explicitly note that error bars and formal significance testing were not performed in the original experiments.

revision: yes

Referee: [model and experimental sections] No ablation of the log-mel conditioning or of the GSM-FR artifact handling is presented, leaving unclear how much of the reported improvement is attributable to the WaveNet architecture versus the conditioning features.

Authors: We acknowledge that no explicit ablation studies are included. In the revision we will add discussion in the model section on the rationale for log-mel conditioning and its role in handling GSM-FR artifacts. A full ablation would require new training runs that are not feasible at this stage; the existing comparisons against AMR-WB and the uncompressed 8 kHz case provide supporting evidence for the overall approach.

revision: partial

Circularity Check

0 steps flagged

No circularity: claims rest on independent MUSHRA listening tests


The paper trains a conditional WaveNet on log-mel features of 8 kHz GSM-FR input to generate 24 kHz output and reports perceptual quality via MUSHRA scores. These scores are external human judgments on held-out utterances; they are not computed from model parameters, not used as training targets, and not obtained by fitting or renaming any internal quantity. No equations, self-citations, or uniqueness theorems are invoked to derive the reported gap-closing performance; the result is a direct empirical measurement.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that MUSHRA scores on the authors' test material generalize to real telephony conditions and that the WaveNet architecture can be trained stably on the chosen spectrogram representation; no new physical entities or ad-hoc constants are introduced.

axioms (1)

Domain assumption: MUSHRA listening tests provide a valid ordinal ranking of perceptual audio quality that transfers across listener pools and acoustic conditions.
Invoked when the abstract equates the reported MUSHRA scores with the quality gap closure.


Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 3 internal anchors

[1] Speech bandwidth extension with WaveNet

INTRODUCTION AND RELATED WORK. Legacy transmission channels are still part of many large-scale communication systems. These channels introduce bottlenecks, limiting the bandwidth and the quality of speech. Often this is referenced as ‘telephone-quality’ audio. Upgrading all parts of the infrastructure to be compatible with higher quality audio codecs ca...

[2] (excerpt from the paper)

or to predict the upsampled waveform directly [6, 9, 10], leading to quality gains over earlier methods. In our experimental evaluation, we assess the ability of our proposed model to perform bandwidth extension on narrowband signals. To illustrate the impact of our work, we show that a model trained to upsample to 24kHz speech signals passed through ...

[3] (excerpt from the paper)

TRAINING SETUP. 2.1. Model Architecture. WaveNet is a generative model that models the joint probability of a waveform $x = \{x_1, \ldots, x_T\}$ as a product of conditional probabilities given the samples at previous timesteps. A conditional WaveNet model takes an additional input variable $h$ and models this conditional distribution as $p(x \mid h) = \prod_{t=1}^{T} p(x_t \mid x_1$...

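The excerpt in entry [3] is cut off partway through the formula. For orientation, the conditional factorization as it is usually written in the general WaveNet literature is given below. This is the standard form, supplied here as an assumption about how the truncated line continues; on this page, $h$ would correspond to the log-mel features of the narrowband input.

```latex
p(x \mid h) = \prod_{t=1}^{T} p\left(x_t \mid x_1, \ldots, x_{t-1}, h\right)
```

Each factor is the distribution of one output sample given all earlier samples and the conditioning signal, which is why generation proceeds one sample at a time.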
[4] (excerpt from the paper)

EXPERIMENTAL EVALUATION. 3.1. Setup. In this evaluation we are primarily interested in the case of speech enhancement in the setting of a fixed legacy audio coding pathway, such as calls on standard GSM mobile networks. In this case, the codec typically operates with a bandwidth of 4kHz, leading to an audio waveform with an 8kHz sample rate. To generate th...

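The relation in entry [4] between a 4 kHz bandwidth and an 8 kHz sample rate is the Nyquist limit. As a hedged illustration, the NumPy sketch below derives an 8 kHz narrowband signal from a 24 kHz one by low-pass filtering and keeping every third sample. The filter design, cutoff, and function names are assumptions made for this example; the excerpt is truncated before it says how the authors prepared their data, and no GSM-FR coding step is included here.

```python
import numpy as np

def lowpass_fir(cutoff_hz, sample_rate, num_taps=101):
    """Windowed-sinc low-pass filter with unit gain at DC."""
    n = np.arange(num_taps) - (num_taps - 1) / 2.0
    fc = cutoff_hz / sample_rate  # cutoff in cycles per sample
    taps = 2.0 * fc * np.sinc(2.0 * fc * n) * np.hamming(num_taps)
    return taps / taps.sum()

def downsample_by_3(x, sample_rate=24000):
    """Low-pass below the new Nyquist frequency, then keep every third sample."""
    new_nyquist = sample_rate / 6.0  # 4 kHz when starting from 24 kHz
    taps = lowpass_fir(cutoff_hz=0.9 * new_nyquist, sample_rate=sample_rate)
    filtered = np.convolve(np.asarray(x, dtype=float), taps, mode="same")
    return filtered[::3], sample_rate // 3

# Usage: a 1 kHz tone survives, a 6 kHz tone (above 4 kHz) is filtered out.
t = np.arange(24000) / 24000.0
kept, rate = downsample_by_3(np.sin(2 * np.pi * 1000 * t))
removed, _ = downsample_by_3(np.sin(2 * np.pi * 6000 * t))
print(rate, len(kept))  # 8000 8000
```

Content between 4 kHz and 12 kHz that this step discards is what a bandwidth-extension model has to regenerate.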
[5] (excerpt from the paper)

CONCLUSIONS. This work introduces a new WaveNet-based model for speech bandwidth extension. The model is able to reconstruct 24kHz audio from 8kHz signals that is of similar or better quality to that produced by the AMR-WB codec at 16kHz. Our upsampling method produces “HD-Voice”-quality audio from standard telephony-quality and GSM-quality audio, sho...

[6] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” in SSW, 2016, p. 125.

[7] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, et al., “Natural tts synthesis by conditioning wavenet on mel spectrogram predictions,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4779–4783.

[8] W. B. Kleijn, F. S. Lim, A. Luebs, J. Skoglund, F. Stimberg, Q. Wang, and T. C. Walters, “WaveNet based low rate speech coding,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 676–680.

[9] C. Garbacea, A. van den Oord, Y. Li, F. S. C. Lim, A. Luebs, O. Vinyals, and T. C. Walters, “Low bit-rate speech coding with VQ-VAE and a WaveNet decoder,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.

[10] E. R. Larsen and R. M. Aarts, Audio Bandwidth Extension: Application of Psychoacoustics, Signal Processing and Loudspeaker Design. USA: John Wiley & Sons, Inc., 2004.

[11] V. Kuleshov, S. Z. Enam, and S. Ermon, “Audio super resolution using neural networks,” arXiv preprint arXiv:1708.00853, 2017.

[12] ETSI, “GSM Full Rate Speech Transcoding,” European Digital Cellular Telecommunications System, Tech. Rep. 06.10, 02 1992, version 3.2.0. [Online]. Available: https://www.etsi.org/deliver/etsi_gts/06/0610/03.02.00_60/gsmts_0610sv030200p.pdf

[13] J. Abel and T. Fingscheidt, “Artificial speech bandwidth extension using deep neural networks for wideband spectral envelope estimation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. PP, pp. 1–1, 10 2017.

[14] Z.-H. Ling, Y. Ai, Y. Gu, and L.-R. Dai, “Waveform modeling and generation using hierarchical recurrent neural networks for speech bandwidth extension,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 5, pp. 883–894, 2018.

[15] Y. Gu and Z.-H. Ling, “Waveform modeling using stacked dilated convolutional neural networks for speech bandwidth extension,” in INTERSPEECH, 2017, pp. 1123–1127.

[16] 3GPP, “Mandatory speech CODEC speech processing functions; AMR speech CODEC; General description,” 3rd Generation Partnership Project (3GPP), Technical Specification (TS) 26.071, 06 2018, version 15.0.0. [Online]. Available: https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=1386

[17] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. van den Driessche, E. Lockhart, L. Cobo, F. Stimberg, N. Casagrande, D. Grewe, S. Noury, S. Dieleman, E. Elsen, N. Kalchbrenner, H. Zen, A. Graves, H. King, T. Walters, D. Belov, and D. Hassabis, “Parallel WaveNet: Fast high-fidelity speech synthesis,” in Proceedings of t...

[18] R. Prenger, R. Valle, and B. Catanzaro, “Waveglow: A flow-based generative network for speech synthesis,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.

[19] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. Oord, S. Dieleman, and K. Kavukcuoglu, “Efficient neural audio synthesis,” in International Conference on Machine Learning, 2018, pp. 2415–2424.

[20] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma, “Pixelcnn++: A pixelcnn implementation with discretized logistic mixture likelihood and other modifications,” in International Conference on Learning Representations (ICLR), 2017.

[21] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, “LibriTTS: A corpus derived from librispeech for text-to-speech,” arXiv preprint arXiv:1904.02882, 2019.

[22] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.

[23] D. P. Kingma and J. Ba, “ADAM: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR), 2015.

[24] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al., “In-datacenter performance analysis of a tensor processing unit,” in International Symposium on Computer Architecture (ISCA). IEEE, 2017, pp. 1–12.

[25] International Telecommunication Union, “Method for the subjective assessment of intermediate sound quality (MUSHRA),” ITU-R Recommendation BS.1534-1, Tech. Rep., 2001.
