Analysis by Adversarial Synthesis -- A Novel Approach for Speech Vocoding

Ahmed Mustafa; Andreas Maier; Arijit Biswas; Christian Bergler; Julia Schottenhamml

arxiv: 1907.00772 · v1 · pith:6GQOUCRUnew · submitted 2019-07-01 · eess.AS · cs.LG · cs.SD

Pith reviewed 2026-05-25 11:29 UTC · model grok-4.3

classification eess.AS · cs.LG · cs.SD

keywords speech vocoding · generative adversarial networks · glottal excitation · linear predictive coding · neural vocoder · perceptual quality · one-shot generation

The pith

A conditional GAN generates speech from a compressed glottal excitation, and LPC refinement of its output yields higher perceptual quality than classical vocoders.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a neural vocoder that first uses a conditional GAN to synthesize a speech waveform from a highly compressed representation of the glottal excitation signal. This synthesized waveform is then refined by applying the original speech's linear predictive coding coefficients to restore spectral envelope details. The resulting signals are evaluated on a 30-speaker dataset using both subjective listening tests and objective metrics, where they score higher than conventional parametric vocoders. Because the GAN operates in a single forward pass rather than sample-by-sample autoregression, generation is much faster than WaveNet-style models while maintaining the low bit-rate advantage of parametric coding. The central demonstration is that adversarial training on excitation can replace hand-crafted excitation models without sacrificing, and sometimes improving, naturalness after LPC post-processing.

Core claim

A conditional GAN is trained to map a compact glottal-excitation code to a full-band speech waveform; the output is then filtered with the original LPC coefficients to enforce the correct spectral envelope. On a dataset of 30 male and female speakers this pipeline produces waveforms whose subjective and objective quality exceeds that of classical vocoders while allowing one-shot generation instead of autoregressive sampling.
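The source-filter split behind this claim can be illustrated with a small sketch. The pure-Python example below is not the authors' code: it estimates LPC coefficients with the autocorrelation method and the Levinson-Durbin recursion, obtains the residual (the stand-in for the glottal excitation) by inverse filtering, and rebuilds the waveform with the all-pole synthesis filter. The toy AR(2) signal, the prediction order of 2, and all function names are assumptions chosen for brevity.

```python
import random

def autocorr(x, max_lag):
    """Biased autocorrelation r[0..max_lag]."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return [1, a1, ..., ap] for the whitening filter A(z) = 1 + sum a_k z^-k."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient of stage i
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k                  # remaining prediction error energy
    return a

def analysis_filter(x, a):
    """Residual e[n] = sum_k a[k] * x[n-k] (FIR inverse of the envelope)."""
    return [sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(x))]

def synthesis_filter(e, a):
    """All-pole filter 1/A(z): x[n] = e[n] - sum_{k>=1} a[k] * x[n-k]."""
    x = []
    for n in range(len(e)):
        acc = e[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * x[n - k]
        x.append(acc)
    return x

# Toy "speech": white noise shaped by a known all-pole envelope (assumption).
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4000)]
x = synthesis_filter(noise, [1.0, -1.3, 0.6])

a = levinson_durbin(autocorr(x, 2), 2)   # estimated envelope, close to [1, -1.3, 0.6]
e = analysis_filter(x, a)                # residual, much lower energy than x
x_rec = synthesis_filter(e, a)           # matches x up to rounding error
```

In this toy setting the residual carries far less energy than the waveform, which is the property that makes it a natural candidate for compression; the envelope is then restored by the synthesis filter.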

What carries the argument

Conditional GAN that maps compressed glottal excitation to speech waveform, followed by LPC-based spectral refinement

If this is right

One-pass generation removes the real-time latency penalty of autoregressive vocoders.
The same architecture can in principle be retrained on any parametric front-end that supplies an excitation code and LPC coefficients.
Objective and subjective scores both improve, suggesting the adversarial loss captures perceptual attributes missed by traditional excitation models.
The method keeps the transmission rate of classical parametric coders while raising reconstruction quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the GAN can be conditioned on additional side information such as speaker identity, the approach could support speaker-adaptive low-rate coding without retraining the entire pipeline.
The same excitation-to-waveform GAN might be tested on non-speech audio such as music or environmental sounds that also admit an LPC-like decomposition.
Integration with modern neural audio codecs could replace the classical LPC stage with a learned spectral envelope estimator while retaining the adversarial excitation generator.

Load-bearing premise

The GAN output, once filtered with the original LPC envelope, will be perceptually cleaner than the output of classical excitation models, without introducing new artifacts or requiring speaker-specific retuning.
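One plausible reading of the refinement step is a cross synthesis: whiten the generated signal with its own LPC filter, then re-shape it with the envelope of the original speech. The sketch below assumes that reading and is not taken from the paper. It uses a first-order predictor to stay short, and the two toy signals (`original`, `fake`) and all function names are hypothetical.

```python
import random

def lpc1(x):
    """First-order LPC coefficient a1 of A(z) = 1 + a1 z^-1 (autocorrelation method)."""
    r0 = sum(v * v for v in x)
    r1 = sum(x[i] * x[i + 1] for i in range(len(x) - 1))
    return -r1 / r0

def whiten(x, a1):
    """Analysis filter: e[n] = x[n] + a1 * x[n-1]."""
    return [x[n] + (a1 * x[n - 1] if n > 0 else 0.0) for n in range(len(x))]

def shape(e, a1):
    """Synthesis filter 1/A(z): y[n] = e[n] - a1 * y[n-1]."""
    y, prev = [], 0.0
    for v in e:
        prev = v - a1 * prev
        y.append(prev)
    return y

def ar1(coef, n, rng):
    """Toy signal: white Gaussian noise through a one-pole filter."""
    return shape([rng.gauss(0.0, 1.0) for _ in range(n)], -coef)

rng = random.Random(0)
original = ar1(0.9, 4000, rng)   # stand-in for the original speech
fake = ar1(0.5, 4000, rng)       # stand-in for a generator output with the wrong envelope

a_orig = lpc1(original)
a_fake = lpc1(fake)
refined = shape(whiten(fake, a_fake), a_orig)   # cross synthesis
a_refined = lpc1(refined)                       # now close to a_orig
```

The sketch also makes the premise visible: the refinement can only fix the envelope. Any artifact that survives whitening, such as a distorted excitation, is passed through the original envelope unchanged.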

What would settle it

A controlled AB listening test on the same 30-speaker set in which listeners consistently prefer the classical vocoder output or rate the GAN output as containing audible artifacts.
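Whether listeners "consistently prefer" one system can be checked with an exact two-sided sign test on the AB votes. The following sketch is one simple way to do that and is not the paper's evaluation protocol; the vote counts are hypothetical.

```python
from math import comb

def sign_test_p_value(prefer_a, n_trials):
    """Exact two-sided p-value under the null of no preference (p = 0.5).

    Sums the probability of every outcome that is at most as likely
    as the observed count.
    """
    observed = comb(n_trials, prefer_a)
    favourable = sum(comb(n_trials, i) for i in range(n_trials + 1)
                     if comb(n_trials, i) <= observed)
    return favourable / 2 ** n_trials

# Hypothetical outcome: 15 of 20 listeners prefer the GAN-based output.
p_value = sign_test_p_value(15, 20)   # about 0.041
```

A small p-value in favour of the classical vocoder, or no significant preference at all, would count against the quality claim.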

Figures

Figures reproduced from arXiv: 1907.00772 by Ahmed Mustafa, Andreas Maier, Arijit Biswas, Christian Bergler, Julia Schottenhamml.

**Figure 1.** Illustration of AbAS. […] [con]verted into a parametric representation for the desired speech signal. To accomplish this, the glottal excitation signal, represented by the residual from an LPC analysis filtering of the speech waveform [1], is fed to a neural encoder network. The residual is a noise-like signal as it is uncorrelated and almost spectrally-flat [1]. Thus, it is a good candidate to be compressed by t…

**Figure 2.** The adversarial upsampler network. […] shaped using transposed convolution without activation. This noise is used for compensating the missing fine details of the speech signal during the residual compression task, e.g. unvoiced speech parts and background noise. It is then concatenated along the channel dimension with the actual signal generation path at every upsampling stage. The upsampler block diagram …

**Figure 3.** GANs for speech vocoding: A fake speech signal is generated by CGAN (middle) at 16 kHz from the 1 kHz learned compression of the residual signal. This fake signal preserves the main spectral and prosodic features of the original speech (top) especially at the low frequency bands. However, it is more challenging to accurately reconstruct the high frequency details and the background noise of the original si…

**Figure 4.** Outperformance of the proposed softmax gating over the sigmoid one in terms of the L1 reconstruction loss. […] The proposed AbAS approach is assessed by objective and subjective perceptual evaluation measures. This is done in comparison with the classical vocoder introduced by Hedelin [22] and refined by Klejsa et al. [6]. There is no quantization applied to the compressed representation of signals for both …

**Figure 6.** Higher discriminator loss for generating fake residual compared to fake speech, which indicates a lower quality for the generated residual samples. […] 5. Conclusions. This paper introduces a new method for neural speech vocoding, with much faster generation than autoregressive generative models and higher perceptual quality than classical vocoding. The method, which is called analysis by adversarial synthesi…
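The captions for Figures 2 and 3 describe a 16x upsampling (1 kHz code to 16 kHz waveform) in which a noise path is concatenated with the signal path at every stage. A toy version of that data flow is sketched below. The number of stages, the fixed kernels, the noise gain, and the channel mixing are all assumptions; a real generator would learn these weights and use non-linear layers.

```python
import random

def transposed_conv1d(x, kernel, stride):
    """1-D transposed convolution: each input sample stamps a scaled kernel."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, v in enumerate(x):
        for j, w in enumerate(kernel):
            out[i * stride + j] += v * w
    return out

def upsample(code, rng, noise_gain=0.1, stages=4):
    """Double the length `stages` times; 4 stages give the 16x of 1 kHz -> 16 kHz."""
    signal = list(code)
    for _ in range(stages):
        noise = [rng.gauss(0.0, 1.0) for _ in signal]
        # Two channels: the signal path and the noise path, both upsampled by 2.
        channels = [transposed_conv1d(signal, [1.0, 1.0], 2),
                    transposed_conv1d(noise, [1.0, 1.0], 2)]
        # Fixed 1x1 mixing of the concatenated channels (learned in practice).
        signal = [s + noise_gain * z for s, z in zip(*channels)]
    return signal

code = [float(i % 7) for i in range(1000)]       # 1 s of a hypothetical 1 kHz code
wave = upsample(code, random.Random(0))          # 16000 samples, i.e. 1 s at 16 kHz
```

With `noise_gain=0.0` the sketch reduces to a sample-and-hold of the code, which shows why an extra noise input is a plausible way to supply detail that the compressed code cannot carry.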

Original abstract

Classical parametric speech coding techniques provide a compact representation for speech signals. This affords a very low transmission rate but with a reduced perceptual quality of the reconstructed signals. Recently, autoregressive deep generative models such as WaveNet and SampleRNN have been used as speech vocoders to scale up the perceptual quality of the reconstructed signals without increasing the coding rate. However, such models suffer from a very slow signal generation mechanism due to their sample-by-sample modelling approach. In this work, we introduce a new methodology for neural speech vocoding based on generative adversarial networks (GANs). A fake speech signal is generated from a very compressed representation of the glottal excitation using conditional GANs as a deep generative model. This fake speech is then refined using the LPC parameters of the original speech signal to obtain a natural reconstruction. The reconstructed speech waveforms based on this approach show a higher perceptual quality than the classical vocoder counterparts according to subjective and objective evaluation scores for a dataset of 30 male and female speakers. Moreover, the usage of GANs enables to generate signals in one-shot compared to autoregressive generative models. This makes GANs promising for exploration to implement high-quality neural vocoders.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The GAN vocoder idea is reasonable for speed but the quality superiority claim rests on an unfair comparison that supplies unquantized original LPC for refinement.

The main thing to know is that the central quality claim does not hold up cleanly. The method generates a signal with a conditional GAN from the compressed glottal excitation and then refines it with the LPC parameters of the original speech signal. That refinement step supplies an exact spectral envelope, which classical parametric vocoders no longer have once their parameters are quantized for transmission. The abstract states the use of the original LPC parameters explicitly, and nothing in the description indicates that the classical baselines received the same unquantized LPC or that the comparison was re-run under quantization. Any reported gains on the 30-speaker set could therefore trace to the envelope rather than to the adversarial excitation modeling. That is a real problem for the main result.

The new element is the specific pipeline: adversarial synthesis from the glottal excitation followed by LPC refinement, done in one shot rather than autoregressively. This directly targets the slow generation speed of WaveNet-style models, which is a practical concern in the subfield. The paper does a straightforward job of laying out that motivation and showing that GANs can produce usable signals without sample-by-sample iteration. The evaluation reports higher subjective and objective scores, which is at least a starting point even though the abstract gives no numerical scores or statistical details.

No other major internal contradictions appear. The approach stays empirical and cites the relevant autoregressive baselines without circularity.

This is for people working on neural vocoders and low-bitrate speech coding. A reader already following the shift from parametric to generative methods would get a clear description of one alternative, though they would need the quantization issue fixed before treating the quality numbers as decisive. I would send it to peer review so referees can examine the full experimental setup and ask for a corrected comparison.

Referee Report

2 major / 1 minor

Summary. The paper introduces a neural vocoding method that uses a conditional GAN to generate a speech signal from a compressed glottal excitation representation; the output is then refined by the LPC parameters of the original signal to produce the final waveform. It claims this yields higher perceptual quality than classical vocoders (per subjective and objective scores on 30 speakers) while enabling one-shot generation, unlike slow autoregressive models such as WaveNet.

Significance. If the quality gains are shown to hold under a fair comparison that respects standard quantization constraints, the hybrid GAN-plus-LPC approach would be a useful contribution to low-rate parametric coding by combining the speed of non-autoregressive generation with established spectral modeling.

major comments (2)

[Abstract] Abstract (refinement step): the method explicitly refines the cGAN output using 'the LPC parameters of the original speech signal.' Classical vocoders quantize LPC coefficients (typically 10-20 bits per frame); supplying unquantized originals supplies a perfect envelope unavailable at a real decoder. The manuscript must clarify the bit allocation used for LPC in both the proposed system and the classical baselines, and must report results with quantized LPC to substantiate the superiority claim.
[Evaluation] Evaluation (subjective/objective scores): the abstract asserts higher quality than classical counterparts on 30 speakers but supplies no numerical values, statistical tests, or description of the exact classical baselines (including their quantization settings). Without these details the central claim cannot be assessed.
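The quantization concern in the first major comment can be made concrete. A common way to quantize an LPC envelope is to work on reflection coefficients, which lie in (-1, 1) for a stable filter, and to convert them back to predictor coefficients afterwards. The sketch below applies a uniform quantizer at two bit depths. It is an illustration only: the frame values, the bit depths, and the function names are hypothetical, and deployed codecs typically use more elaborate schemes.

```python
def reflection_to_lpc(ks):
    """Step-up recursion: reflection coefficients -> [1, a1, ..., ap]."""
    a = [1.0]
    for k in ks:
        order = len(a)
        a = [1.0] + [a[j] + k * a[order - j] for j in range(1, order)] + [k]
    return a

def quantize_uniform(k, bits):
    """Uniform quantizer on [-1, 1]; outputs stay strictly inside (-1, 1)."""
    levels = 2 ** bits
    step = 2.0 / levels
    index = min(levels - 1, max(0, int((k + 1.0) / step)))
    return -1.0 + (index + 0.5) * step

def max_coeff_error(ks, bits):
    """Largest predictor-coefficient deviation caused by quantizing ks."""
    exact = reflection_to_lpc(ks)
    coarse = reflection_to_lpc([quantize_uniform(k, bits) for k in ks])
    return max(abs(u - v) for u, v in zip(exact, coarse))

ks = [-0.8125, 0.6]                  # hypothetical order-2 frame, a = [1, -1.3, 0.6]
err_4bit = max_coeff_error(ks, 4)    # 8 bits per frame in total
err_8bit = max_coeff_error(ks, 8)    # 16 bits per frame in total
```

Running the same refinement with `quantize_uniform` applied to both the proposed system and the baselines is the kind of controlled comparison the comment asks for.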

minor comments (1)

The abstract states that GANs enable 'one-shot' generation; the manuscript should quantify the actual inference latency relative to WaveNet/SampleRNN on the same hardware.
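A standard way to report the latency requested here is the real-time factor: generation time divided by the duration of the audio produced, measured on the same hardware for every model. The sketch below shows only the measurement harness. The two generators are trivial placeholders and say nothing about the relative speed of actual GAN or WaveNet implementations.

```python
import time

def one_shot_generator(n_samples):
    """Placeholder for a model that emits the whole waveform in one call."""
    return [0.0] * n_samples

def autoregressive_generator(n_samples):
    """Placeholder for a model where each sample depends on the previous one."""
    out, prev = [], 0.0
    for _ in range(n_samples):
        prev = 0.5 * prev + 0.1
        out.append(prev)
    return out

def real_time_factor(generate, n_samples, sample_rate):
    """Return (RTF, waveform); RTF below 1 means faster than real time."""
    start = time.perf_counter()
    wave = generate(n_samples)
    elapsed = time.perf_counter() - start
    return elapsed / (len(wave) / sample_rate), wave

rtf_one_shot, wave_a = real_time_factor(one_shot_generator, 16000, 16000)
rtf_autoregressive, wave_b = real_time_factor(autoregressive_generator, 16000, 16000)
```

In practice each measurement would be repeated several times after a warm-up run, and the median reported together with the hardware used.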

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below, with plans for revision where appropriate to strengthen the manuscript.

Referee: [Abstract] Abstract (refinement step): the method explicitly refines the cGAN output using 'the LPC parameters of the original speech signal.' Classical vocoders quantize LPC coefficients (typically 10-20 bits per frame); supplying unquantized originals supplies a perfect envelope unavailable at a real decoder. The manuscript must clarify the bit allocation used for LPC in both the proposed system and the classical baselines, and must report results with quantized LPC to substantiate the superiority claim.

Authors: We agree that the current description relies on unquantized LPC parameters from the original signal, which does not fully reflect a realistic low-rate coding scenario. In the revised manuscript we will explicitly state the bit allocation for LPC coefficients (e.g., bits per frame) used in both the proposed system and all classical baselines, and we will add new results obtained with quantized LPC parameters to demonstrate that the reported quality advantage is retained under standard quantization constraints.

revision: yes

Referee: [Evaluation] Evaluation (subjective/objective scores): the abstract asserts higher quality than classical counterparts on 30 speakers but supplies no numerical values, statistical tests, or description of the exact classical baselines (including their quantization settings). Without these details the central claim cannot be assessed.

Authors: The abstract is space-constrained and therefore summarizes the contribution at a high level; the full numerical scores, statistical tests, and precise baseline descriptions (including quantization) appear in the evaluation section of the manuscript. To improve accessibility we will revise the abstract to include a concise statement of the key quantitative gains and will ensure the main text explicitly lists the classical vocoders, their quantization settings, and any statistical analysis performed.

revision: partial

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper introduces an empirical neural vocoding method based on conditional GANs for glottal excitation synthesis followed by LPC refinement, with claims resting on subjective and objective evaluations over 30 speakers. No equations, self-citations, or derivations are presented that reduce by construction to fitted inputs or prior self-referential results; the approach is described as a new methodology without self-definitional loops, uniqueness theorems imported from the authors, or renaming of known results. The central claim of perceptual improvement is supported by external evaluation metrics rather than internal redefinition of inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Based on abstract only; the central claim relies on the effectiveness of the GAN-LPC pipeline which may involve fitted parameters in training and domain assumptions about LPC refinement.

free parameters (1)

GAN training hyperparameters
Likely many hyperparameters for the GAN are tuned but not specified in abstract.

axioms (1)

domain assumption: LPC parameters accurately represent the vocal tract filter for refinement
The method assumes LPC can effectively refine the GAN output to natural speech.

pith-pipeline@v0.9.0 · 5753 in / 1124 out tokens · 27710 ms · 2026-05-25T11:29:29.358350+00:00 · methodology

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 4 internal anchors

[1] Introduction. Speech coding is one of the fundamental functionalities of current multimedia communication systems over band limited transmission channels [1]. The conventional approaches for coding speech signals are based on the source-filter model, in which a speech signal is decomposed into its glottal excitation source signal and its vocal tract filter p...

[2] Analysis by Adversarial Synthesis. Besides the ability of one-shot sample generation, GANs can create realistic data from a totally-abstract noise prior (e.g., Gaussian noise). The adversarial training makes it possible to map a simple prior distribution into complicated real-world distributions in a high-dimensional space. This has been achieved efficien...

[3] Model Configuration and Training Setup. For training and testing the generative model, we used the clean speech signals of the dataset created by Valentini et al. [13]. It is an open source dataset of 15 male and 15 female speakers selected from the Voice Bank corpus introduced by Veaux et al. [14]. The training data is constructed by the speech signals of...

[4] ... with a leakage factor of 0.2 is used for activating all layers, except the last one where only the convolution operation is applied. The channel depths starting from the input until the output of D are: 2, 16, 16, 32, 32, 64 and 32. Spectral normalization [17] is applied to all convolutional layers of D to ensure the Lipschitz continuity that is require...

[5] Results. The main outcome of this work is the ability of CGANs to create realistic speech waveforms in one-shot from a highly compressed representation of the glottal excitation. This is enhanced by the cross synthesis step in order to obtain a natural reconstruction, as illustrated in Figure 3. Figure 3: GANs for speech vocoding: A fake speech signal is...

[6] Conclusions. This paper introduces a new method for neural speech vocoding, with much faster generation than autoregressive generative models and higher perceptual quality than classical vocoding. The method, which is called analysis by adversarial synthesis (AbAS), starts with generating a fake speech signal from a neurally-learned parametric represen...
[7] P. Vary and R. Martin, Digital speech transmission: Enhancement, coding and error concealment. John Wiley & Sons, 2006.

[8] B. Bessette, R. Salami, R. Lefebvre, M. Jelinek, J. Rotola-Pukkila, J. Vainio, H. Mikkola, and K. Jarvinen, “The adaptive multirate wideband speech codec (amr-wb),” IEEE transactions on speech and audio processing, vol. 10, no. 8, pp. 620–636, 2002.

[9] A. Van Den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” CoRR abs/1609.03499, 2016.

[10] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio, “Samplernn: An unconditional end-to-end neural audio generation model,” arXiv preprint arXiv:1612.07837, 2016.

[11] W. B. Kleijn, F. S. Lim, A. Luebs, J. Skoglund, F. Stimberg, Q. Wang, and T. C. Walters, “Wavenet based low rate speech coding,” in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 676–680.

[12] J. Klejsa, P. Hedelin, C. Zhou, R. Fejgin, and L. Villemoes, “High-quality speech coding with sample rnn,” in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 7155–7159.

[13] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. van den Driessche, E. Lockhart, L. Cobo, F. Stimberg, N. Casagrande, D. Grewe, S. Noury, S. Dieleman, E. Elsen, N. Kalchbrenner, H. Zen, A. Graves, H. King, T. Walters, D. Belov, and D. Hassabis, “Parallel WaveNet: Fast high-fidelity speech synthesis,” in Proceedings of ...

[14] I. Goodfellow, “Nips 2016 tutorial: Generative adversarial networks,” arXiv preprint arXiv:1701.00160, 2016.

[15] B. Bollepalli, L. Juvela, and P. Alku, “Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis,” in Proc. of Interspeech, 2017, pp. 3394–3398.

[16] L. Juvela, B. Bollepalli, X. Wang, H. Kameoka, M. Airaksinen, J. Yamagishi, and P. Alku, “Speech waveform synthesis from mfcc sequences with generative adversarial networks,” in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5679–5683.

[17] A. Brock, J. Donahue, and K. Simonyan, “Large scale GAN training for high fidelity natural image synthesis,” in Proc. of the International Conference on Learning Representations (ICLR), 2019. [Online]. Available: https://openreview.net/forum?id=B1xsqj09Fm

[18] J. Engel, K. K. Agrawal, S. Chen, I. Gulrajani, C. Donahue, and A. Roberts, “GANSynth: Adversarial neural audio synthesis,” in Proc. of the International Conference on Learning Representations (ICLR), 2019. [Online]. Available: https://openreview.net/forum?id=H1xQVn09FX
[19] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, “Investigating rnn-based speech enhancement methods for noise-robust text-to-speech,” in 9th ISCA Speech Synthesis Workshop, 2016, pp. 146–152.

[20] C. Veaux, J. Yamagishi, and S. King, “The voice bank corpus: Design, collection and data analysis of a large regional accent speech database,” in Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), 2013 International Conference. IEEE, 2013, pp. 1–4.

[21] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034.

[22] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1125–1134.

[23] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normalization for generative adversarial networks,” in Proc. of the International Conference on Learning Representations (ICLR), 2018. [Online]. Available: https://openreview.net/forum?id=B1QRgziT-

[24] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in Proc. of the International Conference on Machine Learning, 2017, pp. 214–223.

[25] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Self-attention generative adversarial networks,” in Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR, 2019, pp. 7354–7363. [Online]. Available: http://proceedings.mlr.press/v97/zhang19d.html

[26] S. J. Reddi, S. Kale, and S. Kumar, “On the convergence of adam and beyond,” in Proc. of the 6th International Conference on Learning Representations (ICLR), 2018. [Online]. Available: https://openreview.net/forum?id=ryQu7f-RZ

[27] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the thirteenth international conference on artificial intelligence and statistics, 2010, pp. 249–256.

[28] P. Hedelin, “A sinusoidal lpc vocoder,” in Proc. of the IEEE Workshop on Speech Coding. IEEE, 2000, pp. 2–4.

[29] S. Pascual, A. Bonafonte, and J. Serrà, “Segan: Speech enhancement generative adversarial network,” in Proc. of INTERSPEECH, 2017, pp. 3642–3646.

[30] A. Hines, J. Skoglund, A. C. Kokaram, and N. Harte, “Visqol: an objective speech quality model,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2015, no. 1, p. 13, May 2015. [Online]. Available: https://doi.org/10.1186/s13636-015-0054-9

[31] R. B. ITU-R, “1534-1, method for the subjective assessment of intermediate quality levels of coding systems (mushra),” International Telecommunication Union, 2003.

[1] [1]

Introduction Speech coding is one of the fundamental functionalities of current multimedia communication systems over band limited transmission channels [1]. The conventional approaches for coding speech signals are based on the source-ﬁlter model, in which a speech signal is decomposed into its glottal excitation source signal and its vocal tract ﬁlter p...

work page

[2] [2]

Analysis by Adversarial Synthesis -- A Novel Approach for Speech Vocoding

Analysis by Adversarial Synthesis Besides the ability of one-shot sample generation, GANs can create realistic data from a totally-abstract noise prior (e.g., Gaussian noise). The adversarial training makes it possible to map a simple prior distribution into complicated real-world dis- tributions in a high-dimensional space. This has been achieved efﬁcien...

work page internal anchor Pith review Pith/arXiv arXiv 1907

[3] [3]

Model Conﬁguration and Training Setup For training and testing the generative model, we used the clean speech signals of the dataset created by Valentini et al. [13]. It is an open source dataset of 15 male and 15 female speakers selected from the V oice Bank corpus introduced by Veaux et al. [14]. The training data is constructed by the speech signals of...

work page 2000

[4] [4]

The channel depths starting from the input until the output ofD are: 2, 16, 16, 32, 32, 64 and 32

with a leakage factor of 0.2 is used for activating all lay- ers, except the last one where only the convolution operation is applied. The channel depths starting from the input until the output ofD are: 2, 16, 16, 32, 32, 64 and 32. Spectral normal- ization [17] is applied to all convolutional layers ofD to ensure the Lipschitz continuity that is require...

work page

[5] [5]

This is en- hanced by the cross synthesis step in order to obtain a natural reconstruction, as illustrated in Figure 3

Results The main outcome of this work is the ability of CGANs to create realistic speech waveforms in one-shot from a highly compressed representation of the glottal excitation. This is en- hanced by the cross synthesis step in order to obtain a natural reconstruction, as illustrated in Figure 3. Figure 3: GANs for speech vocoding: A fake speech signal is...

work page 1973

[6] [6]

Conclusions This paper introduces a new method for neural speech vocod- ing, with much faster generation than autoregressive generative models and higher perceptual quality than classical vocoding. The method, which is called analysis by adversarial synthe- sis (AbAS), starts with generating a fake speech signal from a neurally-learned parametric represen...

work page

[7] [7]

Vary and R

P. Vary and R. Martin, Digital speech transmission: Enhance- ment, coding and error concealment. John Wiley & Sons, 2006

work page 2006

[8] [8]

The adaptive multirate wideband speech codec (amr-wb),

B. Bessette, R. Salami, R. Lefebvre, M. Jelinek, J. Rotola-Pukkila, J. Vainio, H. Mikkola, and K. Jarvinen, “The adaptive multirate wideband speech codec (amr-wb),” IEEE transactions on speech and audio processing, vol. 10, no. 8, pp. 620–636, 2002

work page 2002

[9] [9]

WaveNet: A Generative Model for Raw Audio

A. Van Den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” CoRR abs/1609.03499, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[10] [10]

SampleRNN: An Unconditional End-to-End Neural Audio Generation Model

S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y . Bengio, “Samplernn: An unconditional end-to-end neural audio generation model,” arXiv preprint arXiv:1612.07837, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[11] [11]

Wavenet based low rate speech cod- ing,

W. B. Kleijn, F. S. Lim, A. Luebs, J. Skoglund, F. Stimberg, Q. Wang, and T. C. Walters, “Wavenet based low rate speech cod- ing,” in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 676– 680

work page 2018

[12] [12]

High- quality speech coding with sample rnn,

J. Klejsa, P. Hedelin, C. Zhou, R. Fejgin, and L. Villemoes, “High- quality speech coding with sample rnn,” in Proc. of the IEEE In- ternational Conference on Acoustics, Speech and Signal Process- ing (ICASSP). IEEE, 2019, pp. 7155–7159

work page 2019

[13] [13]

Parallel WaveNet: Fast high- ﬁdelity speech synthesis,

A. van den Oord, Y . Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. van den Driessche, E. Lockhart, L. Cobo, F. Stimberg, N. Casagrande, D. Grewe, S. Noury, S. Dieleman, E. Elsen, N. Kalchbrenner, H. Zen, A. Graves, H. King, T. Wal- ters, D. Belov, and D. Hassabis, “Parallel WaveNet: Fast high- ﬁdelity speech synthesis,” in Proceedings of ...

work page 2018

[14] [14]

NIPS 2016 Tutorial: Generative Adversarial Networks

I. Goodfellow, “Nips 2016 tutorial: Generative adversarial net- works,” arXiv preprint arXiv:1701.00160, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[15] [15]

Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis,

B. Bollepalli, L. Juvela, and P. Alku, “Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis,” in Proc. of Interspeech, 2017, pp. 3394–3398

work page 2017

[16] L. Juvela, B. Bollepalli, X. Wang, H. Kameoka, M. Airaksinen, J. Yamagishi, and P. Alku, “Speech waveform synthesis from MFCC sequences with generative adversarial networks,” in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5679–5683

[17] A. Brock, J. Donahue, and K. Simonyan, “Large scale GAN training for high fidelity natural image synthesis,” in Proc. of the International Conference on Learning Representations (ICLR), 2019. [Online]. Available: https://openreview.net/forum?id=B1xsqj09Fm

[18] J. Engel, K. K. Agrawal, S. Chen, I. Gulrajani, C. Donahue, and A. Roberts, “GANSynth: Adversarial neural audio synthesis,” in Proc. of the International Conference on Learning Representations (ICLR), 2019. [Online]. Available: https://openreview.net/forum?id=H1xQVn09FX

[19] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, “Investigating RNN-based speech enhancement methods for noise-robust text-to-speech,” in 9th ISCA Speech Synthesis Workshop, 2016, pp. 146–152

[20] C. Veaux, J. Yamagishi, and S. King, “The voice bank corpus: Design, collection and data analysis of a large regional accent speech database,” in Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), 2013 International Conference. IEEE, 2013, pp. 1–4

[21] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034

[22] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134

[23] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normalization for generative adversarial networks,” in Proc. of the International Conference on Learning Representations (ICLR), 2018. [Online]. Available: https://openreview.net/forum?id=B1QRgziT-

[24] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in Proc. of the International Conference on Machine Learning, 2017, pp. 214–223

[25] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Self-attention generative adversarial networks,” in Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR, 2019, pp. 7354–7363. [Online]. Available: http://proceedings.mlr.press/v97/zhang19d.html

[26] S. J. Reddi, S. Kale, and S. Kumar, “On the convergence of Adam and beyond,” in Proc. of the 6th International Conference on Learning Representations (ICLR), 2018. [Online]. Available: https://openreview.net/forum?id=ryQu7f-RZ

[27] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256

[28] P. Hedelin, “A sinusoidal LPC vocoder,” in Proc. of the IEEE Workshop on Speech Coding. IEEE, 2000, pp. 2–4

[29] S. Pascual, A. Bonafonte, and J. Serrà, “SEGAN: Speech enhancement generative adversarial network,” in Proc. of INTERSPEECH, 2017, pp. 3642–3646

[30] A. Hines, J. Skoglund, A. C. Kokaram, and N. Harte, “ViSQOL: an objective speech quality model,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2015, no. 1, p. 13, May 2015. [Online]. Available: https://doi.org/10.1186/s13636-015-0054-9

[31] ITU-R, “Recommendation BS.1534-1: Method for the subjective assessment of intermediate quality levels of coding systems (MUSHRA),” International Telecommunication Union, 2003