Generalization of Spectrum Differential based Direct Waveform Modification for Voice Conversion

Hsin-Min Wang; Hsin-Te Hwang; Kazuhiro Kobayashi; Patrick Lumban Tobing; Tomoki Toda; Wen-Chin Huang; Yi-Chiao Wu; Yu-Huai Peng; Yu Tsao

arxiv: 1907.11898 · v1 · pith:LJMPJLMKnew · submitted 2019-07-27 · 📡 eess.AS · eess.SP


Pith reviewed 2026-05-24 14:49 UTC · model grok-4.3

classification 📡 eess.AS eess.SP

keywords: voice conversion, DIFFVC, spectral conversion, direct waveform modification, F0 transformation, residual domain, VAE, vocoder-free synthesis

The pith

Inverse and synthesis filtering on residuals lets any spectral conversion model generate waveforms directly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper modifies the DIFFVC method for voice conversion so it works with any spectral conversion model. The change performs inverse filtering on the input speech, transforms F0 in the residual domain, and applies synthesis filtering using the already-converted spectral features. This removes the requirement that the conversion model must predict spectral differentials or be retrained for each pair. When tested on a non-parallel VAE-based spectral model, the approach produces higher quality output than a vocoder-based baseline while keeping rich spectral details.

Core claim

By performing inverse filtering on the input signal followed by synthesis filtering on the F0 transformed residual signal using the converted spectral features directly, the spectral conversion model does not need to be retrained or capable of predicting the spectral differential.
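
One way to see why this rearrangement stays compatible with the original DIFFVC is to write both as filters. The sketch below is not taken from the paper: it assumes mel-cepstral filters of the exponential (MLSA-style) form, treats each frame as time-invariant, and introduces its own notation ($c^{(x)}$ for source features, $\hat{c}^{(y)}$ for converted features, $\mathcal{T}$ for the F0 transformation of the residual).

```latex
\begin{align}
H_{c}(z) &= \exp\Big(\sum_{m} c_m \tilde{z}^{-m}\Big), \\
E(z) &= \frac{X(z)}{H_{c^{(x)}}(z)} && \text{(inverse filtering with source features)}, \\
E'(z) &= \mathcal{T}\{E\}(z) && \text{(F0 transformation of the residual)}, \\
Y(z) &= H_{\hat{c}^{(y)}}(z)\, E'(z) && \text{(synthesis filtering with converted features)}.
\end{align}
% Special case: no F0 change, so T is the identity and E' = E.
\begin{equation}
Y(z) = \frac{H_{\hat{c}^{(y)}}(z)}{H_{c^{(x)}}(z)}\, X(z)
     = \exp\Big(\sum_{m} \big(\hat{c}^{(y)}_m - c^{(x)}_m\big)\, \tilde{z}^{-m}\Big)\, X(z).
\end{equation}
```

Under these assumptions, the cascade with no F0 change collapses to a single filter driven by the difference $\hat{c}^{(y)} - c^{(x)}$, which is the spectral differential that the original framework asks a model to predict. The rearrangement obtains it from any model that outputs $\hat{c}^{(y)}$.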

What carries the argument

F0 transformation in the residual domain through inverse filtering followed by synthesis filtering with converted spectral features.
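
The three steps can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it swaps the mel-cepstral filters for a fixed all-pole (LPC-style) filter, uses a single coefficient set instead of frame-wise features, and all function names are hypothetical.

```python
import math


def inverse_filter(signal, a):
    """Apply A(z) = 1 + a[0] z^-1 + a[1] z^-2 + ... to get the residual."""
    residual = []
    for n, x in enumerate(signal):
        acc = x
        for k, coef in enumerate(a, start=1):
            if n - k >= 0:
                acc += coef * signal[n - k]
        residual.append(acc)
    return residual


def synthesis_filter(residual, a):
    """Apply the all-pole filter 1 / A(z) to turn a residual into a signal."""
    out = []
    for n, e in enumerate(residual):
        acc = e
        for k, coef in enumerate(a, start=1):
            if n - k >= 0:
                acc -= coef * out[n - k]
        out.append(acc)
    return out


def convert(signal, source_a, converted_a, f0_transform=None):
    """Inverse-filter, optionally transform F0 on the residual, re-synthesize."""
    residual = inverse_filter(signal, source_a)
    if f0_transform is not None:  # assumed hook for the F0 step
        residual = f0_transform(residual)
    return synthesis_filter(residual, converted_a)


# Sanity check: identical source and "converted" envelopes give back the input.
signal = [math.sin(0.3 * n) for n in range(50)]
same = convert(signal, [-0.9, 0.2], [-0.9, 0.2])
max_err = max(abs(x - y) for x, y in zip(signal, same))
```

Passing a different `converted_a` changes only the spectral envelope imposed at synthesis, while the excitation detail carried by the residual is reused. That reuse is the property the vocoder-free claim relies on.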

If this is right

Any spectral conversion model can serve as the waveform generation module without extra training steps.
Non-parallel models such as VAE-based converters become compatible with direct waveform modification.
The need for parallel training data and pair-specific retraining disappears.
Waveform output avoids vocoder processing while retaining spectral detail.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The filtering steps could be applied to other residual-based signal processing tasks outside voice conversion.
Optimizing the inverse and synthesis filters might enable lower-latency conversion pipelines.
The approach may reduce training data requirements for new voice pairs in deployment.

Load-bearing premise

Synthesis filtering applied to the F0-transformed residual using converted spectral features will preserve waveform quality and avoid artifacts that differential prediction would have handled.

What would settle it

A side-by-side listening test or objective quality measure on the same spectral conversion model showing that the proposed method produces more artifacts or lower naturalness than the original DIFFVC that requires explicit differential estimation.

Figures

Figures reproduced from arXiv: 1907.11898 by Hsin-Min Wang, Hsin-Te Hwang, Kazuhiro Kobayashi, Patrick Lumban Tobing, Tomoki Toda, Wen-Chin Huang, Yi-Chiao Wu, Yu-Huai Peng, Yu Tsao.

**Figure 1.** The spectrum differential based direct waveform modification for vocoder-free voice conversion, where the spectrum differential is estimated using a Gaussian mixture model (GMM). Such a GMM is termed DIFFGMM.

**Figure 2.** The proposed direct waveform modification framework, where the F0 transformation is realized in the residual domain. sig, res, mcp, and env represent the waveform signal, residual signal, mel-cepstrum, and envelope signal, respectively.

**Figure 3.** An illustration of the calculated envelopes envW (e_world), envGV (e_gv), and envDIFF (e_diff), which are from the converted sample SF1-TF1-30006. The red dashed line denotes the threshold, which is set to 10000 here.

**Figure 4.** The top and bottom rows show spectrograms of the source speech and of the speech converted using our method, respectively. Top left: SF1-30001. Top right: SM1-30001. Bottom left: SF1-TF1-30001. Bottom right: SM1-TF1-30001.

Original abstract

We present a modification to the spectrum differential based direct waveform modification for voice conversion (DIFFVC) so that it can be directly applied as a waveform generation module to voice conversion models. The recently proposed DIFFVC avoids the use of a vocoder, meanwhile preserves rich spectral details hence capable of generating high quality converted voice. To apply the DIFFVC framework, a model that can estimate the spectral differential from the F0 transformed input speech needs to be trained beforehand. This requirement imposes several constraints, including a limitation on the estimation model to parallel training and the need of extra training on each conversion pair, which make DIFFVC inflexible. Based on the above motivations, we propose a new DIFFVC framework based on an F0 transformation in the residual domain. By performing inverse filtering on the input signal followed by synthesis filtering on the F0 transformed residual signal using the converted spectral features directly, the spectral conversion model does not need to be retrained or capable of predicting the spectral differential. We describe several details that need to be taken care of under this modification, and by applying our proposed method to a non-parallel, variational autoencoder (VAE)-based spectral conversion model, we demonstrate that this framework can be generalized to any spectral conversion model, and experimental evaluations show that it can outperform a baseline framework whose waveform generation process is carried out by a vocoder.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Shifting F0 transformation into the residual domain removes the need for differential prediction, letting any spectral model feed a DIFFVC-style waveform stage.

The core change is straightforward: inverse-filter the input, apply the F0 shift to the residual, then synthesis-filter using the converted spectral features directly. This drops the requirement that the spectral model predict differentials or be trained per conversion pair, which is the actual extension past the cited DIFFVC work. The VAE demonstration shows the framework now works with non-parallel data, and the authors report it beats their vocoder baseline on quality. That is useful for anyone already running a spectral converter who wants to avoid vocoder artifacts without retraining everything.

The filtering steps themselves are standard, so the math does not introduce new free parameters or circular definitions. The authors correctly flag that several implementation details still need attention, which lines up with the stress-test concern about possible mismatches once the differential correction step is removed. Without the full metrics, ablations, or listening-test numbers it is hard to judge how often those details cause audible problems versus how often the method stays clean.

The paper is aimed at voice-conversion engineers who have a working spectral model and want a direct waveform path. A reader in that niche will find the rearrangement practical even if the quality gain is modest. It is solid enough on its own terms to merit referee time rather than a desk reject.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a modification to the spectrum differential based direct waveform modification (DIFFVC) for voice conversion. By performing inverse filtering on the input signal, F0 transformation in the residual domain, and synthesis filtering using the converted spectral features directly, the approach eliminates the need for a separate model to estimate spectral differentials or for retraining the spectral conversion model on each pair. The authors apply this to a non-parallel VAE-based spectral conversion model and claim that it generalizes to arbitrary spectral models while outperforming a vocoder-based baseline in experimental evaluations.
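
The paper excerpt quoted in the reference graph describes the residual-domain F0 transformation as a time-scaling step followed by resampling. The sketch below illustrates only the resampling half, under stated assumptions: the residual is taken to be already time-scaled by the F0 ratio (for example by a WSOLA-style method, which is an assumption here), plain linear interpolation stands in for a proper band-limited resampler, and the function names are hypothetical.

```python
import math


def resample_linear(x, new_len):
    """Resample x to new_len samples by linear interpolation."""
    if new_len < 2 or len(x) < 2:
        raise ValueError("need at least two input and two output samples")
    step = (len(x) - 1) / (new_len - 1)
    out = []
    for i in range(new_len):
        pos = i * step
        lo = min(int(pos), len(x) - 2)  # keep lo + 1 inside the list
        frac = pos - lo
        out.append((1.0 - frac) * x[lo] + frac * x[lo + 1])
    return out


def shift_f0_by_resampling(time_scaled_residual, ratio):
    """Undo a time-scaling by `ratio`, which multiplies F0 by `ratio`.

    ratio > 1: the input was expanded and is down-sampled here.
    ratio < 1: the input was shrunk and is up-sampled here.
    """
    new_len = round(len(time_scaled_residual) / ratio)
    return resample_linear(time_scaled_residual, new_len)


def count_upward_zero_crossings(x):
    """Count sign changes from negative to non-negative."""
    return sum(1 for a, b in zip(x, x[1:]) if a < 0.0 <= b)


# Toy residual: 400 samples with a 20-sample period, standing in for a
# residual that was already expanded to twice its original duration.
stretched = [math.sin(2.0 * math.pi * n / 20.0 + 0.1) for n in range(400)]
shifted = shift_f0_by_resampling(stretched, 2.0)
```

In this toy case the output keeps the same number of cycles in half as many samples, so at a fixed sampling rate the period halves and F0 doubles while the duration returns to that of the unstretched signal.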

Significance. If the quality preservation holds without the original differential correction, the framework would increase flexibility for high-quality waveform generation across diverse spectral conversion models, removing constraints like parallel training data requirements.

major comments (2)

[Abstract] Abstract: the claim that experiments 'show that it can outperform a baseline framework whose waveform generation process is carried out by a vocoder' provides no metrics, controls, ablation details, or evaluation protocol, which is load-bearing for the generalization claim.
[Abstract] Abstract (method description): the premise that synthesis filtering on the F0-transformed residual using directly converted (non-differential) spectral features will preserve waveform quality and avoid artifacts previously compensated by explicit differential estimation is unverified; no comparison to original DIFFVC or quantitative evidence of artifact-free output is referenced.

minor comments (1)

[Abstract] Abstract: the reference to 'several details that need to be taken care of under this modification' is vague and should be expanded with explicit pointers to the relevant implementation steps or sections.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address the major comments point by point below, clarifying the manuscript's contributions and experimental support while noting where revisions could strengthen the presentation.

Referee: [Abstract] Abstract: the claim that experiments 'show that it can outperform a baseline framework whose waveform generation process is carried out by a vocoder' provides no metrics, controls, ablation details, or evaluation protocol, which is load-bearing for the generalization claim.

Authors: The abstract is intended as a concise summary. The full manuscript details the experimental protocol, including application to a non-parallel VAE spectral conversion model, objective measures (e.g., MCD), subjective evaluations, and direct comparison against the vocoder baseline under matched conditions. We agree the abstract could be strengthened by briefly referencing key results and will revise it accordingly. Revision: partial.
Referee: [Abstract] Abstract (method description): the premise that synthesis filtering on the F0-transformed residual using directly converted (non-differential) spectral features will preserve waveform quality and avoid artifacts previously compensated by explicit differential estimation is unverified; no comparison to original DIFFVC or quantitative evidence of artifact-free output is referenced.

Authors: Section 3 explains the residual-domain approach: inverse filtering isolates the excitation, F0 is transformed there, and converted spectra are used directly for synthesis filtering. This design removes the differential estimation step and its associated compensation. Experiments with the VAE model show the resulting quality exceeds the vocoder baseline. A head-to-head comparison with original DIFFVC is not performed because the original requires parallel data and pair-specific differential models, the very constraints our generalization removes. The reported results provide indirect quantitative support via the baseline comparison and lack of reported artifacts. Revision: no.
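
The rebuttal names MCD as an example objective measure. For readers unfamiliar with it, the following is a minimal sketch of the conventional mel-cepstral distortion formula. It assumes the converted and target frames are already time-aligned and that the 0th (energy) coefficient has been dropped; the function name is illustrative and nothing here reproduces the paper's evaluation code.

```python
import math


def mel_cepstral_distortion(converted, target):
    """Mean MCD in dB over paired frames of mel-cepstral coefficients."""
    if not converted or len(converted) != len(target):
        raise ValueError("need the same non-zero number of frames")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for conv_frame, tgt_frame in zip(converted, target):
        squared = sum((c - t) ** 2 for c, t in zip(conv_frame, tgt_frame))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(converted)


# Identical features give zero distortion.
zero = mel_cepstral_distortion([[0.1, -0.2]], [[0.1, -0.2]])
# A single coefficient off by 1.0 gives (10 / ln 10) * sqrt(2) dB.
unit = mel_cepstral_distortion([[1.0, 0.0]], [[0.0, 0.0]])
```

A lower value means the converted spectral features are closer to the target's. The measure does not capture waveform artifacts, which is why a listening test is still needed to check the load-bearing premise.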

Circularity Check

0 steps flagged

No significant circularity; pipeline rearrangement is self-contained

The paper proposes a signal-processing modification to DIFFVC: inverse filtering of input, F0 transform on residual, then synthesis filtering with directly converted spectral features. This rearrangement is described via standard operations and does not reduce any claimed prediction or result to a fitted parameter, self-definition, or self-citation chain. The central claim (generalization to arbitrary spectral models without differential estimation) is presented as an engineering change whose validity is checked by experiments on a VAE model; no load-bearing step equates output to input by construction. Minor self-citations to prior DIFFVC work are not used to import uniqueness theorems or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on standard digital signal processing assumptions for speech; no new entities or fitted constants are introduced in the abstract description.

axioms (1)

Standard math: Inverse filtering and synthesis filtering operations are invertible and preserve sufficient information for high-quality waveform reconstruction when applied to speech residuals.
Invoked when the paper states that synthesis filtering on the F0-transformed residual using converted spectral features replaces the original differential estimation.

pith-pipeline@v0.9.0 · 5811 in / 1157 out tokens · 23150 ms · 2026-05-24T14:49:29.773566+00:00 · methodology

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 5 internal anchors

[1] Generalization of Spectrum Differential based Direct Waveform Modification for Voice Conversion. Introduction. Voice conversion (VC) aims to convert the speech from a source to that of a target without changing the linguistic content. Numerous approaches have been proposed, such as Gaussian mixture model (GMM)-based methods [1, 2], deep neural network (DNN)-based methods [3, 4], and exemplar-based methods [5, 6, 7]. While most VC researchers ...
[2] Spectrum Differential based Direct Waveform Modification for Voice Conversion (DIFFVC). 2.1. DIFFVC based on DIFFGMM. DIFFVC is a conversion framework (not restricted to VC but also other applications like singing VC) that does not employ a parametric vocoder as the waveform generation module [25, 27, 28, 30]. In this section we describe the DIFFVC framework...
[3] ... and resampling process can be performed on the residual signal in order to transform F0. Specifically, the residual signal is shrunk then up-sampled if the F0 transformation ratio is smaller than 1 and, conversely, expanded then down-sampled if the ratio larger than 1. Finally, the F0 transformed speech is restored by filtering the modified residual signal...
[4] Proposed Method based on Residual Transformation. Our goal is to extend the vocoder-free DIFFVC framework to any arbitrary VC model, which only knows how to convert normal source features to target features. To impose as few constraints as possible, we only demand the VC model to estimate Figure 2: The proposed direct waveform modification framework, wh...
[5] Experimental Evaluation. 4.1. Experimental settings. We evaluated our proposed methods on the SPOKE task of Voice Conversion Challenge 2018 (VCC2018) [36], which included recordings of professional US English speakers with a sampling rate of 22050 Hz. The dataset consisted of 81/35 utterances per speaker for training/testing sets, respectively. We used...
[6] Conclusions and Future Work. In this paper, we introduced a generalization of the DIFFVC framework to make it applicable to general VC models. The proposed method is based on an F0 transformation in the residual domain, so that synthesis filtering is performed directly using the converted spectral features, thus removing the need for the conversion mode...
[7] Y. Stylianou, O. Cappé, and E. Moulines, “Continuous probabilistic transform for voice conversion,” IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2, pp. 131–142, Mar 1998.
[8] T. Toda, A. W. Black, and K. Tokuda, “Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222–2235, Nov 2007.
[9] S. Desai, A. W. Black, B. Yegnanarayana, and K. Prahallad, “Spectral mapping using artificial neural networks for voice conversion,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 954–964, July 2010.
[10] L. H. Chen, Z. H. Ling, L. J. Liu, and L. R. Dai, “Voice conversion using deep neural networks with layer-wise generative training,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1859–1872, Dec 2014.
[11] R. Takashima, T. Takiguchi, and Y. Ariki, “Exemplar-based voice conversion in noisy environment,” in Proc. SLT, 2012, pp. 313–317.
[12] Z. Wu, T. Virtanen, E. S. Chng, and H. Li, “Exemplar-based sparse representation with residual compensation for voice conversion,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1506–1521, Oct 2014.
[13] Y.-C. Wu, H.-T. Hwang, C.-C. Hsu, Y. Tsao, and H.-M. Wang, “Locally linear embedding for exemplar-based spectral conversion,” in Proc. Interspeech, 2016, pp. 1652–1656.
[14] B. S. Atal and S. L. Hanauer, “Speech analysis and synthesis by linear prediction of the speech wave,” The Journal of the Acoustical Society of America, vol. 50, no. 2B, pp. 637–655, 1971.
[15] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, “Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based f0 extraction: Possible role of a repetitive structure in sounds,” Speech Communication, vol. 27, no. 3, pp. 187–207, 1999.
[16] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications,” IEICE Transactions on Information and Systems, vol. 99, pp. 1877–1884, 2016.
[17] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, “Speaker-dependent wavenet vocoder,” in Proc. Interspeech, 2017, pp. 1118–1122.
[18] T. Hayashi, A. Tamamori, K. Kobayashi, K. Takeda, and T. Toda, “An investigation of multi-speaker training for wavenet vocoder,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Dec 2017, pp. 712–718.
[19] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. v. d. Oord, S. Dieleman, and K. Kavukcuoglu, “Efficient neural audio synthesis,” arXiv preprint arXiv:1802.08435, 2018.
[20] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio, “Samplernn: An unconditional end-to-end neural audio generation model,” arXiv preprint arXiv:1612.07837, 2016.
[21] Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu, “Fftnet: A real-time speaker-dependent neural vocoder,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 2251–2255.
[22] J.-M. Valin and J. Skoglund, “Lpcnet: Improving neural speech synthesis through linear prediction,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5891–5895.
[23] A. v. d. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. v. d. Driessche, E. Lockhart, L. C. Cobo, F. Stimberg et al., “Parallel wavenet: Fast high-fidelity speech synthesis,” arXiv preprint arXiv:1711.10433, 2017.
[24] R. Prenger, R. Valle, and B. Catanzaro, “Waveglow: A flow-based generative network for speech synthesis,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 3617–3621.
[25] X. Wang, S. Takaki, and J. Yamagishi, “Neural source-filter-based waveform model for statistical parametric speech synthesis,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5916–5920.
[26] K. Kobayashi, T. Hayashi, A. Tamamori, and T. Toda, “Statistical voice conversion with wavenet-based waveform generation,” in Proc. Interspeech, 2017, pp. 1138–1142.
[27] K. Chen, B. Chen, J. Lai, and K. Yu, “High-quality voice conversion using spectrogram-based wavenet vocoder,” in Proc. Interspeech, 2018, pp. 1993–1997.
[28] K. Qian, Y. Zhang, S. Chang, X. Yang, and M. Hasegawa-Johnson, “Zero-shot voice style transfer with only autoencoder loss,” arXiv preprint arXiv:1905.05879, 2019.
[29] J.-X. Zhang, Z.-H. Ling, L.-J. Liu, Y. Jiang, and L.-R. Dai, “Sequence-to-sequence acoustic modeling for voice conversion,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 27, no. 3, pp. 631–644, 2019.
[30] B. Sisman, M. Zhang, S. Sakti, H. Li, and S. Nakamura, “Adaptive wavenet vocoder for residual compensation in gan-based voice conversion,” in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 282–289.
[31] K. Kobayashi, S. Takamichi, S. Nakamura, and T. Toda, “The nu-naist voice conversion system for the voice conversion challenge 2016,” in Interspeech, 2016, pp. 1667–1671.
[32] T. Toda, L.-H. Chen, D. Saito, F. Villavicencio, M. Wester, Z. Wu, and J. Yamagishi, “The voice conversion challenge 2016,” in Interspeech 2016, 2016, pp. 1632–1636.
[33] K. Kobayashi and T. Toda, “sprocket: Open-source voice conversion software,” in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 203–210.
[34] K. Kobayashi, T. Toda, and S. Nakamura, “Implementation of f0 transformation for statistical singing voice conversion based on direct waveform modification,” 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5670–5674, 2016.
[35] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, “Voice conversion from non-parallel corpora using variational auto-encoder,” in Proc. APSIPA ASC, 2016, pp. 1–6.
[36] K. Kobayashi, T. Toda, and S. Nakamura, “F0 transformation techniques for statistical voice conversion with direct waveform modification with spectral differential,” 2016 IEEE Spoken Language Technology Workshop (SLT), pp. 693–700, 2016.
[37] H. Silén, E. Helander, J. Nurminen, and M. Gabbouj, “Ways to implement global variance in statistical speech synthesis,” in Proc. Interspeech, 2012, pp. 1436–1439.
[38] W. Verhelst and M. Roelands, “An overlap-add technique based on waveform similarity (wsola) for high quality time-scale modification of speech,” in ICASSP, 1993.
[39] J. Makhoul and M. Berouti, “High-frequency regeneration in speech coding systems,” in ICASSP ’79. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, April 1979, pp. 428–431.
[40] S. Chennoukh, A. Gerrits, G. Miet, and R. Sluijter, “Speech enhancement via frequency bandwidth extension using line spectral frequencies,” in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), vol. 1. IEEE, 2001, pp. 665–668.
[41] Y.-C. Wu, K. Kobayashi, T. Hayashi, P. L. Tobing, and T. Toda, “Collapsed speech segment detection and suppression for wavenet vocoder,” in Proc. Interspeech 2018, 2018, pp. 1988–1992.
[42] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, “The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods,” in Proc. Odyssey, 2018, pp. 195–202.
[43] W.-C. Huang, Y.-C. Wu, C.-C. Lo, P. Lumban Tobing, T. Hayashi, K. Kobayashi, T. Toda, Y. Tsao, and H.-M. Wang, “Investigation of F0 conditioning and Fully Convolutional Networks in Variational Autoencoder based Voice Conversion,” arXiv e-prints, May 2019.
[44] K. Tokuda, T. Kobayashi, T. Masuko, and S. Imai, “Mel-generalized cepstral analysis - a unified approach to speech spectral estimation,” in ICSLP, 1994.

[1] [1]

Generalization of Spectrum Differential based Direct Waveform Modification for Voice Conversion

Introduction V oice conversion (VC) aims to convert the speech from a source to that of a target without changing the linguistic content. Nu- merous approaches have been proposed, such as Gaussian mix- ture model (GMM)-based methods [1, 2], deep neural net- work (DNN)-based methods [3, 4], and exemplar-based meth- ods [5, 6, 7]. While most VC researchers ...

work page internal anchor Pith review Pith/arXiv arXiv 2016

[2] [2]

Spectrum Differential based Direct Waveform Modiﬁcation for Voice Conversion (DIFFVC) 2.1. DIFFVC based on DIFFGMM DIFFVC is a conversion framework (not restricted to VC but also other applications like singing VC) that does not employ a parametric vocoder as the waveform generation module [25, 27, 28, 30]. In this section we describe the DIFFVC framework...

work page

[3] [3]

Speciﬁcally, the residual sig- nal is shrunk then up-sampled if the F0 transformation ratio is smaller than 1 and, conversely, expanded then down-sampled if the ratio larger than 1

and resampling process can be performed on the residual signal in order to transform F0. Speciﬁcally, the residual sig- nal is shrunk then up-sampled if the F0 transformation ratio is smaller than 1 and, conversely, expanded then down-sampled if the ratio larger than 1. Finally, the F0 transformed speech is restored by ﬁltering the modiﬁed residual signal...

work page

[4] [4]

Proposed Method based on Residual Transformation Our goal is to extend the vocoder-free DIFFVC framework to any arbitrary VC model, which only knows how to convert nor- mal source features to target features. To impose as few con- straints as possible, we only demand the VC model to estimate Figure 2: The proposed direct waveform modiﬁcation framework, wh...

work page

[5] [5]

Experimental Evaluation 4.1. Experimental settings We evaluated our proposed methods on the SPOKE task of V oice Conversion Challenge 2018 (VCC2018) [36], which in- cluded recordings of professional US English speakers with a sampling rate of 22050 Hz. The dataset consisted of 81/35 ut- terances per speaker for training/testing sets, respectively. We used...

work page 2018

[6] [6]

Conclusions and Future Work In this paper, we introduced a generalization of the DIFFVC framework to make it applicable to general VC models. The proposed method is based on an F0 transformation in the resid- ual domain, so that synthesis ﬁltering is performed directly us- ing the converted spectral features, thus removing the need for the conversion mode...

work page

[7] [7]

Continuous probabilis- tic transform for voice conversion,

Y . Stylianou, O. Cappe, and E. Moulines, “Continuous probabilis- tic transform for voice conversion,”IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2, pp. 131–142, Mar 1998

work page 1998

[8] T. Toda, A. W. Black, and K. Tokuda, “Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222–2235, Nov 2007.

[9] S. Desai, A. W. Black, B. Yegnanarayana, and K. Prahallad, “Spectral mapping using artificial neural networks for voice conversion,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 954–964, July 2010.

[10] L. H. Chen, Z. H. Ling, L. J. Liu, and L. R. Dai, “Voice conversion using deep neural networks with layer-wise generative training,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1859–1872, Dec 2014.

[11] R. Takashima, T. Takiguchi, and Y. Ariki, “Exemplar-based voice conversion in noisy environment,” in Proc. SLT, 2012, pp. 313–317.

[12] Z. Wu, T. Virtanen, E. S. Chng, and H. Li, “Exemplar-based sparse representation with residual compensation for voice conversion,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1506–1521, Oct 2014.

[13] Y.-C. Wu, H.-T. Hwang, C.-C. Hsu, Y. Tsao, and H.-M. Wang, “Locally linear embedding for exemplar-based spectral conversion,” in Proc. Interspeech, 2016, pp. 1652–1656.

[14] B. S. Atal and S. L. Hanauer, “Speech analysis and synthesis by linear prediction of the speech wave,” The Journal of the Acoustical Society of America, vol. 50, no. 2B, pp. 637–655, 1971.

[15] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, “Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based f0 extraction: Possible role of a repetitive structure in sounds,” Speech Communication, vol. 27, no. 3, pp. 187–207, 1999.

[16] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications,” IEICE Transactions on Information and Systems, vol. 99, pp. 1877–1884, 2016.

[17] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, “Speaker-dependent wavenet vocoder,” in Proc. Interspeech, 2017, pp. 1118–1122.

[18] T. Hayashi, A. Tamamori, K. Kobayashi, K. Takeda, and T. Toda, “An investigation of multi-speaker training for wavenet vocoder,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Dec 2017, pp. 712–718.

[19] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. v. d. Oord, S. Dieleman, and K. Kavukcuoglu, “Efficient neural audio synthesis,” arXiv preprint arXiv:1802.08435, 2018.

[20] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio, “Samplernn: An unconditional end-to-end neural audio generation model,” arXiv preprint arXiv:1612.07837, 2016.

[21] Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu, “Fftnet: A real-time speaker-dependent neural vocoder,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 2251–2255.

[22] J.-M. Valin and J. Skoglund, “Lpcnet: Improving neural speech synthesis through linear prediction,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5891–5895.

[23] A. v. d. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. v. d. Driessche, E. Lockhart, L. C. Cobo, F. Stimberg et al., “Parallel wavenet: Fast high-fidelity speech synthesis,” arXiv preprint arXiv:1711.10433, 2017.

[24] R. Prenger, R. Valle, and B. Catanzaro, “Waveglow: A flow-based generative network for speech synthesis,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 3617–3621.

[25] X. Wang, S. Takaki, and J. Yamagishi, “Neural source-filter-based waveform model for statistical parametric speech synthesis,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5916–5920.

[26] K. Kobayashi, T. Hayashi, A. Tamamori, and T. Toda, “Statistical voice conversion with wavenet-based waveform generation,” in Proc. Interspeech, 2017, pp. 1138–1142.

[27] K. Chen, B. Chen, J. Lai, and K. Yu, “High-quality voice conversion using spectrogram-based wavenet vocoder,” in Proc. Interspeech, 2018, pp. 1993–1997.

[28] K. Qian, Y. Zhang, S. Chang, X. Yang, and M. Hasegawa-Johnson, “Zero-shot voice style transfer with only autoencoder loss,” arXiv preprint arXiv:1905.05879, 2019.

[29] J.-X. Zhang, Z.-H. Ling, L.-J. Liu, Y. Jiang, and L.-R. Dai, “Sequence-to-sequence acoustic modeling for voice conversion,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 27, no. 3, pp. 631–644, 2019.

[30] B. Sisman, M. Zhang, S. Sakti, H. Li, and S. Nakamura, “Adaptive wavenet vocoder for residual compensation in gan-based voice conversion,” in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 282–289.

[31] K. Kobayashi, S. Takamichi, S. Nakamura, and T. Toda, “The nu-naist voice conversion system for the voice conversion challenge 2016,” in Interspeech, 2016, pp. 1667–1671.

[32] T. Toda, L.-H. Chen, D. Saito, F. Villavicencio, M. Wester, Z. Wu, and J. Yamagishi, “The voice conversion challenge 2016,” in Interspeech 2016, 2016, pp. 1632–1636.

[33] K. Kobayashi and T. Toda, “sprocket: Open-source voice conversion software,” in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 203–210.

[34] K. Kobayashi, T. Toda, and S. Nakamura, “Implementation of f0 transformation for statistical singing voice conversion based on direct waveform modification,” 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5670–5674, 2016.

[35] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, “Voice conversion from non-parallel corpora using variational auto-encoder,” in Proc. APSIPA ASC, 2016, pp. 1–6.

[36] K. Kobayashi, T. Toda, and S. Nakamura, “F0 transformation techniques for statistical voice conversion with direct waveform modification with spectral differential,” 2016 IEEE Spoken Language Technology Workshop (SLT), pp. 693–700, 2016.

[37] H. Silén, E. Helander, J. Nurminen, and M. Gabbouj, “Ways to implement global variance in statistical speech synthesis,” in Proc. Interspeech, 2012, pp. 1436–1439.

[38] W. Verhelst and M. Roelands, “An overlap-add technique based on waveform similarity (wsola) for high quality time-scale modification of speech,” in ICASSP, 1993.

[39] J. Makhoul and M. Berouti, “High-frequency regeneration in speech coding systems,” in ICASSP ’79. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, April 1979, pp. 428–431.

[40] S. Chennoukh, A. Gerrits, G. Miet, and R. Sluijter, “Speech enhancement via frequency bandwidth extension using line spectral frequencies,” in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), vol. 1. IEEE, 2001, pp. 665–668.

[41] Y.-C. Wu, K. Kobayashi, T. Hayashi, P. L. Tobing, and T. Toda, “Collapsed speech segment detection and suppression for wavenet vocoder,” in Proc. Interspeech 2018, 2018, pp. 1988–1992.

[42] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, “The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods,” in Proc. Odyssey, 2018, pp. 195–202.

[43] W.-C. Huang, Y.-C. Wu, C.-C. Lo, P. Lumban Tobing, T. Hayashi, K. Kobayashi, T. Toda, Y. Tsao, and H.-M. Wang, “Investigation of F0 conditioning and Fully Convolutional Networks in Variational Autoencoder based Voice Conversion,” arXiv e-prints, May 2019.

[44] K. Tokuda, T. Kobayashi, T. Masuko, and S. Imai, “Mel-generalized cepstral analysis - a unified approach to speech spectral estimation,” in ICSLP, 1994.