Non-Parallel Voice Conversion with Cyclic Variational Autoencoder

Kazuhiro Kobayashi; Patrick Lumban Tobing; Tomoki Hayashi; Tomoki Toda; Yi-Chiao Wu

arxiv: 1907.10185 · v1 · pith:EV2AEO6Wnew · submitted 2019-07-24 · 📡 eess.AS · cs.CL · cs.SD

Patrick Lumban Tobing, Yi-Chiao Wu, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda

Pith reviewed 2026-05-24 17:03 UTC · model grok-4.3

classification 📡 eess.AS · cs.CL · cs.SD

keywords voice conversion · variational autoencoder · non-parallel training · cyclic reconstruction · spectral modeling · speech synthesis · latent space

The pith

CycleVAE recycles converted spectra to create direct optimization targets for non-parallel voice conversion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard VAE voice conversion cannot directly optimize converted spectra under non-parallel conditions because no paired target exists for comparison. The proposed method adds a cyclic reconstruction step: converted spectra are fed back into the same encoder-decoder pair to produce cyclic reconstructions that match the original input and therefore can be optimized with a reconstruction loss. The cycle can be repeated, allowing the conversion path itself to be trained indirectly. Experiments show this yields converted spectra with lower error, latent codes with higher correlation across speakers, and converted speech with measurably higher quality and conversion accuracy.

Core claim

In the CycleVAE spectral model, latent features encoded from source spectra are decoded with target speaker codes to produce converted spectra; these converted spectra are then re-encoded and decoded with source speaker codes to produce cyclic reconstructed spectra that are directly compared to the original input spectra, thereby supplying an optimization signal for the otherwise unobservable conversion mapping.
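The flow in the core claim can be sketched in a few lines of NumPy. This is a minimal illustration under loud assumptions: the encoder and decoder are toy linear maps (`W_enc`, `W_dec`), the encoder is deterministic rather than sampled, speaker codes are one-hot vectors, and the loss is a plain mean squared error. None of these names or choices come from the paper; they only make the data flow concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
N_FRAMES, SPEC_DIM, LATENT_DIM, N_SPEAKERS = 8, 5, 3, 2

# Toy linear stand-ins for the encoder and decoder networks (assumption).
W_enc = rng.normal(scale=0.1, size=(SPEC_DIM, LATENT_DIM))
W_dec = rng.normal(scale=0.1, size=(LATENT_DIM + N_SPEAKERS, SPEC_DIM))


def one_hot(speaker: int) -> np.ndarray:
    """One-hot speaker code (hypothetical choice of speaker-coding feature)."""
    code = np.zeros(N_SPEAKERS)
    code[speaker] = 1.0
    return code


def encode(spectra: np.ndarray) -> np.ndarray:
    """Map spectra (frames x SPEC_DIM) to latent features (frames x LATENT_DIM)."""
    return spectra @ W_enc


def decode(latent: np.ndarray, speaker: int) -> np.ndarray:
    """Decode latent features concatenated with a per-frame speaker code."""
    codes = np.tile(one_hot(speaker), (latent.shape[0], 1))
    return np.concatenate([latent, codes], axis=1) @ W_dec


def cycle_flow(source_spectra: np.ndarray, src: int, tgt: int):
    """Return reconstructed, converted, and cyclic reconstructed spectra."""
    z = encode(source_spectra)
    reconstructed = decode(z, src)   # same speaker identity: has a target
    converted = decode(z, tgt)       # other speaker identity: no paired target
    z_cyc = encode(converted)        # recycle the converted spectra
    cyclic = decode(z_cyc, src)      # comparable to the original input again
    return reconstructed, converted, cyclic


x = rng.normal(size=(N_FRAMES, SPEC_DIM))
rec, conv, cyc = cycle_flow(x, src=0, tgt=1)
rec_loss = float(np.mean((rec - x) ** 2))
cyc_loss = float(np.mean((cyc - x) ** 2))
print("reconstruction loss:", rec_loss)
print("cyclic reconstruction loss:", cyc_loss)
```

Only `rec` and `cyc` are compared against `x`; `conv` never appears in a loss directly, which is the point of the cyclic path.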

What carries the argument

The cyclic reconstruction flow that recycles converted spectra back through the model to obtain optimizable cyclic reconstructed spectra.

If this is right

Converted spectra achieve higher accuracy than standard VAE outputs.
Latent features exhibit a higher degree of correlation between source and converted representations.
Perceptual quality and speaker-conversion accuracy of the output speech increase significantly.
The same cyclic mechanism can be iterated for multiple cycles using the reconstructed features as new inputs.
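The last item, iterating the cycle, can be sketched as a loop in which each cyclic reconstruction becomes the input of the next cycle. The function below is a hypothetical helper, not the paper's training code: `encode` and `decode` are passed in as arbitrary callables, and a mean squared error against the original input stands in for whatever reconstruction criterion is actually used.

```python
from typing import Callable, List

import numpy as np

Encoder = Callable[[np.ndarray], np.ndarray]
Decoder = Callable[[np.ndarray, int], np.ndarray]


def multi_cycle_losses(
    x: np.ndarray,
    src: int,
    tgt: int,
    encode: Encoder,
    decode: Decoder,
    n_cycles: int,
) -> List[float]:
    """Run n_cycles conversion cycles and return one loss per cycle."""
    losses = []
    current = x
    for _ in range(n_cycles):
        converted = decode(encode(current), tgt)   # no direct target exists
        current = decode(encode(converted), src)   # cyclic reconstruction
        # Every cyclic reconstruction is compared with the ORIGINAL input x.
        losses.append(float(np.mean((current - x) ** 2)))
    return losses


# Toy stand-ins chosen so the arithmetic is easy to follow (assumption):
# each decode adds the integer speaker id, so each cycle drifts by +1.
toy_encode = lambda s: 0.5 * s
toy_decode = lambda z, speaker: 2.0 * z + speaker

x_demo = np.zeros((4, 3))
cycle_losses = multi_cycle_losses(x_demo, src=0, tgt=1,
                                  encode=toy_encode, decode=toy_decode,
                                  n_cycles=3)
print(cycle_losses)
```

With these toy maps the error grows with every cycle, which shows why summing the per-cycle losses gives the optimizer a signal that passes through the conversion step several times.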

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same recycling idea could supply indirect supervision for other unpaired translation tasks such as unpaired image-to-image style transfer.
Multi-speaker or multi-domain extensions might be trained by chaining cycles across more than two speakers without requiring parallel recordings.
If the cyclic loss is the dominant training signal, the method may reduce reliance on speaker-identity labels during inference.

Load-bearing premise

Optimizing the cyclic reconstructed spectra will improve the underlying conversion mapping even though no parallel source-target pairs are available.

What would settle it

An ablation on a standard non-parallel VC test set in which removing the cyclic reconstruction loss produces no measurable drop in mel-cepstral distortion or subjective conversion scores.
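For readers who want to run such an ablation, mel-cepstral distortion can be computed as below. This is a sketch of the standard formula, assuming the two mel-cepstrum sequences are already time-aligned and have the same shape. Skipping the 0th (energy) coefficient is a common convention and an assumption here, and the optional `speech_mask` mirrors the Figure 2 caption, which states that only speech frames were used.

```python
import numpy as np


def mel_cepstral_distortion(ref, est, speech_mask=None, skip_c0=True) -> float:
    """Mean MCD in dB between two aligned mel-cepstrum sequences.

    ref, est: arrays of shape (frames, coefficients).
    speech_mask: optional boolean array of shape (frames,) selecting speech frames.
    """
    ref = np.asarray(ref, dtype=float)
    est = np.asarray(est, dtype=float)
    if ref.shape != est.shape:
        raise ValueError("ref and est must be time-aligned and equally shaped")
    start = 1 if skip_c0 else 0
    diff = ref[:, start:] - est[:, start:]
    # Per-frame MCD: (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    if speech_mask is not None:
        per_frame = per_frame[np.asarray(speech_mask, dtype=bool)]
    return float(np.mean(per_frame))


ref_mcep = np.zeros((3, 4))
est_mcep = ref_mcep.copy()
est_mcep[:, 1] = 1.0  # differ by 1.0 in a single non-energy coefficient
mcd_value = mel_cepstral_distortion(ref_mcep, est_mcep)
print(mcd_value)
```

Comparing this value for a model trained with and without the cyclic loss, on the same held-out utterances, is the objective half of the test described above.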

Figures

Figures reproduced from arXiv: 1907.10185 by Kazuhiro Kobayashi, Patrick Lumban Tobing, Tomoki Hayashi, Tomoki Toda, Yi-Chiao Wu.

**Figure 1.** Flow of the conventional VAE-based VC (upper part) and of the proposed CycleVAE-based VC. CycleVAE recycles the converted spectra back into the system, so that the conversion flow is indirectly considered in the parameter optimization. A similar idea has also been proposed as a cycle-consistent flow in a self-supervised method for visual correspondence [24]. In the proposed CycleVAE-based VC, the parameter optimization is defined as follows: {θ̂, φ̂} = argmax…

**Figure 2.** Mel-cepstral distortion (mcd) of reconstructed (rec) spectra, estimated using the conventional VAE-based (cyc0) and the proposed CycleVAE-based (cyc3) VC, during 180 training epochs, for training (train) and testing (test) sets. mcds were computed with only the speech frames of the input speech.

**Figure 4.** Cosine similarity (cosine) between latent features of corresponding source and target speech, encoded with the conventional VAE-based (cyc0) and the proposed CycleVAE-based (cyc3) VC, during 180 training epochs, for training (train) and testing (test) sets. cosines were computed, through DTW alignment, with only the speech frames of source and target speech.
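The latent-feature comparison in the Figure 4 caption can be approximated as follows. This is a sketch, not the paper's evaluation script: the dynamic time warping is a plain textbook implementation with Euclidean frame distance, and the choice of which features drive the alignment (here, the same arrays that are compared) is an assumption. The function names are hypothetical.

```python
import numpy as np


def dtw_path(a: np.ndarray, b: np.ndarray):
    """Return the DTW alignment path between sequences a (n x d) and b (m x d)."""
    n, m = len(a), len(b)
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1]
            )
    # Backtrack from the end to recover the frame pairs.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]


def mean_aligned_cosine(latent_src, latent_tgt, path, eps=1e-12) -> float:
    """Average cosine similarity over DTW-aligned latent frame pairs."""
    idx_s, idx_t = zip(*path)
    zs = latent_src[list(idx_s)]
    zt = latent_tgt[list(idx_t)]
    num = np.sum(zs * zt, axis=1)
    den = np.linalg.norm(zs, axis=1) * np.linalg.norm(zt, axis=1) + eps
    return float(np.mean(num / den))


# Toy data: the "target" repeats the first frame of the "source".
src_feat = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
tgt_feat = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
path = dtw_path(src_feat, tgt_feat)
similarity = mean_aligned_cosine(src_feat, tgt_feat, path)
print(path, similarity)
```

In a real evaluation, non-speech frames would be removed before alignment, as the caption indicates.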

Original abstract

In this paper, we present a novel technique for non-parallel voice conversion (VC) with the use of cyclic variational autoencoder (CycleVAE)-based spectral modeling. In a variational autoencoder (VAE) framework, a latent space, usually with a Gaussian prior, is used to encode a set of input features. In a VAE-based VC, the encoded latent features are fed into a decoder, along with speaker-coding features, to generate estimated spectra with either the original speaker identity (reconstructed) or another speaker identity (converted). Due to the non-parallel modeling condition, the converted spectra cannot be directly optimized, which heavily degrades the performance of a VAE-based VC. In this work, to overcome this problem, we propose to use a CycleVAE-based spectral model that indirectly optimizes the conversion flow by recycling the converted features back into the system to obtain corresponding cyclic reconstructed spectra that can be directly optimized. The cyclic flow can be continued by using the cyclic reconstructed features as input for the next cycle. The experimental results demonstrate the effectiveness of the proposed CycleVAE-based VC, which yields higher accuracy of converted spectra, generates latent features with a higher degree of correlation, and significantly improves the quality and conversion accuracy of the converted speech.
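The abstract's description of a latent space with a Gaussian prior corresponds to two standard VAE ingredients, sketched below: sampling latent features with the reparameterization trick, and the closed-form KL divergence between a diagonal Gaussian posterior and a standard normal prior. The function names and the assumption that the encoder outputs a per-frame mean and log-variance are illustrative, not details confirmed by the abstract.

```python
import numpy as np


def sample_latent(mu: np.ndarray, log_var: np.ndarray, rng) -> np.ndarray:
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def kl_to_standard_normal(mu: np.ndarray, log_var: np.ndarray) -> float:
    """Mean over frames of KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    per_frame = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1)
    return float(np.mean(per_frame))


rng = np.random.default_rng(0)
mu = np.ones((4, 3))        # hypothetical encoder means (frames x latent dim)
log_var = np.zeros((4, 3))  # unit variance
z = sample_latent(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
print(z.shape, kl)
```

The KL term keeps the latent features close to the prior, while the speaker code supplied to the decoder decides whether the output is a reconstruction or a conversion.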

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CycleVAE adds a reconstruction cycle to standard VAE voice conversion so the model can optimize the non-parallel path indirectly, and the logic holds without obvious contradictions.

The main point is that this paper takes the usual VAE voice conversion setup and adds cyclic reconstruction to handle the fact that converted spectra have no direct target in non-parallel data. They feed the converted output back in as input to get a cyclic reconstruction that can be optimized with the usual loss, and they allow the cycle to continue. That step is internally consistent and directly targets the optimization problem described in the abstract.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes CycleVAE, an extension of variational autoencoders for non-parallel voice conversion. Standard VAE encoding/decoding with speaker codes is augmented with cyclic reconstruction losses that recycle converted spectra back through the model, allowing indirect optimization of the otherwise unobservable conversion path. The abstract claims this yields higher converted-spectral accuracy, higher latent-feature correlation, and improved perceptual quality/conversion accuracy.

Significance. If the experimental claims hold with appropriate controls, the cyclic-reconstruction device would constitute a practical solution to the non-parallel optimization problem that has limited prior VAE-VC work. The approach is a direct analogue of cycle-consistency losses used successfully in other unpaired translation tasks and could therefore be of interest to the speech-synthesis community.

major comments (1)

Abstract: the central claim that CycleVAE 'yields higher accuracy of converted spectra, generates latent features with higher correlation degree, and significantly improves the quality and conversion accuracy' is unsupported by any numerical results, baselines, dataset description, or statistical tests. Because the manuscript supplies no evidence for the claimed gains, the effectiveness assertion that constitutes the paper's main contribution cannot be evaluated.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and the opportunity to clarify the manuscript. We address the single major comment below.

Referee: Abstract: the central claim that CycleVAE 'yields higher accuracy of converted spectra, generates latent features with higher correlation degree, and significantly improves the quality and conversion accuracy' is unsupported by any numerical results, baselines, dataset description, or statistical tests. Because the manuscript supplies no evidence for the claimed gains, the effectiveness assertion that constitutes the paper's main contribution cannot be evaluated.

Authors: The full manuscript contains a complete experimental section (Section 4) that supplies the supporting evidence for the abstract claims. This includes numerical comparisons of converted spectral accuracy (via mel-cepstral distortion), latent-feature correlations, baseline VAE-VC systems, dataset details (VCC 2018), and subjective quality/conversion-accuracy results with statistical testing. The abstract follows the conventional role of summarizing those findings at a high level rather than repeating the numbers. We can revise the abstract to incorporate selected numerical highlights if the referee prefers greater specificity there.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

The paper describes a CycleVAE architecture that augments a standard VAE ELBO with cycle-consistency reconstruction losses to enable gradient flow through the non-parallel conversion mapping. This is a modeling choice grounded in existing VAE and cycle-consistency techniques rather than any derivation that reduces a claimed prediction or uniqueness result to a fitted parameter or self-citation by construction. No equations are presented that equate an output quantity to its own input via redefinition, and the central performance claims rest on external objective and subjective metrics rather than internal self-consistency alone. The logical structure is therefore self-contained.
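One way to read "a standard VAE ELBO augmented with cycle-consistency reconstruction losses" is the single-cycle objective below. This is a sketch in assumed notation, not the paper's equation: x denotes the input spectra, c_x and c_y the source and target speaker codes, q_φ the encoder, p_θ the decoder, dec_θ its point estimate, and p(z) the Gaussian prior.

```latex
% Assumed notation; a generic single-cycle form, not quoted from the paper.
\begin{align}
z &\sim q_\phi(z \mid x) \\
\hat{y} &= \mathrm{dec}_\theta(z, c_y) \\
z' &\sim q_\phi(z' \mid \hat{y}) \\
\mathcal{L}(\theta, \phi) &=
  \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\big[\log p_\theta(x \mid z, c_x)\big]
  - D_{\mathrm{KL}}\!\big(q_\phi(z \mid x)\,\|\,p(z)\big)}_{\text{standard VAE term}}
  \nonumber \\
&\quad +
  \underbrace{\mathbb{E}_{q_\phi(z' \mid \hat{y})}\!\big[\log p_\theta(x \mid z', c_x)\big]
  - D_{\mathrm{KL}}\!\big(q_\phi(z' \mid \hat{y})\,\|\,p(z)\big)}_{\text{cyclic term}}
\end{align}
```

The cyclic term depends on the converted spectra through z', so maximizing it sends gradients through the conversion step even though the converted spectra themselves have no reference to be compared against.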

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are specified in the provided text.

pith-pipeline@v0.9.0 · 5767 in / 988 out tokens · 27325 ms · 2026-05-24T17:03:45.402424+00:00 · methodology

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 7 internal anchors

[1]

Introduction. Using a voice conversion (VC) system, voice characteristics of a source speaker can be transformed into that of a desired target speaker, while keeping the linguistic contents intact. Such transformation can be achieved, for example, by performing statistical conversion of spectral envelope parameters of the vocal tract spectrum, and a...

[2]

Conventional VAE-based VC. The flow of conventional VAE-based VC is illustrated by the upper part of Fig. 1. Let $X_t = [e_t^{(x)\top}, s_t^{(x)\top}]^\top$, $e_t^{(x)} = [e_t^{(x)}(1), \dots, e_t^{(x)}(D_e)]^\top$, and $s_t^{(x)} = [s_t^{(x)}(1), \dots, s_t^{(x)}(D_s)]^\top$ be the $D_e + D_s$-, $D_e$-, and $D_s$-dimensional feature vectors of the input, the excitation, and the spectra, respectively, ...

[3]

Proposed CycleVAE-based VC. In this paper, to improve the VAE-based VC, as illustrated in Fig. 1, we propose CycleVAE, which is capable of recycling the converted spectra back into the system, so that the conversion flow is indirectly considered in the parameter optimization. A similar idea has also been proposed as a cycle-consistent flow in a self-su...

[4]

Experimental Evaluation. 4.1. Experimental conditions. We used a subset of the Voice Conversion Challenge (VCC) 2018 [25] dataset, which included four speakers, i.e., SF1, SM1, TF1, and TM1. The speaker notations are as follows: S denotes source speaker, T denotes target speaker, F denotes female speaker, and M denotes male speaker. The total number o...

[5]

Conclusions. We have presented a novel framework to improve conventional VAE, for a non-parallel VC, by using a cycle-consistent flow, i.e., the proposed CycleVAE. Specifically, the converted spectra, which are not directly optimized, are recycled back into the system, to generate cyclic reconstructed spectra that can be directly optimized. The cyclic...

[6]

Acknowledgements. This work was partly supported by JST, PRESTO Grant Number JPMJPR1657, and JSPS KAKENHI Grant Number JP17H06101

[7]

A. Kain and M. W. Macon, “Spectral voice conversion for text-to-speech synthesis,” in Proc. ICASSP, Seattle, Washington, USA, May 1998, pp. 285–288

[8]

K. Kobayashi, T. Toda, and S. Nakamura, “Intra-gender statistical singing voice conversion with direct waveform modification using log-spectral differential,” Speech Commun., vol. 99, pp. 211–220, 2018

[9]

A. B. Kain, J.-P. Hosom, X. Niu, J. P. van Santen, M. Fried-Oken, and J. Staehely, “Improving the intelligibility of dysarthric speech,” Speech Commun., vol. 49, no. 9, pp. 743–759, 2007

[10]

K. Tanaka, T. Toda, G. Neubig, S. Sakti, and S. Nakamura, “A hybrid approach to electrolaryngeal speech enhancement based on spectral subtraction and statistical voice conversion,” in Proc. INTERSPEECH, Lyon, France, Sep. 2013, pp. 3067–3071

[11]

Z. Inanoglu and S. Young, “Data-driven emotion conversion in spoken English,” Speech Commun., vol. 51, no. 3, pp. 268–283, 2009

[12]

O. Türk and M. Schröder, “Evaluation of expressive speech synthesis with voice conversion and copy resynthesis techniques,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 18, no. 5, pp. 965–973, 2010

[13]

A. Subramanya, Z. Zhang, Z. Liu, and A. Acero, “Multisensory processing for speech enhancement and magnitude-normalized spectra for speech modeling,” Speech Commun., vol. 50, no. 3, pp. 228–243, 2008

[14]

T. Toda, M. Nakagiri, and K. Shikano, “Statistical voice conversion techniques for body-conducted unvoiced speech enhancement,” IEEE Trans. Audio Speech Lang. Process., vol. 20, no. 9, pp. 2505–2517, 2012

[15]

P. L. Tobing, K. Kobayashi, and T. Toda, “Articulatory controllable speech modification based on statistical inversion and production mappings,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 25, no. 12, pp. 2337–2350, 2017

[16]

Y. Stylianou, O. Cappé, and E. Moulines, “Continuous probabilistic transform for voice conversion,” IEEE Trans. Speech Audio Process., vol. 6, no. 2, pp. 131–142, 1998

[17]

T. Toda, A. W. Black, and K. Tokuda, “Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,” IEEE Trans. Audio Speech Lang. Process., vol. 15, no. 8, pp. 2222–2235, 2007

[18]

D. Erro, A. Moreno, and A. Bonafonte, “INCA algorithm for training voice conversion systems from nonparallel corpora,” IEEE Trans. Speech Audio Process., vol. 18, no. 5, pp. 944–953, 2010

[19]

H. Benisty, D. Malah, and K. Crammer, “Non-parallel voice conversion using joint optimization of alignment by temporal context and spectral distortion,” in Proc. ICASSP, Florence, Italy, May 2014, pp. 7909–7913

[20]

M. Zhang, J. Tao, J. Tian, and X. Wang, “Text-independent voice conversion based on state mapped codebook,” in Proc. ICASSP, Las Vegas, USA, Mar. 2008, pp. 4605–4608

[21]

P. Song, W. Zheng, and L. Zhao, “Non-parallel training for voice conversion based on adaptation method,” in Proc. ICASSP, Vancouver, Canada, May 2013, pp. 6905–6909

[22]

T. Nakashika, T. Takiguchi, and Y. Minami, “Non-parallel training in voice conversion using an adaptive restricted Boltzmann machine,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 24, no. 11, pp. 2032–2045, 2016

[23]

F. Fang, J. Yamagishi, I. Echizen, and J. Lorenzo-Trueba, “High-quality nonparallel voice conversion based on cycle-consistent adversarial network,” in Proc. ICASSP, Calgary, Canada, Apr. 2018, pp. 5279–5283

[24]

H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, “StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks,” in Proc. SLT, Athens, Greece, Dec. 2018, pp. 266–273

[25]

C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, “Voice conversion from non-parallel corpora using variational auto-encoder,” in Proc. APSIPA, Jeju, South Korea, Dec. 2016, pp. 1–6

[26]

C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, “Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks,” in Proc. INTERSPEECH, Stockholm, Sweden, Aug. 2017, pp. 3364–3368

[27]

Y. Saito, Y. Ijima, K. Nishida, and S. Takamichi, “Non-parallel voice conversion using variational autoencoders conditioned by phonetic posteriorgrams and d-vectors,” in Proc. ICASSP, Calgary, Canada, Apr. 2018, pp. 5274–5278

[28]

H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, “ACVAE-VC: Non-parallel many-to-many voice conversion with auxiliary classifier variational autoencoder,” CoRR arXiv preprint arXiv:1808.05092, 2018

[29]

D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” CoRR arXiv preprint arXiv:1312.6114, 2013

[30]

X. Wang, A. Jabri, and A. A. Efros, “Learning correspondence from the cycle-consistency of time,” CoRR arXiv preprint arXiv:1903.07593, 2019

[31]

J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, “The Voice Conversion Challenge 2018: Promoting development of parallel and nonparallel methods,” CoRR arXiv preprint arXiv:1804.04262, 2018

[32]

M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Trans. Inf. Syst., vol. 99, no. 7, pp. 1877–1884, 2016

[33]

K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” CoRR arXiv preprint arXiv:1406.1078, 2014

[34]

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” J. Mach. Learning Res., vol. 15, no. 1, pp. 1929–1958, 2014

[35]

X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proc. AISTATS, vol. 9, Sardinia, Italy, May 2010, pp. 249–256

[36]

D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR arXiv preprint arXiv:1412.6980, 2014

[37]

J. Chorowski, R. J. Weiss, S. Bengio, and A. van den Oord, “Unsupervised speech representation learning using WaveNet autoencoders,” CoRR arXiv preprint arXiv:1901.08810, 2019

[38]

W.-N. Hsu, Y. Zhang, and J. Glass, “Learning latent representations for speech generation and transformation,” CoRR arXiv preprint arXiv:1704.04222, 2017

[39]

A. van den Oord and O. Vinyals, “Neural discrete representation learning,” in Adv. NIPS, Long Beach, USA, Dec. 2017, pp. 6306–6315

[40]

W.-N. Hsu, Y. Zhang, and J. Glass, “Unsupervised learning of disentangled and interpretable representations from sequential data,” in Adv. NIPS, Long Beach, USA, Dec. 2017, pp. 1878–1889

[41]

T. Kinnunen, L. Juvela, P. Alku, and J. Yamagishi, “Non-parallel voice conversion using i-vector PLDA: Towards unifying speaker verification and transformation,” in Proc. ICASSP, New Orleans, USA, Mar. 2017, pp. 5535–5539

[42]

A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” CoRR arXiv preprint arXiv:1609.03499, 2016

[43]

P. L. Tobing, Y.-C. Wu, T. Hayashi, K. Kobayashi, and T. Toda, “Voice conversion with cyclic recurrent neural network and fine-tuned WaveNet vocoder,” in Proc. ICASSP, Brighton, UK, May 2019, pp. 6815–6819

[1] [1]

Introduction Using a voice conversion (VC) system, voice characteristic s of a source speaker can be transformed into that of a desired tar get speaker, while keeping the linguistic contents intact. Such trans- formation can be achieved, for example, by performing stati sti- cal conversion of spectral envelope parameters of the vocal tract spectrum, and a...

work page

[2] [2]

Conventional V AE-based VC The ﬂow of conventional V AE-based VC is illustrated by the upper part of Fig. 1. Let X t = [ e(x)⊤ t , s(x)⊤ t ]⊤ , e(x) t = [ e(x) t (1), . . . , e(x) t (De)]⊤ , and s(x) t = [s(x) t (1), . . . , s(x) t (Ds)]⊤ be the De + Ds, De, and Ds- dimensional feature vectors of the input, the excitation, a nd the spectra, respectively, ...

work page

[3] [3]

Proposed CycleV AE-based VC In this paper, to improve the V AE-based VC, as illustrated in Fig. 1, we propose CycleV AE, which is capable of recycling the converted spectra back into the system, so that the conversi on ﬂow is indirectly considered in the parameter optimization . A similar idea has also been proposed as a cycle-consistent ﬂo w in a self-su...

work page

[4] [4]

Experimental conditions We used a subset of the V oice Conversion Challenge (VCC) 2018 [25] dataset, which included four speakers, i.e., SF1, SM1, TF1, and TM1

Experimental Evaluation 4.1. Experimental conditions We used a subset of the V oice Conversion Challenge (VCC) 2018 [25] dataset, which included four speakers, i.e., SF1, SM1, TF1, and TM1. The speaker notations are as follows: S de- notes source speaker, T denotes target speaker, F denotes fe - male speaker, and M denotes male speaker. The total number o...

work page 2018

[5] [5]

Conclusions We have presented a novel framework to improve conventional V AE, for a non-parallel VC, by using a cycle-consistent ﬂow, i.e., the proposed CycleV AE. Speciﬁcally, the converted sp ec- tra, which is not directly optimized, is recycled back into t he system, to generate cyclic reconstructed spectra that can b e di- rectly optimized. The cyclic...

work page

[6] [6]

Acknowledgements This work was partly supported by JST, PRESTO Grant Number JPMJPR1657, and JSPS KAKENHI Grant Number JP17H06101

work page

[7] [7]

Spectral voice conversion for te xt- to-speech synthesis,

A. Kain and M. W. Macon, “Spectral voice conversion for te xt- to-speech synthesis,” in Proc. ICASSP, Seatle, Washington, USA, May 1998, pp. 285–288

work page 1998

[8] [8]

Intra-gender st atistical singing voice conversion with direct waveform modiﬁcation using log-spectral differential,

K. Kobayashi, T. Toda, and S. Nakamura, “Intra-gender st atistical singing voice conversion with direct waveform modiﬁcation using log-spectral differential,” Speech Commun., vol. 99, pp. 211–220, 2018

work page 2018

[9] [9]

Improving the intelligibility of dy sarthric speech,

A. B. Kain, J.-P . Hosom, X. Niu, J. P . van Santen, M. Fried- Oken, and J. Staehely, “Improving the intelligibility of dy sarthric speech,” Speech Commun., vol. 49, no. 9, pp. 743–759, 2007

work page 2007

[10] [10]

A hybrid approach to electrolaryngeal speech enhancement ba sed on spectral subtraction and statistical voice conversion,

K. Tanaka, T. Toda, G. Neubig, S. Sakti, and S. Nakamura, “ A hybrid approach to electrolaryngeal speech enhancement ba sed on spectral subtraction and statistical voice conversion, ” in Proc. INTERSPEECH, Lyon, France, Sep. 2013, pp. 3067–3071

work page 2013

[11] [11]

Data-driven emotion conversi on in spoken English,

Z. Inanoglu and S. Y oung, “Data-driven emotion conversi on in spoken English,” Speech Commun. , vol. 51, no. 3, pp. 268–283, 2009

work page 2009

[12] [12]

Evaluation of expressive speech syn- thesis with voice conversion and copy resynthesis techniqu es,

O. T ¨ urk and M. Schr¨ oder, “Evaluation of expressive speech syn- thesis with voice conversion and copy resynthesis techniqu es,” IEEE/ACM Trans. Audio Speech Lang. Process. , vol. 18, no. 5, pp. 965–973, 2010

work page 2010

[13] [13]

Multisens ory processing for speech enhancement and magnitude-normaliz ed spectra for speech modeling,

A. Subramanya, Z. Zhang, Z. Liu, and A. Acero, “Multisens ory processing for speech enhancement and magnitude-normaliz ed spectra for speech modeling,” Speech Commun. , vol. 50, no. 3, pp. 228–243, 2008

work page 2008

[14] [14]

Statistical voice con- version techniques for body-conducted unvoiced speech enh ance- ment,

T. Toda, M. Nakagiri, and K. Shikano, “Statistical voice con- version techniques for body-conducted unvoiced speech enh ance- ment,” IEEE Trans. Audio Speech Lang. Process. , vol. 20, no. 9, pp. 2505–2517, 2012

work page 2012

[15] [15]

Articulatory co ntrol- lable speech modiﬁcation based on statistical inversion an d pro- duction mappings,

P . L. Tobing, K. Kobayashi, and T. Toda, “Articulatory co ntrol- lable speech modiﬁcation based on statistical inversion an d pro- duction mappings,” IEEE/ACM Trans. Audio Speech Lang. Pro- cess., vol. 25, no. 12, pp. 2337–2350, 2017

work page 2017

[16] [16]

Continuous probabilis- tic transform for voice conversion,

Y . Stylianou, O. Capp´ e, and E. Moulines, “Continuous probabilis- tic transform for voice conversion,” IEEE Trans. Speech Audio Process., vol. 6, no. 2, pp. 131–142, 1998

work page 1998

[17] [17]

V oice conversion ba sed on maximum-likelihood estimation of spectral parameter tr ajec- tory,

T. Toda, A. W. Black, and K. Tokuda, “V oice conversion ba sed on maximum-likelihood estimation of spectral parameter tr ajec- tory,” IEEE Trans. Audio Speech Lang. Process. , vol. 15, no. 8, pp. 2222–2235, 2007

work page 2007

[18] [18]

INCA algorithm fo r training voice conversion systems from nonparallel corpor a,

D. Erro, A. Moreno, and A. Bonafonte, “INCA algorithm fo r training voice conversion systems from nonparallel corpor a,” IEEE Trans. Speech Audio Process. , vol. 18, no. 5, pp. 944–953, 2010

work page 2010

[19] [19]

Non-parallel voi ce con- version using joint optimization of alignment by temporal c ontext and spectral distortion,

H. Benisty, D. Malah, and K. Crammer, “Non-parallel voi ce con- version using joint optimization of alignment by temporal c ontext and spectral distortion,” in Proc. ICASSP , Florence, Italy, May 2014, pp. 7909–7913

work page 2014

[20] [20]

Text-independen t voice conversion based on state mapped codebook,

M. Zhang, J. Tao, J. Tian, and X. Wang, “Text-independen t voice conversion based on state mapped codebook,” in Proc. ICASSP, Las V egas, USA, Mar. 2008, pp. 4605–4608

work page 2008

[21] [21]

Non-parallel training f or voice conversion based on adaptation method,

P . Song, W. Zheng, and L. Zhao, “Non-parallel training f or voice conversion based on adaptation method,” in Proc. ICASSP, V an- couver, Canada, May 2013, pp. 6905–6909

work page 2013

[22] [22]

Non-parallel training in voice conversion using an adaptive restricted boltzmann ma- chine,

T. Nakashika, T. Takiguchi, and Y . Minami, “Non-parallel training in voice conversion using an adaptive restricted boltzmann ma- chine,” IEEE/ACM Trans. Audio Speech Lang. Process. , vol. 24, no. 11, pp. 2032–2045, 2016

[23] F. Fang, J. Yamagishi, I. Echizen, and J. Lorenzo-Trueba, “High-quality nonparallel voice conversion based on cycle-consistent adversarial network,” in Proc. ICASSP, Calgary, Canada, Apr. 2018, pp. 5279–5283

[24] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, “StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks,” in Proc. SLT, Athens, Greece, Dec. 2018, pp. 266–273

[25] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, “Voice conversion from non-parallel corpora using variational auto-encoder,” in Proc. APSIPA, Jeju, South Korea, Dec. 2016, pp. 1–6

[26] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, “Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks,” in Proc. INTERSPEECH, Stockholm, Sweden, Aug. 2017, pp. 3364–3368

[27] Y. Saito, Y. Ijima, K. Nishida, and S. Takamichi, “Non-parallel voice conversion using variational autoencoders conditioned by phonetic posteriorgrams and d-vectors,” in Proc. ICASSP, Calgary, Canada, Apr. 2018, pp. 5274–5278

[28] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, “ACVAE-VC: Non-parallel many-to-many voice conversion with auxiliary classifier variational autoencoder,” CoRR arXiv preprint arXiv:1808.05092, 2018

[29] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” CoRR arXiv preprint arXiv:1312.6114, 2013

[30] X. Wang, A. Jabri, and A. A. Efros, “Learning correspondence from the cycle-consistency of time,” CoRR arXiv preprint arXiv:1903.07593, 2019

[31] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, “The Voice Conversion Challenge 2018: Promoting development of parallel and nonparallel methods,” CoRR arXiv preprint arXiv:1804.04262, 2018

[32] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Trans. Inf. Syst., vol. 99, no. 7, pp. 1877–1884, 2016

[33] K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” CoRR arXiv preprint arXiv:1406.1078, 2014

[34] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” J. Mach. Learning Res., vol. 15, no. 1, pp. 1929–1958, 2014

[35] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proc. AISTATS, vol. 9, Sardinia, Italy, May 2010, pp. 249–256

[36] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR arXiv preprint arXiv:1412.6980, 2014

[37] J. Chorowski, R. J. Weiss, S. Bengio, and A. van den Oord, “Unsupervised speech representation learning using WaveNet autoencoders,” CoRR arXiv preprint arXiv:1901.08810, 2019

[38] W.-N. Hsu, Y. Zhang, and J. Glass, “Learning latent representations for speech generation and transformation,” CoRR arXiv preprint arXiv:1704.04222, 2017

[39] A. van den Oord and O. Vinyals, “Neural discrete representation learning,” in Adv. NIPS, Long Beach, USA, Dec. 2017, pp. 6306–6315

[40] W.-N. Hsu, Y. Zhang, and J. Glass, “Unsupervised learning of disentangled and interpretable representations from sequential data,” in Adv. NIPS, Long Beach, USA, Dec. 2017, pp. 1878–1889

[41] T. Kinnunen, L. Juvela, P. Alku, and J. Yamagishi, “Non-parallel voice conversion using i-vector PLDA: Towards unifying speaker verification and transformation,” in Proc. ICASSP, New Orleans, USA, Mar. 2017, pp. 5535–5539

[42] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” CoRR arXiv preprint arXiv:1609.03499, 2016

[43] P. L. Tobing, Y.-C. Wu, T. Hayashi, K. Kobayashi, and T. Toda, “Voice conversion with cyclic recurrent neural network and fine-tuned WaveNet vocoder,” in Proc. ICASSP, Brighton, UK, May 2019, pp. 6815–6819
