Non-Parallel Voice Conversion with Cyclic Variational Autoencoder
Pith reviewed 2026-05-24 17:03 UTC · model grok-4.3
The pith
CycleVAE recycles converted spectra to create direct optimization targets for non-parallel voice conversion.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the CycleVAE spectral model, latent features encoded from source spectra are decoded with target speaker codes to produce converted spectra. These converted spectra are then re-encoded and decoded with source speaker codes to produce cyclic reconstructed spectra, which are compared directly to the original input spectra and thereby supply an optimization signal for the otherwise unobservable conversion mapping.
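The data flow in this claim can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `cyclic_flow`, `toy_encode`, and `toy_decode` are hypothetical names, the toy "speaker" is just a constant spectral offset, and the sampling and KL terms of a real VAE are left out.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def cyclic_flow(
    x: Vector,
    src_code: Vector,
    tgt_code: Vector,
    encode: Callable[[Vector], Vector],
    decode: Callable[[Vector, Vector], Vector],
) -> Tuple[Vector, Vector, Vector, float]:
    """One cycle: returns (reconstructed, converted, cyclic, cyclic_loss)."""
    z = encode(x)                        # latent features from source spectra
    reconstructed = decode(z, src_code)  # same-speaker reconstruction
    converted = decode(z, tgt_code)      # no ground truth in non-parallel data
    z_cyc = encode(converted)            # recycle the converted spectra
    cyclic = decode(z_cyc, src_code)     # decode back with the source code
    return reconstructed, converted, cyclic, mse(cyclic, x)


# Toy stand-ins (hypothetical): the "speaker" is a constant spectral offset.
def toy_encode(x: Vector) -> Vector:
    mean = sum(x) / len(x)
    return [v - mean for v in x]  # strip the offset, keep the shape


def toy_decode(z: Vector, code: Vector) -> Vector:
    return [v + code[0] for v in z]  # add the speaker's offset back


x = [1.0, 2.0, 3.0]  # source frame with offset 2.0
x_rec, y_hat, x_cyc, loss = cyclic_flow(x, [2.0], [5.0], toy_encode, toy_decode)
```

In this toy setup the encoder removes the speaker offset perfectly, so the cyclic loss is zero. In a trained model the loss is non-zero, and its gradient passes through the converted spectra, which is the indirect supervision the claim describes.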
What carries the argument
The cyclic reconstruction flow that recycles converted spectra back through the model to obtain optimizable cyclic reconstructed spectra.
If this is right
- Converted spectra achieve higher accuracy than standard VAE outputs.
- Latent features exhibit a higher degree of correlation between source and converted representations.
- Perceptual quality and speaker-conversion accuracy of the output speech increase significantly.
- The same cyclic mechanism can be iterated for multiple cycles using the reconstructed features as new inputs.
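The last point, iterating the cycle, can be sketched as a loop. This is an illustration under stated assumptions: `multi_cycle_losses` and the toy encoder and decoder are hypothetical, and scoring every cycle against the original input (rather than against the previous cycle's output) is an assumption, since this summary does not state the paper's exact targets.

```python
from typing import Callable, List

Vector = List[float]


def multi_cycle_losses(
    x: Vector,
    src_code: Vector,
    tgt_code: Vector,
    encode: Callable[[Vector], Vector],
    decode: Callable[[Vector, Vector], Vector],
    n_cycles: int,
) -> List[float]:
    """Run n_cycles cycles, feeding each cyclic reconstruction into the next."""
    losses = []
    current = x
    for _ in range(n_cycles):
        converted = decode(encode(current), tgt_code)
        cyclic = decode(encode(converted), src_code)
        # Assumption: every cycle is scored against the original input x.
        losses.append(sum((c - t) ** 2 for c, t in zip(cyclic, x)) / len(x))
        current = cyclic  # cyclic reconstructed features become the new input
    return losses


# Hypothetical lossy toy model: the encoder halves, the decoder adds an offset.
def toy_encode(v: Vector) -> Vector:
    return [0.5 * a for a in v]


def toy_decode(z: Vector, code: Vector) -> Vector:
    return [a + code[0] for a in z]


losses = multi_cycle_losses([4.0], [2.0], [0.0], toy_encode, toy_decode, 3)
# losses == [1.0, 1.5625, 1.72265625]
```

With an untrained, lossy model the error grows from cycle to cycle, which is why later cycles can expose drift that a single cycle hides.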
Where Pith is reading between the lines
- The same recycling idea could supply indirect supervision for other unpaired translation tasks such as unpaired image-to-image style transfer.
- Multi-speaker or multi-domain extensions might be trained by chaining cycles across more than two speakers without requiring parallel recordings.
- If the cyclic loss is the dominant training signal, the method may reduce reliance on speaker-identity labels during inference.
Load-bearing premise
Optimizing the cyclic reconstructed spectra will improve the underlying conversion mapping even though no parallel source-target pairs are available.
What would settle it
An ablation on a standard non-parallel VC test set in which removing the cyclic reconstruction loss produces no measurable drop in mel-cepstral distortion or subjective conversion scores.
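For reference, mel-cepstral distortion is commonly computed per frame as (10 / ln 10) * sqrt(2 * sum of squared coefficient differences) and then averaged over frames. The sketch below follows that common definition. The function name is hypothetical, the frames are assumed to be time-aligned already (for example by dynamic time warping), and skipping coefficient 0 is a frequent convention rather than something this summary confirms for the paper.

```python
import math
from typing import Sequence


def mel_cepstral_distortion(
    ref: Sequence[Sequence[float]],
    est: Sequence[Sequence[float]],
    skip_energy: bool = True,
) -> float:
    """Mean MCD in dB over time-aligned frames of mel-cepstral coefficients."""
    if len(ref) != len(est) or not ref:
        raise ValueError("ref and est must be non-empty with equal frame counts")
    start = 1 if skip_energy else 0  # coefficient 0 mostly carries energy
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for r, e in zip(ref, est):
        sq = sum((a - b) ** 2 for a, b in zip(r[start:], e[start:]))
        total += scale * math.sqrt(2.0 * sq)
    return total / len(ref)
```

In the ablation described above, this value would be computed once for the model trained with the cyclic loss and once without it, on the same aligned test frames.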
Original abstract
In this paper, we present a novel technique for non-parallel voice conversion (VC) with the use of cyclic variational autoencoder (CycleVAE)-based spectral modeling. In a variational autoencoder (VAE) framework, a latent space, usually with a Gaussian prior, is used to encode a set of input features. In a VAE-based VC, the encoded latent features are fed into a decoder, along with speaker-coding features, to generate estimated spectra with either the original speaker identity (reconstructed) or another speaker identity (converted). Due to the non-parallel modeling condition, the converted spectra cannot be directly optimized, which heavily degrades the performance of a VAE-based VC. In this work, to overcome this problem, we propose to use a CycleVAE-based spectral model that indirectly optimizes the conversion flow by recycling the converted features back into the system to obtain corresponding cyclic reconstructed spectra that can be directly optimized. The cyclic flow can be continued by using the cyclic reconstructed features as input for the next cycle. The experimental results demonstrate the effectiveness of the proposed CycleVAE-based VC, which yields higher accuracy of converted spectra, generates latent features with higher correlation degree, and significantly improves the quality and conversion accuracy of the converted speech.
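The abstract does not define "correlation degree". One plausible reading, offered here as an assumption only, is the per-dimension Pearson correlation over time between latent features encoded from the source spectra and latent features re-encoded from the converted spectra. The names `pearson` and `mean_latent_correlation` are hypothetical.

```python
import math
from typing import Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0.0 or var_b == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_a * var_b)


def mean_latent_correlation(
    z_src: Sequence[Sequence[float]],
    z_conv: Sequence[Sequence[float]],
) -> float:
    """Average per-dimension correlation over time; inputs are [frames][dims]."""
    dims = len(z_src[0])
    per_dim = [
        pearson([f[d] for f in z_src], [f[d] for f in z_conv])
        for d in range(dims)
    ]
    return sum(per_dim) / dims
```

A value near 1 would indicate that re-encoding the converted spectra lands close to the source latent trajectory, which is the behaviour the abstract attributes to CycleVAE.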
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes CycleVAE, an extension of variational autoencoders for non-parallel voice conversion. Standard VAE encoding/decoding with speaker codes is augmented with cyclic reconstruction losses that recycle converted spectra back through the model, allowing direct optimization of the otherwise unobservable conversion path. The abstract claims this yields higher converted-spectral accuracy, higher latent-feature correlation, and improved perceptual quality/conversion accuracy.
Significance. If the experimental claims hold with appropriate controls, the cyclic-reconstruction device would constitute a practical solution to the non-parallel optimization problem that has limited prior VAE-VC work. The approach is a direct analogue of cycle-consistency losses used successfully in other unpaired translation tasks and could therefore be of interest to the speech-synthesis community.
major comments (1)
- Abstract: the central claim that CycleVAE 'yields higher accuracy of converted spectra, generates latent features with higher correlation degree, and significantly improves the quality and conversion accuracy' is unsupported by any numerical results, baselines, dataset description, or statistical tests. Because the manuscript supplies no evidence for the claimed gains, the effectiveness assertion that constitutes the paper's main contribution cannot be evaluated.
Simulated Author's Rebuttal
We thank the referee for the detailed review and the opportunity to clarify the manuscript. We address the single major comment below.
- Referee: Abstract: the central claim that CycleVAE 'yields higher accuracy of converted spectra, generates latent features with higher correlation degree, and significantly improves the quality and conversion accuracy' is unsupported by any numerical results, baselines, dataset description, or statistical tests. Because the manuscript supplies no evidence for the claimed gains, the effectiveness assertion that constitutes the paper's main contribution cannot be evaluated.
Authors: The full manuscript contains a complete experimental section (Section 4) that supplies the supporting evidence for the abstract claims. This includes numerical comparisons of converted spectral accuracy (via mel-cepstral distortion), latent-feature correlations, baseline VAE-VC systems, dataset details (VCC 2018), and subjective quality/conversion-accuracy results with statistical testing. The abstract follows the conventional role of summarizing those findings at a high level rather than repeating the numbers. We can revise the abstract to incorporate selected numerical highlights if the referee prefers greater specificity there.
Revision: partial
Circularity Check
No significant circularity detected
The paper describes a CycleVAE architecture that augments a standard VAE ELBO with cycle-consistency reconstruction losses to enable gradient flow through the non-parallel conversion mapping. This is a modeling choice grounded in existing VAE and cycle-consistency techniques rather than any derivation that reduces a claimed prediction or uniqueness result to a fitted parameter or self-citation by construction. No equations are presented that equate an output quantity to its own input via redefinition, and the central performance claims rest on external objective and subjective metrics rather than internal self-consistency alone. The logical structure is therefore self-contained.
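The structure described in this rationale can be written out as follows. This is a sketch in notation chosen here, not the paper's: $x$ is the input spectra, $c^{(x)}$ and $c^{(y)}$ are the source and target speaker codes, $q_\phi$ is the encoder, $p_\theta$ is the decoder, $g_\theta$ is the decoder mean, and $p(z)$ is the Gaussian prior. How the paper weights or sums the terms is not stated in this summary, so the unweighted sum is an assumption.

```latex
% Conditional VAE objective for same-speaker reconstruction (standard ELBO)
\mathcal{L}(\theta, \phi; x, c^{(x)})
  = \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(x \mid z, c^{(x)}) \right]
  - D_{\mathrm{KL}}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right)

% Conversion path: no target is available in non-parallel data
\hat{y} = g_\theta(z, c^{(y)}), \qquad z \sim q_\phi(z \mid x)

% Cyclic path: re-encode the converted spectra and score against x
\mathcal{L}_{\mathrm{cyc}}(\theta, \phi; x, c^{(x)}, c^{(y)})
  = \mathbb{E}_{q_\phi(z' \mid \hat{y})}\!\left[ \log p_\theta(x \mid z', c^{(x)}) \right]
  - D_{\mathrm{KL}}\!\left( q_\phi(z' \mid \hat{y}) \,\|\, p(z) \right)

% Objective to maximize (one cycle shown; further cycles add analogous terms)
\mathcal{L} + \mathcal{L}_{\mathrm{cyc}}
```

Because $\hat{y}$ depends on $\theta$ and $\phi$, the gradient of the cyclic term reaches the conversion path even though $\hat{y}$ itself has no target.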
Reference graph
Works this paper leans on
- [1] Introduction. Using a voice conversion (VC) system, voice characteristics of a source speaker can be transformed into that of a desired target speaker, while keeping the linguistic contents intact. Such transformation can be achieved, for example, by performing statistical conversion of spectral envelope parameters of the vocal tract spectrum, and a...
- [2] Conventional VAE-based VC. The flow of conventional VAE-based VC is illustrated by the upper part of Fig. 1. Let $\mathbf{X}_t = [\mathbf{e}^{(x)\top}_t, \mathbf{s}^{(x)\top}_t]^\top$, $\mathbf{e}^{(x)}_t = [e^{(x)}_t(1), \dots, e^{(x)}_t(D_e)]^\top$, and $\mathbf{s}^{(x)}_t = [s^{(x)}_t(1), \dots, s^{(x)}_t(D_s)]^\top$ be the $(D_e + D_s)$-, $D_e$-, and $D_s$-dimensional feature vectors of the input, the excitation, and the spectra, respectively, ...
- [3] Proposed CycleVAE-based VC. In this paper, to improve the VAE-based VC, as illustrated in Fig. 1, we propose CycleVAE, which is capable of recycling the converted spectra back into the system, so that the conversion flow is indirectly considered in the parameter optimization. A similar idea has also been proposed as a cycle-consistent flow in a self-su...
- [4] Experimental Evaluation. 4.1. Experimental conditions. We used a subset of the Voice Conversion Challenge (VCC) 2018 [25] dataset, which included four speakers, i.e., SF1, SM1, TF1, and TM1. The speaker notations are as follows: S denotes source speaker, T denotes target speaker, F denotes female speaker, and M denotes male speaker. The total number o...
- [5] Conclusions. We have presented a novel framework to improve conventional VAE, for a non-parallel VC, by using a cycle-consistent flow, i.e., the proposed CycleVAE. Specifically, the converted spectra, which are not directly optimized, are recycled back into the system, to generate cyclic reconstructed spectra that can be directly optimized. The cyclic...
- [6] Acknowledgements. This work was partly supported by JST, PRESTO Grant Number JPMJPR1657, and JSPS KAKENHI Grant Number JP17H06101.
- [7] A. Kain and M. W. Macon, “Spectral voice conversion for text-to-speech synthesis,” in Proc. ICASSP, Seattle, Washington, USA, May 1998, pp. 285–288.
- [8] K. Kobayashi, T. Toda, and S. Nakamura, “Intra-gender statistical singing voice conversion with direct waveform modification using log-spectral differential,” Speech Commun., vol. 99, pp. 211–220, 2018.
- [9] A. B. Kain, J.-P. Hosom, X. Niu, J. P. van Santen, M. Fried-Oken, and J. Staehely, “Improving the intelligibility of dysarthric speech,” Speech Commun., vol. 49, no. 9, pp. 743–759, 2007.
- [10] K. Tanaka, T. Toda, G. Neubig, S. Sakti, and S. Nakamura, “A hybrid approach to electrolaryngeal speech enhancement based on spectral subtraction and statistical voice conversion,” in Proc. INTERSPEECH, Lyon, France, Sep. 2013, pp. 3067–3071.
- [11] Z. Inanoglu and S. Young, “Data-driven emotion conversion in spoken English,” Speech Commun., vol. 51, no. 3, pp. 268–283, 2009.
- [12] O. Türk and M. Schröder, “Evaluation of expressive speech synthesis with voice conversion and copy resynthesis techniques,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 18, no. 5, pp. 965–973, 2010.
- [13] A. Subramanya, Z. Zhang, Z. Liu, and A. Acero, “Multisensory processing for speech enhancement and magnitude-normalized spectra for speech modeling,” Speech Commun., vol. 50, no. 3, pp. 228–243, 2008.
- [14] T. Toda, M. Nakagiri, and K. Shikano, “Statistical voice conversion techniques for body-conducted unvoiced speech enhancement,” IEEE Trans. Audio Speech Lang. Process., vol. 20, no. 9, pp. 2505–2517, 2012.
- [15] P. L. Tobing, K. Kobayashi, and T. Toda, “Articulatory controllable speech modification based on statistical inversion and production mappings,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 25, no. 12, pp. 2337–2350, 2017.
- [16] Y. Stylianou, O. Cappé, and E. Moulines, “Continuous probabilistic transform for voice conversion,” IEEE Trans. Speech Audio Process., vol. 6, no. 2, pp. 131–142, 1998.
- [17] T. Toda, A. W. Black, and K. Tokuda, “Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,” IEEE Trans. Audio Speech Lang. Process., vol. 15, no. 8, pp. 2222–2235, 2007.
- [18] D. Erro, A. Moreno, and A. Bonafonte, “INCA algorithm for training voice conversion systems from nonparallel corpora,” IEEE Trans. Speech Audio Process., vol. 18, no. 5, pp. 944–953, 2010.
- [19] H. Benisty, D. Malah, and K. Crammer, “Non-parallel voice conversion using joint optimization of alignment by temporal context and spectral distortion,” in Proc. ICASSP, Florence, Italy, May 2014, pp. 7909–7913.
- [20] M. Zhang, J. Tao, J. Tian, and X. Wang, “Text-independent voice conversion based on state mapped codebook,” in Proc. ICASSP, Las Vegas, USA, Mar. 2008, pp. 4605–4608.
- [21] P. Song, W. Zheng, and L. Zhao, “Non-parallel training for voice conversion based on adaptation method,” in Proc. ICASSP, Vancouver, Canada, May 2013, pp. 6905–6909.
- [22] T. Nakashika, T. Takiguchi, and Y. Minami, “Non-parallel training in voice conversion using an adaptive restricted Boltzmann machine,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 24, no. 11, pp. 2032–2045, 2016.
- [23] F. Fang, J. Yamagishi, I. Echizen, and J. Lorenzo-Trueba, “High-quality nonparallel voice conversion based on cycle-consistent adversarial network,” in Proc. ICASSP, Calgary, Canada, Apr. 2018, pp. 5279–5283.
- [24] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, “StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks,” in Proc. SLT, Athens, Greece, Dec. 2018, pp. 266–273.
- [25] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, “Voice conversion from non-parallel corpora using variational auto-encoder,” in Proc. APSIPA, Jeju, South Korea, Dec. 2016, pp. 1–6.
- [26] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, “Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks,” in Proc. INTERSPEECH, Stockholm, Sweden, Aug. 2017, pp. 3364–3368.
- [27] Y. Saito, Y. Ijima, K. Nishida, and S. Takamichi, “Non-parallel voice conversion using variational autoencoders conditioned by phonetic posteriorgrams and d-vectors,” in Proc. ICASSP, Calgary, Canada, Apr. 2018, pp. 5274–5278.
- [28] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, “ACVAE-VC: Non-parallel many-to-many voice conversion with auxiliary classifier variational autoencoder,” CoRR arXiv preprint arXiv:1808.05092, 2018.
- [29] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” CoRR arXiv preprint arXiv:1312.6114, 2013.
- [30] X. Wang, A. Jabri, and A. A. Efros, “Learning correspondence from the cycle-consistency of time,” CoRR arXiv preprint arXiv:1903.07593, 2019.
- [31] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, “The Voice Conversion Challenge 2018: Promoting development of parallel and nonparallel methods,” CoRR arXiv preprint arXiv:1804.04262, 2018.
- [32] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Trans. Inf. Syst., vol. 99, no. 7, pp. 1877–1884, 2016.
- [33] K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” CoRR arXiv preprint arXiv:1406.1078, 2014.
- [34] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” J. Mach. Learning Res., vol. 15, no. 1, pp. 1929–1958, 2014.
- [35] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proc. AISTATS, vol. 9, Sardinia, Italy, May 2010, pp. 249–256.
- [36] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR arXiv preprint arXiv:1412.6980, 2014.
- [37] J. Chorowski, R. J. Weiss, S. Bengio, and A. van den Oord, “Unsupervised speech representation learning using WaveNet autoencoders,” CoRR arXiv preprint arXiv:1901.08810, 2019.
- [38] W.-N. Hsu, Y. Zhang, and J. Glass, “Learning latent representations for speech generation and transformation,” CoRR arXiv preprint arXiv:1704.04222, 2017.
- [39] A. van den Oord and O. Vinyals, “Neural discrete representation learning,” in Adv. NIPS, Long Beach, USA, Dec. 2017, pp. 6306–6315.
- [40] W.-N. Hsu, Y. Zhang, and J. Glass, “Unsupervised learning of disentangled and interpretable representations from sequential data,” in Adv. NIPS, Long Beach, USA, Dec. 2017, pp. 1878–1889.
- [41] T. Kinnunen, L. Juvela, P. Alku, and J. Yamagishi, “Non-parallel voice conversion using i-vector PLDA: Towards unifying speaker verification and transformation,” in Proc. ICASSP, New Orleans, USA, Mar. 2017, pp. 5535–5539.
- [42] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” CoRR arXiv preprint arXiv:1609.03499, 2016.
- [43] P. L. Tobing, Y.-C. Wu, T. Hayashi, K. Kobayashi, and T. Toda, “Voice conversion with cyclic recurrent neural network and fine-tuned WaveNet vocoder,” in Proc. ICASSP, Brighton, UK, May 2019, pp. 6815–6819.