
arxiv: 1907.11898 · v1 · pith:LJMPJLMKnew · submitted 2019-07-27 · 📡 eess.AS · eess.SP

Generalization of Spectrum Differential based Direct Waveform Modification for Voice Conversion

Pith reviewed 2026-05-24 14:49 UTC · model grok-4.3

classification 📡 eess.AS eess.SP
keywords voice conversion · DIFFVC · spectral conversion · direct waveform modification · F0 transformation · residual domain · VAE · vocoder-free synthesis

The pith

Inverse and synthesis filtering on residuals lets any spectral conversion model generate waveforms directly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper modifies the DIFFVC method for voice conversion so that it works with any spectral conversion model. The modified pipeline performs inverse filtering on the input speech, transforms F0 in the residual domain, and applies synthesis filtering using the already-converted spectral features. This removes the requirement that the conversion model predict spectral differentials or be retrained for each conversion pair. When tested on a non-parallel VAE-based spectral model, the approach produces higher-quality output than a vocoder-based baseline while keeping rich spectral details.
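The paper excerpts further down describe the F0 transformation as a time-scale modification followed by resampling of the residual. The resampling half can be sketched as below. This is a simplified illustration, not the authors' implementation: the time-scale step (for example WSOLA) is omitted, a pure sine stands in for the residual, and `resample_linear` and `count_positive_zero_crossings` are hypothetical helper names.

```python
import math

def resample_linear(x, ratio):
    """Read x at a step of `ratio` samples using linear interpolation.

    ratio > 1 shortens the signal and raises its pitch by `ratio`;
    ratio < 1 lengthens it and lowers the pitch.
    """
    n_out = int((len(x) - 1) / ratio) + 1
    out = []
    for i in range(n_out):
        pos = i * ratio
        k = int(pos)
        frac = pos - k
        nxt = x[k + 1] if k + 1 < len(x) else x[k]
        out.append((1.0 - frac) * x[k] + frac * nxt)
    return out

def count_positive_zero_crossings(x):
    """Count upward zero crossings, a crude pitch estimate for clean tones."""
    return sum(1 for a, b in zip(x, x[1:]) if a < 0.0 <= b)

fs = 16000          # assumed sampling rate for the toy example
f0 = 100.0          # pitch of the stand-in "residual"
residual = [math.sin(2.0 * math.pi * f0 * n / fs) for n in range(fs)]

ratio = 1.5         # assumed F0 transformation ratio
shifted = resample_linear(residual, ratio)

# Estimated pitch of the resampled signal, in Hz.
est_f0 = count_positive_zero_crossings(shifted) * fs / len(shifted)
```

Resampling alone changes the duration along with the pitch, which is why the paper pairs it with a time-scale modification in the opposite direction: the signal is first stretched or shrunk at constant pitch so that the resampled result returns to the original length.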

Core claim

By performing inverse filtering on the input signal followed by synthesis filtering on the F0 transformed residual signal using the converted spectral features directly, the spectral conversion model does not need to be retrained or capable of predicting the spectral differential.
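The contrast between the two pipelines can be written compactly. The notation below is illustrative and not taken from the paper: H_x is the spectral envelope filter derived from the source features, H_ŷ the filter derived from the converted features, D the spectral differential filter, and T_F0 the F0 transformation. The z-domain form treats each analysis frame as stationary.

```latex
% Illustrative notation only; symbols are assumptions, not the paper's own.
\begin{align}
  \text{original DIFFVC:} \quad
    & x' = T_{F0}(x), \qquad \hat{Y}(z) = D(z)\, X'(z), \\
  \text{proposed:} \quad
    & E(z) = \frac{X(z)}{H_x(z)}, \qquad e' = T_{F0}(e), \qquad
      \hat{Y}(z) = H_{\hat{y}}(z)\, E'(z).
\end{align}
```

In the first line the conversion model has to supply D for the F0-transformed input. In the second it only has to supply the features that define H_ŷ, which is what a spectral conversion model already outputs.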

What carries the argument

F0 transformation in the residual domain through inverse filtering followed by synthesis filtering with converted spectral features.
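A minimal sketch of the inverse-filter and synthesis-filter pair follows. It makes several loud assumptions: linear prediction (LPC) is used as a stand-in for the mel-cepstral filtering the paper works with, the whole toy signal is treated as one frame, and the "converted" coefficients `a_conv` are simply made up rather than produced by a conversion model. All function names are hypothetical.

```python
import random

def autocorr(x, order):
    """Autocorrelation r[0..order] of the whole signal."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(order + 1)]

def levinson(r, order):
    """Levinson-Durbin recursion; returns [1, a1, ..., ap] of A(z)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a

def inverse_filter(x, a):
    """FIR filter A(z): e[n] = sum_k a[k] * x[n-k]."""
    return [sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(x))]

def synthesis_filter(e, a):
    """All-pole filter 1/A(z): y[n] = e[n] - sum_{k>=1} a[k] * y[n-k]."""
    y = []
    for n in range(len(e)):
        acc = e[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc)
    return y

random.seed(0)
# Toy "source speech": impulse train plus a little noise through an all-pole filter.
excitation = [(1.0 if n % 80 == 0 else 0.0) + 0.01 * random.gauss(0.0, 1.0)
              for n in range(400)]
a_src = [1.0, -1.3, 0.8]    # hypothetical source envelope (stable poles)
a_conv = [1.0, -0.9, 0.6]   # hypothetical converted envelope (stable poles)
x = synthesis_filter(excitation, a_src)

a_est = levinson(autocorr(x, 2), 2)          # analysis of the source signal
residual = inverse_filter(x, a_est)          # inverse filtering
# An F0 transformation of `residual` would be applied at this point.
x_rec = synthesis_filter(residual, a_est)    # same envelope: recovers x
y_conv = synthesis_filter(residual, a_conv)  # converted envelope applied directly
```

Filtering the residual with the same coefficients returns the input up to rounding error, while filtering it with different coefficients imposes a new envelope on the same excitation. That is the property the proposed pipeline relies on when it substitutes converted spectral features at the synthesis stage.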

If this is right

  • Any spectral conversion model can serve as the waveform generation module without extra training steps.
  • Non-parallel models such as VAE-based converters become compatible with direct waveform modification.
  • The need for parallel training data and pair-specific retraining disappears.
  • Waveform output avoids vocoder processing while retaining spectral detail.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The filtering steps could be applied to other residual-based signal processing tasks outside voice conversion.
  • Optimizing the inverse and synthesis filters might enable lower-latency conversion pipelines.
  • The approach may reduce training data requirements for new voice pairs in deployment.

Load-bearing premise

Synthesis filtering applied to the F0-transformed residual using converted spectral features will preserve waveform quality and avoid artifacts that differential prediction would have handled.

What would settle it

A side-by-side listening test or objective quality measure on the same spectral conversion model showing that the proposed method produces more artifacts or lower naturalness than the original DIFFVC that requires explicit differential estimation.

Figures

Figures reproduced from arXiv: 1907.11898 by Hsin-Min Wang, Hsin-Te Hwang, Kazuhiro Kobayashi, Patrick Lumban Tobing, Tomoki Toda, Wen-Chin Huang, Yi-Chiao Wu, Yu-Huai Peng, Yu Tsao.

Figure 1: The spectrum differential based direct waveform modification for vocoder-free voice conversion, where the spectrum differential is estimated using a Gaussian mixture model (GMM). Such a GMM is termed DIFFGMM.
Figure 2: The proposed direct waveform modification framework, where the F0 transformation is realized in the residual domain. sig, res, mcp, and env represent the waveform signal, residual signal, mel-cepstrum, and envelope signal, respectively.
Figure 3: An illustration of the calculated envelopes env_W (e_world), env_GV (e_gv), and env_DIFF (e_diff), which are from the converted sample SF1-TF1-30006. The red dashed line denotes the threshold, which is set to 10000 here.
Figure 4: The top and bottom rows show spectrograms of the source speech and of the speech converted using our method, respectively. Top left: SF1-30001. Top right: SM1-30001. Bottom left: SF1-TF1-30001. Bottom right: SM1-TF1-30001.
Original abstract

We present a modification to the spectrum differential based direct waveform modification for voice conversion (DIFFVC) so that it can be directly applied as a waveform generation module to voice conversion models. The recently proposed DIFFVC avoids the use of a vocoder, meanwhile preserves rich spectral details hence capable of generating high quality converted voice. To apply the DIFFVC framework, a model that can estimate the spectral differential from the F0 transformed input speech needs to be trained beforehand. This requirement imposes several constraints, including a limitation on the estimation model to parallel training and the need of extra training on each conversion pair, which make DIFFVC inflexible. Based on the above motivations, we propose a new DIFFVC framework based on an F0 transformation in the residual domain. By performing inverse filtering on the input signal followed by synthesis filtering on the F0 transformed residual signal using the converted spectral features directly, the spectral conversion model does not need to be retrained or capable of predicting the spectral differential. We describe several details that need to be taken care of under this modification, and by applying our proposed method to a non-parallel, variational autoencoder (VAE)-based spectral conversion model, we demonstrate that this framework can be generalized to any spectral conversion model, and experimental evaluations show that it can outperform a baseline framework whose waveform generation process is carried out by a vocoder.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a modification to the spectrum differential based direct waveform modification (DIFFVC) for voice conversion. By performing inverse filtering on the input signal, F0 transformation on the residual domain, and synthesis filtering using converted spectral features directly, the approach eliminates the need for a separate model to estimate spectral differentials or for retraining the spectral conversion model on each pair. The authors apply this to a non-parallel VAE-based spectral conversion model and claim that it generalizes to arbitrary spectral models while outperforming a vocoder-based baseline in experimental evaluations.

Significance. If the quality preservation holds without the original differential correction, the framework would increase flexibility for high-quality waveform generation across diverse spectral conversion models, removing constraints like parallel training data requirements.

major comments (2)
  1. [Abstract] Abstract: the claim that experiments 'show that it can outperform a baseline framework whose waveform generation process is carried out by a vocoder' provides no metrics, controls, ablation details, or evaluation protocol, which is load-bearing for the generalization claim.
  2. [Abstract] Abstract (method description): the premise that synthesis filtering on the F0-transformed residual using directly converted (non-differential) spectral features will preserve waveform quality and avoid artifacts previously compensated by explicit differential estimation is unverified; no comparison to original DIFFVC or quantitative evidence of artifact-free output is referenced.
minor comments (1)
  1. [Abstract] Abstract: the reference to 'several details that need to be taken care of under this modification' is vague and should be expanded with explicit pointers to the relevant implementation steps or sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address the major comments point by point below, clarifying the manuscript's contributions and experimental support while noting where revisions could strengthen the presentation.

  1. Referee: [Abstract] Abstract: the claim that experiments 'show that it can outperform a baseline framework whose waveform generation process is carried out by a vocoder' provides no metrics, controls, ablation details, or evaluation protocol, which is load-bearing for the generalization claim.

    Authors: The abstract is intended as a concise summary. The full manuscript details the experimental protocol, including application to a non-parallel VAE spectral conversion model, objective measures (e.g., MCD), subjective evaluations, and direct comparison against the vocoder baseline under matched conditions. We agree the abstract could be strengthened by briefly referencing key results and will revise it accordingly. revision: partial

  2. Referee: [Abstract] Abstract (method description): the premise that synthesis filtering on the F0-transformed residual using directly converted (non-differential) spectral features will preserve waveform quality and avoid artifacts previously compensated by explicit differential estimation is unverified; no comparison to original DIFFVC or quantitative evidence of artifact-free output is referenced.

    Authors: Section 3 explains the residual-domain approach: inverse filtering isolates the excitation, F0 is transformed there, and converted spectra are used directly for synthesis filtering. This design removes the differential estimation step and its associated compensation. Experiments with the VAE model show the resulting quality exceeds the vocoder baseline. A head-to-head comparison with original DIFFVC is not performed because the original requires parallel data and pair-specific differential models—the very constraints our generalization removes. The reported results provide indirect quantitative support via the baseline comparison and lack of reported artifacts. revision: no
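The rebuttal names MCD (mel-cepstral distortion) as an objective measure without defining it. The commonly used definition is sketched below as an illustration only: whether the paper uses exactly this variant, and how frames are aligned there, is not stated in this page, so time-aligned frames of equal count are assumed and `mel_cepstral_distortion` is a hypothetical helper name.

```python
import math

def mel_cepstral_distortion(ref_frames, conv_frames):
    """Mean MCD in dB over time-aligned frames.

    Each frame is a list of mel-cepstral coefficients; the 0th (energy)
    coefficient is skipped, as is common practice.
    """
    if len(ref_frames) != len(conv_frames) or not ref_frames:
        raise ValueError("frame sequences must be non-empty and time-aligned")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, conv in zip(ref_frames, conv_frames):
        sq = sum((r - c) ** 2 for r, c in zip(ref[1:], conv[1:]))
        total += scale * math.sqrt(2.0 * sq)
    return total / len(ref_frames)

# Made-up two-frame example; the values carry no experimental meaning.
target = [[0.5, 1.0, 0.0], [0.4, 0.0, 1.0]]
converted = [[0.1, 0.0, 0.0], [0.9, 0.0, 1.0]]
mcd = mel_cepstral_distortion(target, converted)
```

Lower values mean the converted spectral envelope is closer to the target. Because MCD only measures envelope distance, it does not by itself capture waveform artifacts, which is why listening tests remain relevant to the referee's second comment.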

Circularity Check

0 steps flagged

No significant circularity; pipeline rearrangement is self-contained


The paper proposes a signal-processing modification to DIFFVC: inverse filtering of input, F0 transform on residual, then synthesis filtering with directly converted spectral features. This rearrangement is described via standard operations and does not reduce any claimed prediction or result to a fitted parameter, self-definition, or self-citation chain. The central claim (generalization to arbitrary spectral models without differential estimation) is presented as an engineering change whose validity is checked by experiments on a VAE model; no load-bearing step equates output to input by construction. Minor self-citations to prior DIFFVC work are not used to import uniqueness theorems or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on standard digital signal processing assumptions for speech; no new entities or fitted constants are introduced in the abstract description.

axioms (1)
  • [standard math] Inverse filtering and synthesis filtering operations are invertible and preserve sufficient information for high-quality waveform reconstruction when applied to speech residuals.
    Invoked when the paper states that synthesis filtering on the F0-transformed residual using converted spectral features replaces the original differential estimation.

pith-pipeline@v0.9.0 · 5811 in / 1157 out tokens · 23150 ms · 2026-05-24T14:49:29.773566+00:00 · methodology


Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 5 internal anchors

  1. [1]

    Generalization of Spectrum Differential based Direct Waveform Modification for Voice Conversion

    Introduction. Voice conversion (VC) aims to convert the speech from a source to that of a target without changing the linguistic content. Numerous approaches have been proposed, such as Gaussian mixture model (GMM)-based methods [1, 2], deep neural network (DNN)-based methods [3, 4], and exemplar-based methods [5, 6, 7]. While most VC researchers ...

  2. [2]

    Spectrum Differential based Direct Waveform Modification for Voice Conversion (DIFFVC). 2.1. DIFFVC based on DIFFGMM. DIFFVC is a conversion framework (not restricted to VC but also other applications like singing VC) that does not employ a parametric vocoder as the waveform generation module [25, 27, 28, 30]. In this section we describe the DIFFVC framework...

  3. [3]

    Specifically, the residual signal is shrunk then up-sampled if the F0 transformation ratio is smaller than 1 and, conversely, expanded then down-sampled if the ratio is larger than 1

    and resampling process can be performed on the residual signal in order to transform F0. Specifically, the residual signal is shrunk then up-sampled if the F0 transformation ratio is smaller than 1 and, conversely, expanded then down-sampled if the ratio is larger than 1. Finally, the F0 transformed speech is restored by filtering the modified residual signal...

  4. [4]

    Proposed Method based on Residual Transformation. Our goal is to extend the vocoder-free DIFFVC framework to any arbitrary VC model, which only knows how to convert normal source features to target features. To impose as few constraints as possible, we only demand the VC model to estimate ...

  5. [5]

    Experimental Evaluation. 4.1. Experimental settings. We evaluated our proposed methods on the SPOKE task of Voice Conversion Challenge 2018 (VCC2018) [36], which included recordings of professional US English speakers with a sampling rate of 22050 Hz. The dataset consisted of 81/35 utterances per speaker for training/testing sets, respectively. We used...

  6. [6]

    Conclusions and Future Work. In this paper, we introduced a generalization of the DIFFVC framework to make it applicable to general VC models. The proposed method is based on an F0 transformation in the residual domain, so that synthesis filtering is performed directly using the converted spectral features, thus removing the need for the conversion mode...

  7. [7]

    Continuous probabilistic transform for voice conversion,

    Y. Stylianou, O. Cappe, and E. Moulines, “Continuous probabilistic transform for voice conversion,” IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2, pp. 131–142, Mar 1998

  8. [8]

    Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,

    T. Toda, A. W. Black, and K. Tokuda, “Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222–2235, Nov 2007

  9. [9]

    Spectral mapping using artificial neural networks for voice conversion,

    S. Desai, A. W. Black, B. Yegnanarayana, and K. Prahallad, “Spectral mapping using artificial neural networks for voice conversion,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 954–964, July 2010

  10. [10]

    Voice conversion using deep neural networks with layer-wise generative training,

    L. H. Chen, Z. H. Ling, L. J. Liu, and L. R. Dai, “Voice conversion using deep neural networks with layer-wise generative training,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1859–1872, Dec 2014

  11. [11]

    Exemplar-based voice conversion in noisy environment,

    R. Takashima, T. Takiguchi, and Y. Ariki, “Exemplar-based voice conversion in noisy environment,” in Proc. SLT, 2012, pp. 313–317

  12. [12]

    Exemplar-based sparse representation with residual compensation for voice conversion,

    Z. Wu, T. Virtanen, E. S. Chng, and H. Li, “Exemplar-based sparse representation with residual compensation for voice conversion,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1506–1521, Oct 2014

  13. [13]

    Locally linear embedding for exemplar-based spectral conversion,

    Y.-C. Wu, H.-T. Hwang, C.-C. Hsu, Y. Tsao, and H.-M. Wang, “Locally linear embedding for exemplar-based spectral conversion,” in Proc. Interspeech, 2016, pp. 1652–1656

  14. [14]

    Speech analysis and synthesis by linear prediction of the speech wave,

    B. S. Atal and S. L. Hanauer, “Speech analysis and synthesis by linear prediction of the speech wave,” The Journal of the Acoustical Society of America, vol. 50, no. 2B, pp. 637–655, 1971

  15. [15]

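    Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based f0 extraction: Possible role of a repetitive structure in sounds,

    H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, “Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based f0 extraction: Possible role of a repetitive structure in sounds,” Speech Communication, vol. 27, no. 3, pp. 187–207, 1999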

  16. [16]

    WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications,

    M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications,” IEICE Transactions on Information and Systems, vol. 99, pp. 1877–1884, 2016

  17. [17]

    Speaker-dependent wavenet vocoder,

    A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, “Speaker-dependent wavenet vocoder,” in Proc. Interspeech, 2017, pp. 1118–1122

  18. [18]

    An investigation of multi-speaker training for wavenet vocoder,

    T. Hayashi, A. Tamamori, K. Kobayashi, K. Takeda, and T. Toda, “An investigation of multi-speaker training for wavenet vocoder,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Dec 2017, pp. 712–718

  19. [19]

    Efficient Neural Audio Synthesis

    N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. v. d. Oord, S. Dieleman, and K. Kavukcuoglu, “Efficient neural audio synthesis,” arXiv preprint arXiv:1802.08435, 2018

  20. [20]

    SampleRNN: An Unconditional End-to-End Neural Audio Generation Model

    S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio, “Samplernn: An unconditional end-to-end neural audio generation model,” arXiv preprint arXiv:1612.07837, 2016

  21. [21]

    Fftnet: A real-time speaker-dependent neural vocoder,

    Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu, “Fftnet: A real-time speaker-dependent neural vocoder,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 2251–2255

  22. [22]

    Lpcnet: Improving neural speech synthesis through linear prediction,

    J.-M. Valin and J. Skoglund, “Lpcnet: Improving neural speech synthesis through linear prediction,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5891–5895

  23. [23]

    Parallel WaveNet: Fast High-Fidelity Speech Synthesis

    A. v. d. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. v. d. Driessche, E. Lockhart, L. C. Cobo, F. Stimberg et al., “Parallel wavenet: Fast high-fidelity speech synthesis,” arXiv preprint arXiv:1711.10433, 2017

  24. [24]

    Waveglow: A flow-based generative network for speech synthesis,

    R. Prenger, R. Valle, and B. Catanzaro, “Waveglow: A flow-based generative network for speech synthesis,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 3617–3621

  25. [25]

    Neural source-filter-based waveform model for statistical parametric speech synthesis,

    X. Wang, S. Takaki, and J. Yamagishi, “Neural source-filter-based waveform model for statistical parametric speech synthesis,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5916–5920

  26. [26]

    Statistical voice conversion with wavenet-based waveform generation,

    K. Kobayashi, T. Hayashi, A. Tamamori, and T. Toda, “Statistical voice conversion with wavenet-based waveform generation,” in Proc. Interspeech, 2017, pp. 1138–1142

  27. [27]

    High-quality voice conversion using spectrogram-based wavenet vocoder,

    K. Chen, B. Chen, J. Lai, and K. Yu, “High-quality voice conversion using spectrogram-based wavenet vocoder,” in Proc. Interspeech, 2018, pp. 1993–1997

  28. [28]

    AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss

    K. Qian, Y. Zhang, S. Chang, X. Yang, and M. Hasegawa-Johnson, “Zero-shot voice style transfer with only autoencoder loss,” arXiv preprint arXiv:1905.05879, 2019

  29. [29]

    Sequence-to-sequence acoustic modeling for voice conversion,

    J.-X. Zhang, Z.-H. Ling, L.-J. Liu, Y. Jiang, and L.-R. Dai, “Sequence-to-sequence acoustic modeling for voice conversion,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 27, no. 3, pp. 631–644, 2019

  30. [30]

    Adaptive wavenet vocoder for residual compensation in gan-based voice conversion,

    B. Sisman, M. Zhang, S. Sakti, H. Li, and S. Nakamura, “Adaptive wavenet vocoder for residual compensation in gan-based voice conversion,” in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 282–289

  31. [31]

    The nu-naist voice conversion system for the voice conversion challenge 2016,

    K. Kobayashi, S. Takamichi, S. Nakamura, and T. Toda, “The nu-naist voice conversion system for the voice conversion challenge 2016,” in Interspeech, 2016, pp. 1667–1671

  32. [32]

    The voice conversion challenge 2016,

    T. Toda, L.-H. Chen, D. Saito, F. Villavicencio, M. Wester, Z. Wu, and J. Yamagishi, “The voice conversion challenge 2016,” in Interspeech 2016, 2016, pp. 1632–1636

  33. [33]

    sprocket: Open-source voice conversion software,

    K. Kobayashi and T. Toda, “sprocket: Open-source voice conversion software,” in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 203–210

  34. [34]

    Implementation of f0 transformation for statistical singing voice conversion based on direct waveform modification,

    K. Kobayashi, T. Toda, and S. Nakamura, “Implementation of f0 transformation for statistical singing voice conversion based on direct waveform modification,” 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5670–5674, 2016

  35. [35]

    Voice conversion from non-parallel corpora using variational auto-encoder,

    C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, “Voice conversion from non-parallel corpora using variational auto-encoder,” in Proc. APSIPA ASC, 2016, pp. 1–6

  36. [36]

    F0 transformation techniques for statistical voice conversion with direct waveform modification with spectral differential,

    K. Kobayashi, T. Toda, and S. Nakamura, “F0 transformation techniques for statistical voice conversion with direct waveform modification with spectral differential,” 2016 IEEE Spoken Language Technology Workshop (SLT), pp. 693–700, 2016

  37. [37]

    Ways to implement global variance in statistical speech synthesis,

    H. Silén, E. Helander, J. Nurminen, and M. Gabbouj, “Ways to implement global variance in statistical speech synthesis,” in Proc. Interspeech, 2012, pp. 1436–1439

  38. [38]

    An overlap-add technique based on waveform similarity (wsola) for high quality time-scale modification of speech,

    W. Verhelst and M. Roelands, “An overlap-add technique based on waveform similarity (wsola) for high quality time-scale modification of speech,” in ICASSP, 1993

  39. [39]

    High-frequency regeneration in speech coding systems,

    J. Makhoul and M. Berouti, “High-frequency regeneration in speech coding systems,” in ICASSP ’79. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, April 1979, pp. 428–431

  40. [40]

    Speech enhancement via frequency bandwidth extension using line spectral frequencies,

    S. Chennoukh, A. Gerrits, G. Miet, and R. Sluijter, “Speech enhancement via frequency bandwidth extension using line spectral frequencies,” in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), vol. 1. IEEE, 2001, pp. 665–668

  41. [41]

    Collapsed speech segment detection and suppression for wavenet vocoder,

    Y.-C. Wu, K. Kobayashi, T. Hayashi, P. L. Tobing, and T. Toda, “Collapsed speech segment detection and suppression for wavenet vocoder,” in Proc. Interspeech 2018, 2018, pp. 1988–1992

  42. [42]

    The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods,

    J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, “The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods,” in Proc. Odyssey, 2018, pp. 195–202

  43. [43]

    Investigation of F0 conditioning and Fully Convolutional Networks in Variational Autoencoder based Voice Conversion,

    W.-C. Huang, Y.-C. Wu, C.-C. Lo, P. Lumban Tobing, T. Hayashi, K. Kobayashi, T. Toda, Y. Tsao, and H.-M. Wang, “Investigation of F0 conditioning and Fully Convolutional Networks in Variational Autoencoder based Voice Conversion,” arXiv e-prints, May 2019

  44. [44]

    Mel-generalized cepstral analysis - a unified approach to speech spectral estimation,

    K. Tokuda, T. Kobayashi, T. Masuko, and S. Imai, “Mel-generalized cepstral analysis - a unified approach to speech spectral estimation,” in ICSLP, 1994