
arxiv: 1906.08977 · v1 · pith:I7YW5V6Rnew · submitted 2019-06-21 · 💻 cs.SD · cs.LG · eess.AS

Singing Voice Synthesis Using Deep Autoregressive Neural Networks for Acoustic Modeling

Pith reviewed 2026-05-25 18:36 UTC · model grok-4.3

classification 💻 cs.SD · cs.LG · eess.AS
keywords singing voice synthesis · deep autoregressive networks · acoustic modeling · F0 contours · vibratos · neural networks · speech synthesis

The pith

Deep autoregressive networks capture frame dependencies in singing voice features better than recurrent networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that deep autoregressive neural networks can model the acoustic features of singing voice more accurately than conventional recurrent neural networks. Singing voice has more dynamic movements like vibratos, so autoregressive prediction from previous frames helps capture these local dependencies in F0 and spectral features. An F0 post-processing step aligns predictions with music notes. Experiments on a Chinese singing corpus indicate improved objective and subjective results. A reader would care if this leads to more natural synthesized singing voices.

Core claim

This paper claims that adopting deep autoregressive models for predicting F0 and spectral features in singing voice synthesis allows better description of dependencies among consecutive frames, enabling effective production of F0 contours with vibratos and superior performance over RNN-based methods.

What carries the argument

Deep autoregressive (DAR) models for sequential prediction of acoustic features, including discretized F0 and continuous spectra processed via a prenet with self-attention layers.
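
To make the mechanism concrete, the following is a minimal sketch of discretized-F0 autoregressive generation. It is not the authors' implementation: the log-frequency grid, the bin count, the frequency range, the greedy decoding, and the names `quantize_f0`, `generate_autoregressive`, and `step_fn` are all assumptions chosen for illustration. The page does not state how the paper discretizes F0.

```python
import numpy as np

def quantize_f0(f0_hz, n_bins=256, fmin=50.0, fmax=800.0):
    """Map F0 values in Hz to integer classes on a log-frequency grid.

    Class 0 is reserved for unvoiced frames (f0 <= 0). The grid, bin count,
    and frequency range are illustrative assumptions, not values from the paper.
    """
    f0_hz = np.asarray(f0_hz, dtype=float)
    classes = np.zeros(f0_hz.shape, dtype=int)
    voiced = f0_hz > 0
    log_f0 = np.log(np.clip(f0_hz[voiced], fmin, fmax))
    # Scale log-F0 to [0, 1], then spread it over classes 1 .. n_bins - 1.
    unit = (log_f0 - np.log(fmin)) / (np.log(fmax) - np.log(fmin))
    classes[voiced] = 1 + np.round(unit * (n_bins - 2)).astype(int)
    return classes

def generate_autoregressive(step_fn, conditions, history_len):
    """Generate one class per frame, feeding back the last `history_len` outputs.

    `step_fn(cond, context)` is a hypothetical stand-in for a trained network:
    it takes the frame's conditioning input and the recent output classes and
    returns a vector of scores over classes.
    """
    history = [0] * history_len  # pad the start with the "unvoiced" class
    outputs = []
    for cond in conditions:
        context = history[len(history) - history_len:]
        scores = step_fn(cond, context)
        next_class = int(np.argmax(scores))  # greedy choice; sampling also works
        outputs.append(next_class)
        history.append(next_class)
    return outputs
```

The `history_len` argument corresponds loosely to the history length whose influence the paper analyzes: a longer context lets each prediction depend on more preceding frames.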

If this is right

  • The method produces F0 contours with vibratos effectively.
  • Better objective and subjective performance than RNNs on the Chinese singing corpus.
  • History length in DAR affects F0 modeling performance.
  • The prenet module with self-attention enables handling of continuous spectral features.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Autoregressive modeling could be applied to other time-series audio tasks with local dynamics.
  • The self-attention prenet might improve handling of longer contexts in synthesis models.
  • Results suggest potential for language-specific adaptations in singing synthesis.

Load-bearing premise

That the F0 post-processing strategy resolves inconsistencies between predicted contours and music-note values without introducing audible artifacts or reducing natural expressiveness.

What would settle it

An experiment where the DAR model with post-processing produces F0 contours that, when used in synthesis, show no vibrato or lower subjective scores than RNN models on the same test set.

Figures

Figures reproduced from arXiv: 1906.08977 by Li-Rong Dai, Yang Ai, Yuan-Hao Yi, Zhen-Hua Ling.

Figure 2: The structure of our DAR-based spectral model for SVS.

… post-processed F0 value $\hat{f}_t$ can be calculated as

$$\tilde{f}_t = \frac{1}{2w+1} \sum_{i=t-w}^{t+w} f_i, \tag{1}$$

$$\hat{f}_t = f_t - \tilde{f}_t + f^{(n)}_t, \tag{2}$$

where $2w+1$ represents the window size for moving average, $\tilde{f}_t$ denotes the output of moving average and $f^{(n)}_t$ represents the $t$-th frame of the stair-like F0 contours determined by music notes. The above operations are performed independe…
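
The moving-average correction in the excerpt above can be sketched in a few lines. This is an illustrative reading of Eq. (1) and Eq. (2), not the authors' code: the function name `postprocess_f0`, the edge-replication padding at segment boundaries, and the assumption that both contours are given on the same frame grid and in the same unit are choices made here. The excerpt is truncated before it says how segments are delimited, so the sketch handles a single segment.

```python
import numpy as np

def postprocess_f0(f_pred, f_note, w):
    """Replace the slow trend of a predicted F0 contour with the note contour.

    f_pred : predicted F0 per frame (f_t in the excerpt)
    f_note : stair-like F0 per frame derived from music notes (f_t^(n))
    w      : half window size, so the moving average spans 2w + 1 frames
    """
    f_pred = np.asarray(f_pred, dtype=float)
    f_note = np.asarray(f_note, dtype=float)
    # Assumption: replicate edge values so the average is defined at the ends.
    padded = np.pad(f_pred, w, mode="edge")
    kernel = np.ones(2 * w + 1) / (2 * w + 1)
    f_smooth = np.convolve(padded, kernel, mode="valid")  # Eq. (1)
    # Keep the fast deviations (e.g. vibrato), recentre them on the notes.
    return f_pred - f_smooth + f_note                      # Eq. (2)
```

The subtraction `f_pred - f_smooth` acts as a high-pass filter, which is why the window size matters for the load-bearing premise: a window much shorter than a vibrato period would remove part of the vibrato along with the trend.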
Figure 3: F0 contours generated by different models for a voiced segment in our test set, where “Natural” and “Music Note” are two references.
Figure 4: The subjective preference scores among Baseline, DAR and DAR+P.

3.4. Subjective evaluation. Subjective listening tests were carried out to evaluate the preference scores between the songs synthesized by different methods. 5 songs and 6 utterances in each song were randomly selected from our test set and were synthesized by the three methods listed in …
Original abstract

This paper presents a method of using autoregressive neural networks for the acoustic modeling of singing voice synthesis (SVS). Singing voice differs from speech and it contains more local dynamic movements of acoustic features, e.g., vibratos. Therefore, our method adopts deep autoregressive (DAR) models to predict the F0 and spectral features of singing voice in order to better describe the dependencies among the acoustic features of consecutive frames. For F0 modeling, discretized F0 values are used and the influences of the history length in DAR are analyzed by experiments. An F0 post-processing strategy is also designed to alleviate the inconsistency between the predicted F0 contours and the F0 values determined by music notes. Furthermore, we extend the DAR model to deal with continuous spectral features, and a prenet module with self-attention layers is introduced to process historical frames. Experiments on a Chinese singing voice corpus demonstrate that our method using DARs can produce F0 contours with vibratos effectively, and can achieve better objective and subjective performance than the conventional method using recurrent neural networks (RNNs).
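
The abstract does not describe the prenet's internals. As a rough illustration of what self-attention over historical frames means, here is a single-head scaled dot-product attention in NumPy, following the standard formulation from the cited "Attention is all you need" paper. The function name, the single head, the absence of masking, and the projection shapes are assumptions, and the paper's actual prenet may differ.

```python
import numpy as np

def self_attention(history, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over history frames.

    history       : array of shape (n_frames, feat_dim), past spectral frames
    w_q, w_k, w_v : projection matrices of shape (feat_dim, attn_dim)
    Returns the attended frames and the attention weights.
    """
    q, k, v = history @ w_q, history @ w_k, history @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Softmax over the key axis, shifted by the row maximum for stability.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

Each output row is a weighted mix of all history frames, so the summary passed on to the recurrent layer can emphasize whichever past frames are most relevant rather than only the most recent one.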

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes deep autoregressive (DAR) neural networks for acoustic modeling in singing voice synthesis (SVS) to capture local dynamic movements such as vibratos in F0 and spectral features, using discretized F0 values, a history-length analysis, an F0 post-processing step for music-note consistency, and a prenet with self-attention for continuous spectra. Experiments on a Chinese singing voice corpus are reported to show that DAR produces effective F0 contours with vibratos and yields better objective and subjective results than RNN baselines.

Significance. If the performance gains are robust and attributable to the DAR acoustic model rather than post-processing, the work would provide a concrete advance in modeling expressive temporal dependencies in SVS, a domain where local dynamics differ markedly from speech. The explicit handling of discretized F0 and the prenet extension are practical contributions that could be adopted in other autoregressive synthesis pipelines.

major comments (2)
  1. [Abstract] Abstract (paragraph on F0 modeling): The central claim that DAR produces F0 contours with vibratos effectively and outperforms RNNs rests on the untested assumption that the F0 post-processing strategy resolves note inconsistencies without flattening vibrato or introducing audible artifacts; no ablation (with/without post-processing) on vibrato-specific metrics or isolated listening-test conditions is described, leaving the attribution of gains to the DAR model itself unverified.
  2. [Abstract] Abstract (comparative experiments): The reported superiority in objective and subjective performance is stated at a high level without dataset size, number of utterances, training/validation splits, error bars, or statistical significance tests, so the robustness of the cross-model comparison cannot be assessed from the given evidence.
minor comments (1)
  1. Notation for the history length parameter in the DAR F0 experiments should be defined once and used consistently when reporting the influence analysis.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and indicate the revisions we plan to make.

Point-by-point responses
  1. Referee: [Abstract] Abstract (paragraph on F0 modeling): The central claim that DAR produces F0 contours with vibratos effectively and outperforms RNNs rests on the untested assumption that the F0 post-processing strategy resolves note inconsistencies without flattening vibrato or introducing audible artifacts; no ablation (with/without post-processing) on vibrato-specific metrics or isolated listening-test conditions is described, leaving the attribution of gains to the DAR model itself unverified.

    Authors: We agree that the manuscript would benefit from an explicit ablation isolating the contribution of the F0 post-processing step. The current work describes the post-processing as a means to enforce music-note consistency while aiming to retain local dynamics from the DAR model, but does not report a dedicated ablation on vibrato metrics or focused listening conditions. In the revision we will add such an ablation, including objective vibrato rate/extent measures and subjective tests isolating vibrato quality, to strengthen attribution of gains to the DAR acoustic model. revision: yes

  2. Referee: [Abstract] Abstract (comparative experiments): The reported superiority in objective and subjective performance is stated at a high level without dataset size, number of utterances, training/validation splits, error bars, or statistical significance tests, so the robustness of the cross-model comparison cannot be assessed from the given evidence.

    Authors: The full manuscript describes the Chinese singing voice corpus and the experimental protocol, but the abstract and results presentation are indeed high-level. We will revise both the abstract and the experimental results section to report the corpus size, number of utterances, training/validation/test splits, error bars on objective metrics, and statistical significance tests on the subjective evaluations. revision: yes
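
The first response mentions objective vibrato rate and extent measures without defining them. One simple way such measures could be computed from a voiced F0 segment is sketched below. This is an assumption-laden illustration, not a metric taken from the paper or the rebuttal: the 5 ms frame shift, the 3 to 9 Hz search band, the FFT-peak rate estimate, and the sinusoid-based extent estimate are all choices made here.

```python
import numpy as np

def vibrato_rate_extent(f0_hz, frame_shift_s=0.005, band_hz=(3.0, 9.0)):
    """Estimate vibrato rate (Hz) and extent (cents) of a voiced F0 segment.

    Rate is the strongest spectral peak of the mean-removed contour inside
    `band_hz`. Extent is the amplitude of a sinusoid with the same variance.
    Assumes all frames are voiced (f0 > 0) and the pitch trend is flat.
    """
    f0_hz = np.asarray(f0_hz, dtype=float)
    cents = 1200.0 * np.log2(f0_hz / np.mean(f0_hz))
    cents = cents - np.mean(cents)
    spectrum = np.abs(np.fft.rfft(cents * np.hanning(len(cents))))
    freqs = np.fft.rfftfreq(len(cents), d=frame_shift_s)
    in_band = (freqs >= band_hz[0]) & (freqs <= band_hz[1])
    rate_hz = freqs[in_band][np.argmax(spectrum[in_band])]
    extent_cents = np.sqrt(2.0) * np.std(cents)
    return rate_hz, extent_cents
```

Comparing these two numbers for the same segment with and without post-processing would give a direct check on whether the correction flattens vibrato, which is the concern raised in the first major comment.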

Circularity Check

0 steps flagged

No circularity: empirical training and evaluation on external corpus

Full rationale

The paper's central claims rest on training DAR acoustic models on a Chinese singing voice corpus and reporting objective/subjective improvements over RNN baselines, plus qualitative observation of vibrato in F0 contours. No equations, uniqueness theorems, or fitted parameters are presented that reduce by construction to the inputs; the F0 post-processing step is a heuristic applied after prediction and its impact is assessed via the same external data rather than assumed tautologically. No self-citations are invoked as load-bearing premises. The derivation chain is therefore self-contained experimental work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no new free parameters, axioms, or invented entities beyond standard neural-network training assumptions; all modeling choices are presented as empirical adaptations of existing autoregressive architectures.

axioms (1)
  • domain assumption: Standard neural-network optimization assumptions hold for acoustic feature prediction on singing data.
    Implicit in the training of DAR models and the comparison to RNN baselines.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

    eess.AS 2024-06 unverdicted novelty 6.0

    Seed-TTS models produce speech matching human naturalness and speaker similarity, with added controllability via self-distillation and reinforcement learning.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  1. [1]

    Some song synthesizers have been developed based on the unit selection speech synthesis approach [1, 2]

    Introduction Singing voice synthesis (SVS) converts lyrics and musical score information (e.g., tempo, pitch, etc.) into songs, which differs from traditional text-to-speech (TTS) synthesis. Some song synthesizers have been developed based on the unit selection speech synthesis approach [1, 2]. Although this approach can achieve high sound quality, it rel...

  2. [2]

    Singing Voice Synthesis Using Deep Autoregressive Neural Networks for Acoustic Modeling

    DAR-based Singing Voice Synthesis 2.1. Basic DAR models The autoregressive (AR) dependency has been widely studied for many signal modeling and generation tasks. Deep autoregressive (DAR) models [13] follow the idea of feeding the target data of previous frames as additional input to a uni-directional recurrent layer [15]. Assume that o_t stands for the...

  3. [3]

    Natural” and “Music Note

    Experiments 3.1. Experimental conditions A Chinese singing voice corpus was adopted in our experiments. This corpus contained 3290 utterances (100 songs, about 220 minutes) without background music from a male singer. The recordings were sampled at 16 kHz with 16-bit quantization. This dataset was separated into a training set with 2976 utterances (91 son...

  4. [4]

    The DAR-based models also achieved lower V/UV error and MCD than the baseline model

    From this table, we can see that the DAR-based F0 model achieved lower F0 RMSE and higher correlation coefficient than the baseline model when either natural F0 contours or the F0 contours determined by music notes were used as references. The DAR-based models also achieved lower V/UV error and MCD than the baseline model. After applying the F0 post-proce...

  5. [5]

    Conclusions In this paper, we have presented a method of using deep autoregressive (DAR) neural networks to model F0s and spectral features in singing voice synthesis (SVS). For F0 modeling, discretized F0 values are used and a moving average-based F0 post-processing strategy is designed to alleviate the inconsistency between the predicted F0 contours...

  6. [6]

    Singing-voice synthesis using demi-syllable unit selection,

    H.-Y. Gu and J.-K. He, “Singing-voice synthesis using demi-syllable unit selection,” in 2016 International Conference on Machine Learning and Cybernetics (ICMLC), vol. 2. IEEE, 2016, pp. 654–659

  7. [7]

    Expressive singing synthesis based on unit selection for the singing synthesis challenge 2016

    J. Bonada, M. Umbert, and M. Blaauw, “Expressive singing synthesis based on unit selection for the singing synthesis challenge 2016.” in INTERSPEECH, 2016, pp. 1230–1234

  8. [8]

    An HMM-based singing voice synthesis system,

    K. Saino, H. Zen, Y. Nankaku, A. Lee, and K. Tokuda, “An HMM-based singing voice synthesis system,” in Ninth International Conference on Spoken Language Processing, 2006

  9. [9]

    Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis,

    H. Zen and A. Senior, “Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis,” in 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2014, pp. 3844–3848

  10. [10]

    Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis,

    H. Zen and H. Sak, “Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 4470–4474

  11. [11]

    Tacotron: Towards End-to-End Speech Synthesis

    Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al., “Tacotron: A fully end-to-end text-to-speech synthesis model,” arXiv preprint arXiv:1703.10135, 2017

  12. [12]

    WaveNet: A Generative Model for Raw Audio

    A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016

  13. [13]

    DNN-based spectral enhancement for neural waveform generators with low-bit quantization,

    Y. Ai, J.-X. Zhang, L. Chen, and Z.-H. Ling, “DNN-based spectral enhancement for neural waveform generators with low-bit quantization,” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019

  14. [14]

    Singing voice synthesis based on deep neural networks

    M. Nishimura, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, “Singing voice synthesis based on deep neural networks.” in Interspeech, 2016, pp. 2478–2482

  15. [15]

    Korean singing voice synthesis system based on an LSTM recurrent neural network,

    J. Kim, H. Choi, J. Park, S. Kim, J. Kim, and M. Hahn, “Korean singing voice synthesis system based on an LSTM recurrent neural network,” in INTERSPEECH 2018. International Speech Communication Association, 2018

  16. [16]

    A Neural Parametric Singing Synthesizer

    M. Blaauw and J. Bonada, “A neural parametric singing synthesizer,” arXiv preprint arXiv:1704.03809, 2017

  17. [17]

    Extraction of F0 dynamic characteristics and development of F0 control model in singing voice

    T. Saitou, M. Unoki, and M. Akagi, “Extraction of F0 dynamic characteristics and development of F0 control model in singing voice.” Georgia Institute of Technology, 2002

  18. [18]

    Autoregressive neural F0 model for statistical parametric speech synthesis,

    X. Wang, S. Takaki, and J. Yamagishi, “Autoregressive neural F0 model for statistical parametric speech synthesis,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 8, pp. 1406–1419, 2018

  19. [19]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008

  20. [20]

    Better generative models for sequential data problems: Bidirectional recurrent mixture density networks,

    M. Schuster, “Better generative models for sequential data problems: Bidirectional recurrent mixture density networks,” in Advances in Neural Information Processing Systems, 2000, pp. 589–595

  21. [21]

    Hierarchical probabilistic neural network language model

    F. Morin and Y. Bengio, “Hierarchical probabilistic neural network language model.” in Aistats, vol. 5. Citeseer, 2005, pp. 246–252

  22. [22]

    Development of an F0 control model based on F0 dynamic characteristics for singing-voice synthesis,

    T. Saitou, M. Unoki, and M. Akagi, “Development of an F0 control model based on F0 dynamic characteristics for singing-voice synthesis,” Speech communication, vol. 46, no. 3-4, pp. 405–417, 2005

  23. [23]

    Dropout: a simple way to prevent neural networks from overfitting,

    N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014

  24. [24]

    Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

    S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015

  25. [25]

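    Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based f0 extraction: Possible role of a repetitive structure in sounds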

    H. Kawahara, I. Masuda-Katsuse, and A. De Cheveigne, “Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based f0 extraction: Possible role of a repetitive structure in sounds,” Speech communication, vol. 27, no. 3-4, pp. 187–207, 1999

  26. [26]

    Deterministic annealing EM algorithm,

    N. Ueda and R. Nakano, “Deterministic annealing EM algorithm,” Neural networks, vol. 11, no. 2, pp. 271–282, 1998

  27. [27]

    Efficient Neural Audio Synthesis

    N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. v. d. Oord, S. Dieleman, and K. Kavukcuoglu, “Efficient neural audio synthesis,” arXiv preprint arXiv:1802.08435, 2018

  28. [28]

    Adam: A Method for Stochastic Optimization

    D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014