pith. machine review for the scientific record. sign in

arxiv: 2512.20211 · v2 · submitted 2025-12-23 · 💻 cs.SD · eess.AS· eess.SP

Recognition: no theorem link

Aliasing-Free Neural Audio Synthesis

Authors on Pith no claims yet

Pith reviewed 2026-05-16 20:32 UTC · model grok-4.3

classification 💻 cs.SD eess.ASeess.SP
keywords neural vocoderanti-aliasingaudio synthesissinging voice synthesismusic generationwaveform reconstructionneural codec
0
0 comments X

The pith

Differentiable anti-aliasing modules in neural vocoders and codecs remove artifacts to boost music and singing synthesis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows how to integrate anti-aliasing techniques from signal processing into neural audio models to prevent artifacts from non-linear functions and upsampling. The authors create Pupu-Vocoder and Pupu-Codec by adding these differentiable modules, and test them on speech, singing, music, and audio. Results indicate better performance than prior systems for singing voice, music, and audio, with similar results for speech. A benchmark for anti-aliased modules helps validate the approach.

Core claim

Pupu-Vocoder and Pupu-Codec incorporate differentiable anti-aliasing techniques into activation and upsampling modules, outperforming existing systems on singing voice, music, and audio while achieving comparable performance on speech.

What carries the argument

Differentiable anti-aliasing modules placed in activation functions and upsampling layers to suppress aliasing artifacts.

If this is right

  • Improved waveform reconstruction quality for complex audio signals like music and singing.
  • Comparable speech synthesis performance maintains usability for voice applications.
  • New benchmark enables systematic evaluation of aliasing in neural audio models.
  • The approach bridges digital signal processing and neural network design for audio.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These modules could be adapted to other neural audio generators such as autoregressive or diffusion models.
  • Future work might explore combining this with higher sampling rates for even better fidelity.
  • Training stability with these insertions suggests they could be standard in future vocoder designs.

Load-bearing premise

Adding the differentiable anti-aliasing modules does not introduce training instabilities or quality trade-offs that cancel out the benefits.

What would settle it

Measure aliasing levels and perceptual quality scores on music and singing tasks after removing the anti-aliasing modules from the Pupu models; if quality drops significantly, the claim holds.

Figures

Figures reproduced from arXiv: 2512.20211 by Chaoren Wang, Jerry Li, Junan Zhang, Lauri Juvela, Yicheng Gu, Zhizheng Wu.

Figure 1
Figure 1. Figure 1: Illustration of different aliasing artifacts brought by the activation functions and upsampling layers. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The main idea of our proposed anti-aliased activation function and upsampling layer. “ [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Illustration of the equivalent filter frequency responses [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Architecture and training schemes of the proposed models. The Pupu-Codec consists of an encoder, a residual vector [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Spectrogram visualization with a zoomed-in view of high-frequency harmonic components (around 16 kHz) regarding [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Anti-aliasing case study by passing a sine sweep [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
read the original abstract

In neural audio synthesis, neural vocoders and codecs are models that reconstruct waveforms from acoustic and latent representations, which are essential to the resulting audio quality. While current models are capable of generating perceptually natural speech, they still struggle with high-fidelity music and singing voice synthesis, as severe aliasing artifacts are introduced by non-linear activation functions and upsampling layers in existing architectures. Although various anti-aliasing techniques have been proposed in digital signal processing, their integration into neural vocoders and codecs remains under-explored. This paper incorporates differentiable anti-aliasing techniques into the activation and upsampling modules to bridge this gap, and thus presents Pupu-Vocoder and Pupu-Codec. We build a test signal benchmark to evaluate the anti-aliased modules, and validate our proposed models on speech, singing voice, music, and audio. Experimental results show that Pupu-Vocoder and Pupu-Codec outperform existing systems on singing voice, music, and audio, while achieving comparable performance on speech. Demos, codes, and checkpoints are available at VocodexElysium.github.io/AliasingFreeNeuralAudioSynthesis/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that inserting differentiable anti-aliasing modules into the activation functions and upsampling layers of standard neural vocoder and codec architectures yields Pupu-Vocoder and Pupu-Codec, which outperform prior systems on singing voice, music, and general audio while remaining comparable on speech; a dedicated test-signal benchmark is introduced to isolate the anti-aliasing benefit, and code/checkpoints are released.

Significance. If the gains hold, the work supplies a parameter-free architectural fix for a known source of artifacts in high-fidelity neural audio, directly improving domains (music, singing) where current models remain weakest. The explicit differentiability, released artifacts, and new benchmark constitute concrete strengths that support reproducibility and follow-on work.

major comments (2)
  1. [Experiments] The central performance claims rest on comparisons whose statistical significance is not reported; without p-values or confidence intervals on the listening-test or objective scores, it is impossible to determine whether the reported outperformance on music and singing is robust or could be explained by variance.
  2. [Method] Exact specifications of the differentiable anti-aliasing filters (cutoff, order, implementation inside the activation and upsampling blocks) are not provided; because these choices directly determine the aliasing suppression that is credited for the gains, the result cannot be reproduced from the text alone.
minor comments (2)
  1. [Abstract] The abstract states that the modules are 'parameter-free' yet does not explicitly confirm that no additional learnable parameters are introduced by the filter implementations.
  2. [Figures] Figure captions and axis labels in the test-signal benchmark plots should state the exact frequency range and sampling rate used so that readers can interpret the aliasing reduction quantitatively.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and the recommendation for minor revision. We are pleased that the significance of the work is recognized. Below we provide point-by-point responses to the major comments.

read point-by-point responses
  1. Referee: The central performance claims rest on comparisons whose statistical significance is not reported; without p-values or confidence intervals on the listening-test or objective scores, it is impossible to determine whether the reported outperformance on music and singing is robust or could be explained by variance.

    Authors: We acknowledge that statistical significance measures were not included in the original manuscript. In the revision, we will add 95% confidence intervals for all objective scores (computed over multiple random seeds) and report standard deviations across listeners for the subjective tests. This will be incorporated into the results tables and text. revision: yes

  2. Referee: Exact specifications of the differentiable anti-aliasing filters (cutoff, order, implementation inside the activation and upsampling blocks) are not provided; because these choices directly determine the aliasing suppression that is credited for the gains, the result cannot be reproduced from the text alone.

    Authors: We agree that the exact filter specifications are essential for reproducibility. In the revised manuscript, we will expand the method section with the precise cutoff frequencies, filter orders, and implementation details for the anti-aliasing modules within both the activation functions and upsampling layers, including the differentiable formulation used. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on external benchmarks and released artifacts

full rationale

The paper's central contribution is an architectural modification—inserting differentiable anti-aliasing modules into activations and upsampling layers—validated through a dedicated test-signal benchmark and comparative listening tests on speech, singing, music, and audio. Performance gains are reported against prior published systems rather than quantities defined by internal fitted parameters. No self-definitional equations, fitted-input predictions, or load-bearing self-citation chains appear; the argument is self-contained against external, reproducible test conditions and released code/checkpoints.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach relies on standard neural-network training assumptions and classical DSP anti-aliasing principles; no new free parameters or invented entities are introduced beyond typical model hyperparameters.

axioms (1)
  • domain assumption Anti-aliasing operations can be formulated as differentiable functions suitable for gradient-based training.
    Invoked when the authors state that anti-aliasing techniques are incorporated into activation and upsampling modules.

pith-pipeline@v0.9.0 · 5510 in / 1207 out tokens · 33332 ms · 2026-05-16T20:32:07.708294+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

102 extracted references · 102 canonical work pages · 2 internal anchors

  1. [1]

    MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer,

    Y . Wanget al., “MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer,” inProc. ICLR, 2025

  2. [2]

    Vevo: Controllable Zero-Shot V oice Imitation with Self-Supervised Disentanglement,

    X. Zhanget al., “Vevo: Controllable Zero-Shot V oice Imitation with Self-Supervised Disentanglement,” in Proc. ICLR, 2025

  3. [3]

    Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

    C. Wanget al., “Neural Codec Language Mod- els are Zero-Shot Text to Speech Synthesizers,” arXiv:2301.02111, 2023

  4. [4]

    Efficient Neural Audio Synthe- sis,

    N. Kalchbrenneret al., “Efficient Neural Audio Synthe- sis,” inProc. ICML, 2018, pp. 2415–2424

  5. [5]

    LPCNet: Improving Neural Speech Synthesis through Linear Prediction,

    J. Valin and J. Skoglund, “LPCNet: Improving Neural Speech Synthesis through Linear Prediction,” inProc. Int. Conf. Acoust. Speech Signal Process., 2019, pp. 5891–5895

  6. [6]

    WaveNet: A Generative Model for Raw Audio,

    A. van den Oordet al., “WaveNet: A Generative Model for Raw Audio,” inProc. ISCA Workshop Speech Synth., 2016, p. 125

  7. [7]

    WaveFM: A High- Fidelity and Efficient V ocoder Based on Flow Match- ing,

    T. Luo, X. Miao, and W. Duan, “WaveFM: A High- Fidelity and Efficient V ocoder Based on Flow Match- ing,” inProc. ACL, 2025, pp. 2187–2198

  8. [8]

    WaveFlow: A Compact Flow-based Model for Raw Audio,

    W. Ping, K. Peng, K. Zhao, and Z. Song, “WaveFlow: A Compact Flow-based Model for Raw Audio,” inProc. ICML, vol. 119, 2020, pp. 7706–7716

  9. [9]

    Waveglow: A Flow-based Generative Network for Speech Synthesis,

    R. Prenger, R. Valle, and B. Catanzaro, “Waveglow: A Flow-based Generative Network for Speech Synthesis,” inProc. Int. Conf. Acoust. Speech Signal Process., 2019, pp. 3617–3621

  10. [10]

    WaveGrad: Estimating Gradients for Waveform Generation,

    N. Chen, Y . Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan, “WaveGrad: Estimating Gradients for Waveform Generation,” inProc. ICLR, 2021

  11. [11]

    DiffWave: A Versatile Diffusion Model for Audio Synthesis,

    Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro, “DiffWave: A Versatile Diffusion Model for Audio Synthesis,” inProc. ICLR, 2021

  12. [12]

    Fregrad: Lightweight and Fast Frequency- Aware Diffusion V ocoder,

    T. D. Nguyen, J.-H. Kim, Y . Jang, J. Kim, and J. S. Chung, “Fregrad: Lightweight and Fast Frequency- Aware Diffusion V ocoder,” inProc. Int. Conf. Acoust. Speech Signal Process., 2024, pp. 10 736–10 740

  13. [13]

    Ultra-Lightweight Neural Differential DSP V ocoder for High Quality Speech Synthesis,

    P. Agrawal, T. K ¨ohler, Z. Xiu, P. Serai, and Q. He, “Ultra-Lightweight Neural Differential DSP V ocoder for High Quality Speech Synthesis,” inProc. Int. Conf. Acoust. Speech Signal Process., 2024, pp. 10 066– 10 070

  14. [14]

    Glot- Net - A Raw Waveform Model for the Glottal Excitation in Statistical Parametric Speech Synthesis,

    L. Juvela, B. Bollepalli, V . Tsiaras, and P. Alku, “Glot- Net - A Raw Waveform Model for the Glottal Excitation in Statistical Parametric Speech Synthesis,”IEEE/ACM Trans. Audio Speech Lang. Process., vol. 27, pp. 1019– 1030, 2019

  15. [15]

    DDSP-based Singing V ocoders: A New Subtractive-based Synthesizer and A Comprehensive Evaluation,

    D. Wuet al., “DDSP-based Singing V ocoders: A New Subtractive-based Synthesizer and A Comprehensive Evaluation,” inProc. ISMIR, 2022, pp. 76–83

  16. [16]

    BigVGAN: A Universal Neural V ocoder with Large-Scale Training,

    S. Lee, W. Ping, B. Ginsburg, B. Catanzaro, and S. Yoon, “BigVGAN: A Universal Neural V ocoder with Large-Scale Training,” inProc. ICLR, 2023

  17. [17]

    HiFi-GAN: High- Fidelity Denoising and Dereverberation Based on Speech Deep Features in Adversarial Networks,

    J. Su, Z. Jin, and A. Finkelstein, “HiFi-GAN: High- Fidelity Denoising and Dereverberation Based on Speech Deep Features in Adversarial Networks,” in Proc. Interspeech, 2020, pp. 4506–4510

  18. [18]

    EV A-GAN: En- hanced Various Audio Generation via Scalable Gener- ative Adversarial Networks,

    S. Liao, S. Lan, and A. G. Zachariah, “EV A-GAN: En- hanced Various Audio Generation via Scalable Gener- ative Adversarial Networks,”arXiv:2402.00892, 2024

  19. [19]

    Phase V ocoder,

    J. L. Flanagan and R. M. Golden, “Phase V ocoder,”Bell system technical Journal, vol. 45, pp. 1493–1509, 1966

  20. [20]

    STRAIGHT, exploitation of the other as- pect of VOCODER: Perceptually isomorphic decompo- sition of speech sounds,

    H. Kawahara, “STRAIGHT, exploitation of the other as- pect of VOCODER: Perceptually isomorphic decompo- sition of speech sounds,”Acoust. Sci. Technol., vol. 27, pp. 349–353, 2006

  21. [21]

    WORLD: A V ocoder-Based High-Quality Speech Synthesis System for Real-Time Applications,

    M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A V ocoder-Based High-Quality Speech Synthesis System for Real-Time Applications,”IEICE Trans. Inf. Syst., vol. 99, pp. 1877–1884, 2016

  22. [22]

    High Fidelity Neural Audio Compression,

    A. D ´efossez, J. Copet, G. Synnaeve, and Y . Adi, “High Fidelity Neural Audio Compression,”Trans. Mach. Learn. Res., vol. 2023, 2023

  23. [23]

    High-Fidelity Audio Compression with Improved RVQGAN,

    R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar, “High-Fidelity Audio Compression with Improved RVQGAN,” inProc. NeurIPS, 2023

  24. [24]

    Speechtokenizer: Unified Speech Tokenizer for Speech Large Language Models,

    X. Zhang, D. Zhang, S. Li, Y . Zhou, and X. Qiu, “Speechtokenizer: Unified Speech Tokenizer for Speech Large Language Models,”Proc. ICLR, 2023

  25. [25]

    Finite Scalar Quantization: VQ-V AE Made Sim- ple,

    F. Mentzer, D. Minnen, E. Agustsson, and M. Tschan- nen, “Finite Scalar Quantization: VQ-V AE Made Sim- ple,” inProc. ICLR, 2024

  26. [26]

    Neural Discrete Representation Learning,

    A. Van Den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural Discrete Representation Learning,”Proc. NeurIPS, 2017

  27. [27]

    SoundStream: An End-to-End Neural Audio Codec,

    N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “SoundStream: An End-to-End Neural Audio Codec,”IEEE/ACM Trans. Audio Speech Lang. Process., vol. 30, pp. 495–507, 2022

  28. [28]

    A Fourier Explanation of AI-music Artifacts,

    D. Afchar, G. Meseguer-Brocal, K. Akesbi, and R. Hen- nequin, “A Fourier Explanation of AI-music Artifacts,” inProc. ISMIR, 2025

  29. [29]

    Upsam- pling Artifacts in Neural Audio Synthesis,

    J. Pons, S. Pascual, G. Cengarle, and J. Serr `a, “Upsam- pling Artifacts in Neural Audio Synthesis,” inProc. Int. Conf. Acoust. Speech Signal Process., 2021, pp. 3005– 3009

  30. [30]

    Analysis and Solution to Aliasing Artifacts in Neural Waveform Generation Models,

    Z. Shang, H. Zhang, P. Zhang, L. Wang, and T. Li, “Analysis and Solution to Aliasing Artifacts in Neural Waveform Generation Models,”Applied Acoustics, vol. 203, p. 109183, 2023

  31. [31]

    Wavehax: Aliasing-Free Neural Waveform Synthesis Based on 2D Convolution and Harmonic Prior for Reliable Complex Spectrogram Estimation,

    R. Yoneyama, A. Miyashita, R. Yamamoto, and T. Toda, “Wavehax: Aliasing-Free Neural Waveform Synthesis Based on 2D Convolution and Harmonic Prior for Reliable Complex Spectrogram Estimation,”IEEE/ACM Trans. Audio Speech Lang. Process., vol. 33, 2025

  32. [32]

    STFTCodec: High-Fidelity Audio Com- pression through Time-Frequency Domain Representa- tion,

    T. Fenget al., “STFTCodec: High-Fidelity Audio Com- pression through Time-Frequency Domain Representa- tion,” inProc. IEEE Int. Conf. Multimed. Expo, 2025

  33. [33]

    V ocos: Closing the gap between time- domain and Fourier-based neural vocoders for high- quality audio synthesis,

    H. Siuzdak, “V ocos: Closing the gap between time- domain and Fourier-based neural vocoders for high- quality audio synthesis,” inProc. ICLR, 2024

  34. [34]

    Antiderivative Antialiasing for Memoryless Nonlinear- 12 ities,

    S. Bilbao, F. Esqueda, J. D. Parker, and V . V ¨alim¨aki, “Antiderivative Antialiasing for Memoryless Nonlinear- 12 ities,”IEEE Signal Process. Lett., vol. 24, pp. 1049– 1053, 2017

  35. [35]

    Re- ducing the Aliasing of Nonlinear Waveshaping using Continuous-Time Convolution,

    J. D. Parker, V . Zavalishin, and E. Le Bivic, “Re- ducing the Aliasing of Nonlinear Waveshaping using Continuous-Time Convolution,” inProc. Int. Conf. Dig- ital Audio Effects, 2016, pp. 137–144

  36. [36]

    Antideriva- tive antialiasing techniques in nonlinear wave digital structures,

    D. Albertini, A. Bernardini, A. Sartiet al., “Antideriva- tive antialiasing techniques in nonlinear wave digital structures,” inJAES, 2021

  37. [37]

    Antiderivative antialiasing for stateful sys- tems,

    M. Holters, “Antiderivative antialiasing for stateful sys- tems,”Applied Sciences, vol. 10, p. 20, 2019

  38. [38]

    Arbitrary-order IIR Antiderivative Antialiasing,

    P. P. La Pastina, S. D’Angelo, and L. Gabrielli, “Arbitrary-order IIR Antiderivative Antialiasing,” in Proc. Int. Conf. Digital Audio Effects, 2021, pp. 9–16

  39. [39]

    Antiderivative Antialising for Recurrent Neural Networks,

    M. Otto and J. W. Kurt, “Antiderivative Antialising for Recurrent Neural Networks,” inProc. Int. Conf. Digital Audio Effects, 2025

  40. [40]

    Certain Topics in Telegraph Transmission Theory,

    H. Nyquist, “Certain Topics in Telegraph Transmission Theory,”Trans. AIEE, vol. 47, pp. 617–644, 2009

  41. [41]

    Alias-Free Generative Adversarial Networks,

    T. Karraset al., “Alias-Free Generative Adversarial Networks,”Proc. NeurIPS, 2021

  42. [42]

    Discrete-Time Mod- els for Nonlinear Audio Systems,

    J. Schattschneider and U. Z ¨olzer, “Discrete-Time Mod- els for Nonlinear Audio Systems,” inProc. Int. Conf. Digital Audio Effects, 1999, pp. 45–48

  43. [43]

    Distortion of Mu- sical Signals by means of Multiband Waveshaping,

    P. Fern ´andez-Cid and J. C. Quir ´os, “Distortion of Mu- sical Signals by means of Multiband Waveshaping,”J. New Music Res., vol. 30, pp. 279–287, 2001

  44. [44]

    Neural Networks Fail to Learn Periodic Functions and How to Fix It,

    L. Ziyin, T. Hartwig, and M. Ueda, “Neural Networks Fail to Learn Periodic Functions and How to Fix It,” Proc. NeurIPS, 2020

  45. [45]

    An Investigation of Time-Frequency Representation Dis- criminators for High-Fidelity V ocoders,

    Y . Gu, X. Zhang, L. Xue, H. Li, and Z. Wu, “An Investigation of Time-Frequency Representation Dis- criminators for High-Fidelity V ocoders,”IEEE/ACM Trans. Audio Speech Lang. Process., vol. 32, pp. 4569– 4579, 2024

  46. [46]

    Multi-Scale Sub- Band Constant-Q Transform Discriminator for High- Fidelity V ocoder,

    Y . Gu, X. Zhang, L. Xue, and Z. Wu, “Multi-Scale Sub- Band Constant-Q Transform Discriminator for High- Fidelity V ocoder,” inProc. Int. Conf. Acoust. Speech Signal Process., 2024, pp. 10 616–10 620

  47. [47]

    Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Gener- ation,

    H. Heet al., “Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Gener- ation,” inProc. IEEE Spoken Lang. Technol. Workshop, 2024

  48. [48]

    Emilia: A Large-Scale, Extensive, Multilin- gual, and Diverse Dataset for Speech Generation,

    ——, “Emilia: A Large-Scale, Extensive, Multilin- gual, and Diverse Dataset for Speech Generation,” IEEE/ACM Trans. Audio Speech Lang. Process., 2025

  49. [49]

    Foleycrafter: Bring silent videos to life with lifelike and synchronized sounds

    Y . Zhanget al., “FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds,” arXiv:2407.01494, 2024

  50. [50]

    Solid State Bus-Comp: A Large-Scale and Diverse Dataset for Dynamic Range Compressor Virtual Analog Modeling,

    Y . Gu, R. Zhang, L. Juvela, and Z. Wu, “Solid State Bus-Comp: A Large-Scale and Diverse Dataset for Dynamic Range Compressor Virtual Analog Modeling,” inProc. Int. Conf. Digital Audio Effects, 2025

  51. [51]

    Neurodyne: Neural Pitch Manipulation with Representation Learn- ing and Cycle-Consistency GAN,

    Y . Gu, C. Wang, Z. Wu, and L. Juvela, “Neurodyne: Neural Pitch Manipulation with Representation Learn- ing and Cycle-Consistency GAN,” inProc. Interspeech, 2025, pp. 1253–1257

  52. [52]

    G. J. Mysore, “Can we Automatically Transform Speech Recorded on Common Consumer Devices in Real-World Environments into Professional Production Quality Speech?—A Dataset, Insights, and Challenges,” IEEE Signal Process. Lett., vol. 22, pp. 1006–1010, 2014

  53. [53]

    A Step-by-Step Process for Build- ing TTS V oices Using Open Source Data and Frame- works for Bangla, Javanese, Khmer, Nepali, Sinhala, and Sundanese

    K. Sodimanaet al., “A Step-by-Step Process for Build- ing TTS V oices Using Open Source Data and Frame- works for Bangla, Javanese, Khmer, Nepali, Sinhala, and Sundanese.” inProc. SLTU, 2018, pp. 66–70

  54. [54]

    AISHELL- 3: A Multi-speaker Mandarin TTS Corpus and the Baselines,

    Y . Shi, H. Bu, X. Xu, S. Zhang, and M. Li, “AISHELL- 3: A Multi-speaker Mandarin TTS Corpus and the Baselines,”arXiv:2010.11567, 2020

  55. [55]

    Hi-Fi Multi-Speaker English TTS Dataset,

    E. Bakhturina, V . Lavrukhin, B. Ginsburg, and Y . Zhang, “Hi-Fi Multi-Speaker English TTS Dataset,” inProc. Interspeech, 2021, pp. 2776–2780

  56. [56]

    HUI-Audio-Corpus- German: A High Quality TTS Dataset,

    P. Puchtler, J. Wirth, and R. Peinl, “HUI-Audio-Corpus- German: A High Quality TTS Dataset,” inProc. KI, vol. 12873, 2021, p. 204

  57. [57]

    CSTR VCTK corpus: English Multi-speaker Corpus for CSTR V oice Cloning Toolkit (version 0.92),

    C. Veaux, J. Yamagishi, K. MacDonaldet al., “CSTR VCTK corpus: English Multi-speaker Corpus for CSTR V oice Cloning Toolkit (version 0.92),”CSTR, 2017

  58. [58]

    BibleTTS: a large, high-fidelity, multi- lingual, and uniquely African speech corpus,

    J. Meyeret al., “BibleTTS: a large, high-fidelity, multi- lingual, and uniquely African speech corpus,” inProc. Interspeech, 2022, pp. 2383–2387

  59. [59]

    EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation,

    J. Richteret al., “EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation,” inProc. Interspeech, 2024, pp. 4873– 4877

  60. [60]

    ManaTTS Persian: a recipe for creating TTS datasets for lower resource languages,

    M. F. Qharabagh, Z. Dehghanian, and H. R. Rabiee, “ManaTTS Persian: a recipe for creating TTS datasets for lower resource languages,” inProc. ACL, 2025, pp. 9177–9206

  61. [61]

    The NUS sung and spoken lyrics corpus: A quantitative comparison of singing and speech,

    Z. Duan, H. Fang, B. Li, K. C. Sim, and Y . Wang, “The NUS sung and spoken lyrics corpus: A quantitative comparison of singing and speech,” inAsia-Pac. Signal Inf. Process. Assoc. Annu. Summit Conf., 2013, pp. 1–9

  62. [62]

    Automatic identifica- tion of emotional cues in Chinese opera singing,

    D. A. Black, M. Li, and M. Tian, “Automatic identifica- tion of emotional cues in Chinese opera singing,”Proc. ICMPC, 2014

  63. [63]

    V ocalSet: A Singing V oice Dataset,

    J. Wilkins, P. Seetharaman, A. Wahl, and B. Pardo, “V ocalSet: A Singing V oice Dataset,” inProc. ISMIR, 2018, pp. 468–474

  64. [64]

    JSUT corpus: free large-scale Japanese speech corpus for end-to-end speech synthesis

    R. Sonobe, S. Takamichi, and H. Saruwatari, “JSUT corpus: free large-scale Japanese speech corpus for end- to-end speech synthesis,”arXiv:1711.00354, 2017

  65. [65]

    Jingju a Cappella Recordings Collection (Version 2.0),

    R. Gong, R. Caro, Y . Yang, and X. Serra, “Jingju a Cappella Recordings Collection (Version 2.0),” 10.5281/zenodo.6536490, 2022

  66. [66]

    PJS: phoneme-balanced Japanese singing-voice corpus,

    J. Koguchi, S. Takamichi, and M. Morise, “PJS: phoneme-balanced Japanese singing-voice corpus,” in Asia-Pac. Signal Inf. Process. Assoc. Annu. Summit Conf., 2020, pp. 487–491

  67. [67]

    Children’s song dataset for singing voice research,

    S. Choi, W. Kim, S. Park, S. Yong, and J. Nam, “Children’s song dataset for singing voice research,” in Proc. ISMIR, vol. 4, 2020

  68. [68]

    JVS-MuSiC: Japanese multispeaker singing-voice cor- pus,

    H. Tamaru, S. Takamichi, N. Tanji, and H. Saruwatari, “JVS-MuSiC: Japanese multispeaker singing-voice cor- pus,”arXiv:2001.07044, 2020. 13

  69. [69]

    Muskits: an End-to-end Music Processing Toolkit for Singing V oice Synthesis,

    J. Shiet al., “Muskits: an End-to-end Music Processing Toolkit for Singing V oice Synthesis,” inProc. Inter- speech, 2022, pp. 4277–4281

  70. [70]

    Multi-Singer: Fast Multi-Singer Singing V oice V ocoder With A Large-Scale Corpus,

    R. Huang, F. Chen, Y . Ren, J. Liu, C. Cui, and Z. Zhao, “Multi-Singer: Fast Multi-Singer Singing V oice V ocoder With A Large-Scale Corpus,” inProc. ACM MM, 2021, pp. 3945–3954

  71. [71]

    NHSS: A speech and singing parallel database,

    B. Sharma, X. Gao, K. Vijayan, X. Tian, and H. Li, “NHSS: A speech and singing parallel database,” Speech Commun., vol. 133, pp. 9–22, 2021

  72. [72]

    Diff- Singer: Singing V oice Synthesis via Shallow Diffusion Mechanism,

    J. Liu, C. Li, Y . Ren, F. Chen, and Z. Zhao, “Diff- Singer: Singing V oice Synthesis via Shallow Diffusion Mechanism,” inProc. AAAI, 2022, pp. 11 020–11 028

  73. [73]

    Learning the Beauty in Songs: Neural Singing V oice Beautifier,

    J. Liu, C. Li, Y . Ren, Z. Zhu, and Z. Zhao, “Learning the Beauty in Songs: Neural Singing V oice Beautifier,” inProc. ACL, 2022, pp. 7970–7983

  74. [74]

    Opencpop: A High-Quality Open Source Chinese Popular Song Corpus for Singing V oice Synthesis,

    Y . Wanget al., “Opencpop: A High-Quality Open Source Chinese Popular Song Corpus for Singing V oice Synthesis,” inProc. Interspeech, 2022, pp. 4242–4246

  75. [75]

    M4Singer: A Multi-Style, Multi-Singer and Musical Score Provided Mandarin Singing Corpus,

    L. Zhanget al., “M4Singer: A Multi-Style, Multi-Singer and Musical Score Provided Mandarin Singing Corpus,” inProc. NeurIPS, 2022

  76. [76]

    SingStyle111: A Multilingual Singing Dataset With Style Transfer,

    S. Dai, Y . Wu, S. Chen, R. Huang, and R. B. Dannen- berg, “SingStyle111: A Multilingual Singing Dataset With Style Transfer,” inProc. ISMIR, 2023, pp. 765– 773

  77. [77]

    FT- GAN: Fine-Grained Tune Modeling for Chinese Opera Synthesis,

    M. Zheng, P. Bai, X. Shi, X. Zhou, and Y . Yan, “FT- GAN: Fine-Grained Tune Modeling for Chinese Opera Synthesis,” inProc. AAAI, 2024, pp. 19 697–19 705

  78. [78]

    Singing V oice Data Scaling-up: An Introduction to ACE-Opencpop and ACE-KiSing,

    J. Shiet al., “Singing V oice Data Scaling-up: An Introduction to ACE-Opencpop and ACE-KiSing,” in Proc. Interspeech, 2024

  79. [79]

    SingNet: Towards a Large-Scale, Diverse, and In-The-Wild Singing V oice Dataset,

    Y . Guet al., “SingNet: Towards a Large-Scale, Diverse, and In-The-Wild Singing V oice Dataset,” arXiv:2505.09325, 2025

  80. [80]

    A real-time system for mea- suring sound goodness in instrumental sounds,

    O. Romani Picaset al., “A real-time system for mea- suring sound goodness in instrumental sounds,” inProc. AES, vol. 138, 2015

Showing first 80 references.