UniVoice is a conditional flow matching model with a Diffusion Transformer backbone that unifies TTS and SVS via modality-specific encoders and a null melody token for speech, achieving 5.26% speech PER and 16.22% singing PER.
MakeSinger: A semi-supervised training method for data-efficient singing voice synthesis via classifier-free diffusion guidance.arXiv preprint arXiv:2406.05965, 2024
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.SD 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
UniVoice: A Unified Model for Speech and Singing Voice Generation
UniVoice is a conditional flow matching model with a Diffusion Transformer backbone that unifies TTS and SVS via modality-specific encoders and a null melody token for speech, achieving 5.26% speech PER and 16.22% singing PER.