Towards Unified Song Generation and Singing Voice Conversion with Accompaniment Co-Generation

2026 · cs.SD · arXiv:2606.07015

abstract

While song generation and singing voice conversion (SVC) have evolved significantly, they have long been developed in isolation: the former lacks zero-shot speaker cloning, while the latter overlooks vocal-accompaniment synergy. To bridge this gap, we propose UniSinger, the first end-to-end framework unifying speaker-cloning song generation and accompaniment co-generation SVC. Building on the multimodal diffusion transformer, we construct a unified speaker embedding space that transfers speaker representations from SVC to song generation, enabling fine-grained cross-task timbre control. To mitigate multi-task optimization conflicts, we design a curriculum learning strategy that uses task-specific modality masking to guide the model to gradually master the generative mechanisms among semantic content, vocal timbre, and accompaniment. Experiments show that UniSinger achieves state-of-the-art performance on both tasks and realizes complementary benefits, offering new possibilities for intelligent music production.
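The abstract does not specify how the curriculum or the masking is implemented, so the following is only a minimal sketch of what step-dependent, task-specific modality masking could look like. The modality names are taken from the abstract's wording; the schedule boundaries, the per-task visibility sets, and the helper names `visible_modalities` and `apply_modality_mask` are all hypothetical and are not drawn from the paper.

```python
from typing import Any, Dict, Optional, Set

# Modality names follow the abstract's wording; everything else is an assumption.
MODALITIES = ("semantic_content", "vocal_timbre", "accompaniment")

# Hypothetical curriculum: (first training step of the stage, {task: visible modalities}).
# Later stages expose more modalities, so the model sees them gradually.
SCHEDULE = [
    (0, {"song_generation": {"semantic_content"},
         "svc": {"semantic_content", "vocal_timbre"}}),
    (10_000, {"song_generation": {"semantic_content", "vocal_timbre"},
              "svc": {"semantic_content", "vocal_timbre"}}),
    (20_000, {"song_generation": set(MODALITIES),
              "svc": set(MODALITIES)}),
]


def visible_modalities(step: int, task: str) -> Set[str]:
    """Return the modalities left unmasked for `task` at training step `step`."""
    current = SCHEDULE[0][1]
    for start, table in SCHEDULE:
        if step >= start:
            current = table
    return current[task]


def apply_modality_mask(
    inputs: Dict[str, Any], step: int, task: str
) -> Dict[str, Optional[Any]]:
    """Replace every modality that is masked at this stage with None."""
    keep = visible_modalities(step, task)
    return {name: (value if name in keep else None) for name, value in inputs.items()}
```

In a real training loop the masked entries would more likely be replaced by a learned null embedding or zeros than by `None`; `None` is used here only to keep the sketch free of framework dependencies.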

representative citing papers

Towards Unified Song Generation and Singing Voice Conversion with Accompaniment Co-Generation

cs.SD · 2026-06-05 · unverdicted · novelty 7.0

UniSinger unifies speaker-cloned song generation and accompaniment co-generation SVC in one multimodal diffusion transformer model trained with curriculum learning via task-specific modality masking.

