Diff-VS is an efficient audio-aware diffusion U-Net for vocal separation that matches discriminative baselines on objective metrics while achieving perceptual quality comparable to state-of-the-art systems, as assessed by proxy subjective metrics.
Diff-VS: Efficient Audio-Aware Diffusion U-Net for Vocals Separation
Abstract
While diffusion models are best known for their performance in generative tasks, they have also been successfully applied to many other tasks, including audio source separation. However, current generative approaches to music source separation often underperform on standard objective metrics. In this paper, we address this issue by introducing a novel generative vocal separation model based on the Elucidated Diffusion Model (EDM) framework. Our model processes complex short-time Fourier transform spectrograms and employs an improved U-Net architecture based on music-informed design choices. Our approach matches discriminative baselines on objective metrics and achieves perceptual quality comparable to state-of-the-art systems, as assessed by proxy subjective metrics. We hope these results encourage broader exploration of generative methods for music source separation.
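As a rough illustration of two ingredients named in the abstract, the sketch below shows one way a complex STFT could be presented to a network as two real-valued channels, together with the noise-level sampling and preconditioning coefficients from the general EDM formulation. It is not the authors' implementation: the function names, the STFT settings (`n_fft=1024`, `hop=256`), and the use of the generic EDM defaults (`sigma_data=0.5`, `p_mean=-1.2`, `p_std=1.2`) are assumptions made here for illustration, and the abstract does not state which values Diff-VS uses or how the mixture conditions the model.

```python
import numpy as np


def stft_as_channels(audio, n_fft=1024, hop=256):
    """Return a (2, frames, bins) real array holding the real and imaginary
    parts of a Hann-windowed STFT. n_fft and hop are assumed values."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack(
        [audio[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    spec = np.fft.rfft(frames, axis=-1)  # complex, shape (frames, n_fft // 2 + 1)
    return np.stack([spec.real, spec.imag])


def sample_sigma(rng, p_mean=-1.2, p_std=1.2):
    """Draw a training noise level from a log-normal distribution
    (generic EDM defaults, assumed here)."""
    return float(np.exp(p_mean + p_std * rng.standard_normal()))


def edm_preconditioning(sigma, sigma_data=0.5):
    """Generic EDM preconditioning coefficients for noise level sigma."""
    denom = sigma**2 + sigma_data**2
    c_skip = sigma_data**2 / denom
    c_out = sigma * sigma_data / np.sqrt(denom)
    c_in = 1.0 / np.sqrt(denom)
    c_noise = np.log(sigma) / 4.0
    return c_skip, c_out, c_in, c_noise


# Hypothetical usage: corrupt a stand-in "clean vocals" spectrogram.
rng = np.random.default_rng(0)
vocals = rng.standard_normal(16000)           # placeholder signal, not real audio
x = stft_as_channels(vocals)                  # two real channels: Re and Im
sigma = sample_sigma(rng)
noisy = x + sigma * rng.standard_normal(x.shape)
c_skip, c_out, c_in, c_noise = edm_preconditioning(sigma)
```

In the EDM formulation, a denoiser is assembled as `c_skip * noisy + c_out * F(c_in * noisy, c_noise)`, where `F` is the raw network. In a separation setting `F` would also need to see the mixture, which this sketch leaves out.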