Diff-VS: Efficient Audio-Aware Diffusion U-Net for Vocals Separation

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

While diffusion models are best known for their performance in generative tasks, they have also been successfully applied to many other tasks, including audio source separation. However, current generative approaches to music source separation often underperform on standard objective metrics. In this paper, we address this issue by introducing a novel generative vocal separation model based on the Elucidated Diffusion Model (EDM) framework. Our model processes complex short-time Fourier transform spectrograms and employs an improved U-Net architecture based on music-informed design choices. Our approach matches discriminative baselines on objective metrics and achieves perceptual quality comparable to state-of-the-art systems, as assessed by proxy subjective metrics. We hope these results encourage broader exploration of generative methods for music source separation.

citation-role summary

extension 1

citation-polarity summary

fields

eess.AS 1

years

2026 1

verdicts

UNVERDICTED 1

roles

extension 1

polarities

extend 1

citing papers explorer

Showing 1 of 1 citing paper.

  • Diff-VS: Efficient Audio-Aware Diffusion U-Net for Vocals Separation eess.AS · 2026-04-01 · unverdicted · none · ref 4 · internal anchor

    Diff-VS is an efficient audio-aware diffusion U-Net for vocal separation that matches discriminative baselines on objective metrics while achieving state-of-the-art perceptual quality via proxy measures.