Hybrid Spectrogram and Waveform Source Separation

Alexandre Défossez

arXiv: 2111.03600 · v3 · pith:DF55HVOZnew · submitted 2021-11-05 · eess.AS · cs.SD · stat.ML

classification: eess.AS, cs.SD, stat.ML

keywords: hybrid, source, demucs, separation, architecture, domain, improvement, model

Source separation models work in either the spectrogram or the waveform domain. In this work, we show how to perform end-to-end hybrid source separation, letting the model decide which domain is best suited for each source, and even combining both. The proposed hybrid version of the Demucs architecture won the Music Demixing Challenge 2021 organized by Sony. This architecture also comes with additional improvements, such as compressed residual branches, local attention, and singular value regularization. Overall, a 1.4 dB improvement in the Signal-to-Distortion Ratio (SDR) was observed across all sources as measured on the MusDB HQ dataset, an improvement confirmed by human subjective evaluation, with overall quality rated at 2.83 out of 5 (2.36 for the non-hybrid Demucs) and absence of contamination at 3.04 (against 2.37 for the non-hybrid Demucs and 2.44 for the second-ranking model submitted to the competition).
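
To make the SDR figure concrete, here is a minimal sketch of how a signal-to-distortion ratio in decibels can be computed from a reference source and a model's estimate. It assumes the simplest energy-ratio definition, 10·log10(‖s‖² / ‖s − ŝ‖²); the abstract does not state which SDR variant was used for the reported numbers, so this should not be read as the paper's evaluation code. The function name `sdr_db`, the `eps` stabiliser, and the toy sine signal are all illustrative assumptions.

```python
import math


def sdr_db(reference, estimate, eps=1e-12):
    """Energy-ratio SDR in dB between two equal-length sample sequences.

    Assumed definition (not taken from the paper):
        10 * log10(sum(s^2) / sum((s - s_hat)^2))
    `eps` only guards against log(0) and division by zero.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


# Toy example: a 440 Hz sine as the "true" source, and an estimate that is
# the same signal scaled by 0.9. The error is then 10% of the amplitude,
# i.e. 1% of the energy, which corresponds to roughly 20 dB.
sample_rate = 44100
reference = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(4410)]
estimate = [0.9 * x for x in reference]

print(round(sdr_db(reference, estimate), 2))
```

Under this definition, and for a fixed reference signal, a 1.4 dB gain corresponds to the error energy shrinking by a factor of about 10^0.14 ≈ 1.38.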

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HAIM: Human-AI Music Datasets for AI Music Production Tracking Benchmark
cs.SD · 2026-06 · unverdicted · novelty 7.0
HAIM is a new labeled dataset for granular tracking of AI interventions across music production stages, enabling evaluation beyond binary AI-or-human classification.

UnityShots: Memory-Driven Multi-Shot Audio-Video Generation with Boundary-Aware Gating
cs.CV · 2026-06 · unverdicted · novelty 6.0
UnityShots uses fixed LTM and STM memory slots with boundary-conditioned gating and speaker tokens to achieve coherent multi-shot audio-video generation, leading open-source baselines on cross-shot coherence metrics.

Discrete Token Modeling for Multi-Stem Music Source Separation with Language Models
eess.AS · 2026-04 · unverdicted · novelty 6.0
A Conformer-conditioned decoder-only language model generates discrete tokens via a neural audio codec to separate four music stems, reaching near state-of-the-art perceptual quality and top NISQA on vocals in MUSDB18...