Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation

arxiv: 1806.03185 · v1 · pith:LNVWQKN3new · submitted 2018-06-08 · 💻 cs.SD · eess.AS· stat.ML

Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation

Daniel Stoller , Sebastian Ewert , Simon Dixon This is my paper

classification 💻 cs.SD eess.ASstat.ML

keywords separationsourceaudioarchitecturecontextend-to-endhighinformation

0 comments p. Extension

pith:LNVWQKN3 Add to your LaTeX paper

What is a Pith Number?

\usepackage{pith}
\pithnumber{LNVWQKN3}

Prints a linked pith:LNVWQKN3 badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

read the original abstract

Models for audio source separation usually operate on the magnitude spectrum, which ignores phase information and makes separation performance dependant on hyper-parameters for the spectral front-end. Therefore, we investigate end-to-end source separation in the time-domain, which allows modelling phase information and avoids fixed spectral transformations. Due to high sampling rates for audio, employing a long temporal input context on the sample level is difficult, but required for high quality separation results because of long-range temporal correlations. In this context, we propose the Wave-U-Net, an adaptation of the U-Net to the one-dimensional time domain, which repeatedly resamples feature maps to compute and combine features at different time scales. We introduce further architectural improvements, including an output layer that enforces source additivity, an upsampling technique and a context-aware prediction framework to reduce output artifacts. Experiments for singing voice separation indicate that our architecture yields a performance comparable to a state-of-the-art spectrogram-based U-Net architecture, given the same data. Finally, we reveal a problem with outliers in the currently used SDR evaluation metrics and suggest reporting rank-based statistics to alleviate this problem.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Latent Fourier Transform
cs.SD 2026-04 unverdicted novelty 7.0

LatentFT uses latent-space Fourier transforms and frequency masking in diffusion autoencoders to enable timescale-specific manipulation of musical structure in generative models.
CodecSep: Prompt-Driven Universal Sound Separation on Neural Audio Codec Latents
cs.SD 2025-09 unverdicted novelty 6.0

CodecSep performs prompt-driven universal sound separation directly in neural audio codec latents by combining a frozen DAC backbone with a lightweight FiLM-conditioned Transformer masker driven by CLAP embeddings, yi...
End-to-end Automated Deep Neural Network Optimization for PPG-based Blood Pressure Estimation on Wearables
cs.LG 2026-04 unverdicted novelty 5.0

An end-to-end hardware-aware optimization pipeline produces DNNs for PPG-based blood pressure estimation with up to 7.99% lower error and 83x fewer parameters that fit on ultra-low-power SoCs like GAP8.
A Survey of Advancing Audio Super-Resolution and Bandwidth Extension from Discriminative to Generative Models
eess.AS 2026-05 unverdicted novelty 2.0

A structured survey of audio bandwidth extension that organizes the transition from deterministic discriminative DNNs to generative approaches including GANs, diffusion models, and flow-based methods.