Single-Channel Multi-Speaker Separation using Deep Clustering

John R. Hershey; Jonathan Le Roux; Shinji Watanabe; Yusuf Isik; Zhuo Chen

arxiv: 1607.02173 · v1 · pith:MPLIN3UWnew · submitted 2016-07-07 · 💻 cs.LG · cs.SD · stat.ML

Yusuf Isik, Jonathan Le Roux, Zhuo Chen, Shinji Watanabe, John R. Hershey

keywords: separation, signal, clustering, baseline, deep, end-to-end, performance, approximation

Deep clustering is a recently introduced deep learning architecture that uses discriminatively trained embeddings as the basis for clustering. It was recently applied to spectrogram segmentation, yielding impressive results on speaker-independent multi-speaker separation. In this paper we extend the baseline system with an end-to-end signal approximation objective that greatly improves performance on a challenging speech separation task. We first significantly improve upon the baseline system performance by incorporating better regularization, larger temporal context, and a deeper architecture, culminating in an overall improvement in signal-to-distortion ratio (SDR) of 10.3 dB compared to the baseline of 6.0 dB for two-speaker separation, as well as a 7.1 dB SDR improvement for three-speaker separation. We then extend the model to incorporate an enhancement layer to refine the signal estimates, and perform end-to-end training through both the clustering and enhancement stages to maximize signal fidelity. We evaluate the results using automatic speech recognition. The new signal approximation objective, combined with end-to-end training, produces unprecedented performance, reducing the word error rate (WER) from 89.1% down to 30.8%. This represents a major advancement towards solving the cocktail party problem.
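The abstract reports its gains as SDR improvements in dB but does not define the metric. As a rough illustration only, the sketch below uses the plain energy-ratio form, SDR = 10 log10(||s||^2 / ||s - s_hat||^2), and measures improvement as the SDR of the separated estimate minus the SDR of the unprocessed mixture. This simplified definition, the function names `sdr_db` and `sdr_improvement_db`, and the toy signals are assumptions made for this example; the paper's figures may come from a different evaluation protocol.

```python
import math


def sdr_db(reference, estimate):
    """Simplified signal-to-distortion ratio in dB (assumed energy-ratio form)."""
    signal_energy = sum(s * s for s in reference)
    error_energy = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    if signal_energy == 0.0:
        raise ValueError("reference signal must have nonzero energy")
    if error_energy == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)


def sdr_improvement_db(reference, mixture, estimate):
    """SDR of the estimate minus SDR of the raw mixture, both against the reference."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)


# Toy signals (hypothetical): the mixture adds an interferer with the same
# energy as the target, and the estimate leaves a small residual error.
reference = [1.0, -1.0, 1.0, -1.0]
mixture = [2.0, 0.0, 2.0, 0.0]
estimate = [1.1, -0.9, 1.1, -0.9]

print(round(sdr_db(reference, mixture), 2))
print(round(sdr_improvement_db(reference, mixture, estimate), 2))
```

In this toy case the mixture sits at about 0 dB and the estimate at about 20 dB, so the improvement is roughly 20 dB. Reading the abstract's numbers the same way, an improvement figure describes how much closer the separated signal is to the clean source than the mixture was.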

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Discriminative-Generative Target Speaker Extraction with Decoder-Only Language Models
eess.AS 2026-01 unverdicted novelty 6.0

A hybrid two-stage framework pairs a discriminative front-end for interference suppression with a generative decoder-only LM back-end to improve perceptual quality and speaker consistency in target speaker extraction ...

Discriminative Learning for Monaural Speech Separation Using Deep Embedding Features
cs.SD 2019-07 unverdicted novelty 4.0

Jointly trains deep clustering embeddings with uPIT plus discriminative fine-tuning to report better speaker-independent separation than either baseline alone on WSJ0-2mix.