Discriminative Learning for Monaural Speech Separation Using Deep Embedding Features

Bin Liu; Cunhang Fan; Jiangyan Yi; Jianhua Tao; Zhengqi Wen

arXiv: 1907.09884 · v1 · pith:4WI76UG2new · submitted 2019-07-23 · 💻 cs.SD · cs.LG · eess.AS

Pith reviewed 2026-05-24 17:01 UTC · model grok-4.3

classification 💻 cs.SD · cs.LG · eess.AS

keywords speech separation · deep clustering · permutation invariant training · monaural audio · discriminative learning · speaker independent · embedding features · uPIT

The pith

Combining deep clustering embeddings with uPIT and discriminative learning improves speaker-independent monaural speech separation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes training a deep clustering (DC) network to extract embedding features that capture source information, then feeding those features into uPIT for direct source separation. The two networks are trained jointly, so the actual separation objective is optimized instead of the embeddings and the permutation assignment being handled separately. Discriminative learning is added to increase the distance between the chosen permutation and the alternatives. The approach is tested on the WSJ0-2mix dataset and is reported to give better results than DC or uPIT alone. A sympathetic reader would care because it simplifies the pipeline for speech separation in multi-speaker environments.

Core claim

By extracting deep embedding features with a DC network and using them as input to uPIT, then jointly training the system while applying discriminative learning to maximize permutation distances, the method directly optimizes separation objectives and outperforms both DC and uPIT on speaker-independent tasks.
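The permutation side of this claim can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: it computes an utterance-level mean square error for every assignment of estimates to targets, keeps the lowest one (the uPIT choice), and then subtracts a weighted sum of the other permutations' errors so that they are pushed away. The exact form of the discriminative term and the weight `alpha` are assumptions made for illustration; the text reproduced on this page does not spell them out.

```python
import itertools
import numpy as np

def permutation_losses(estimates, targets):
    """Utterance-level MSE for every assignment of estimates to targets.

    estimates, targets: arrays of shape (S, T, F) holding S sources over
    T frames and F frequency bins.
    Returns a dict mapping each permutation tuple to its MSE.
    """
    num_sources = estimates.shape[0]
    losses = {}
    for perm in itertools.permutations(range(num_sources)):
        # perm[s] is the index of the target paired with estimate s.
        losses[perm] = float(np.mean((estimates - targets[list(perm)]) ** 2))
    return losses

def upit_discriminative_loss(estimates, targets, alpha=0.1):
    """Lowest-permutation MSE minus alpha times the other permutations' MSE.

    alpha is a hypothetical weight; alpha=0 recovers the plain uPIT loss.
    Returns (loss, best_permutation).
    """
    losses = permutation_losses(estimates, targets)
    best_perm = min(losses, key=losses.get)
    others = sum(v for p, v in losses.items() if p != best_perm)
    return losses[best_perm] - alpha * others, best_perm
```

Calling `upit_discriminative_loss(estimates, targets, alpha=0.0)` gives the plain uPIT value, which makes it easy to compare the two objectives on the same batch.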

What carries the argument

Deep embedding features from the DC network serving as input to uPIT, with joint training and a discriminative objective.
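As a rough picture of the data flow, the sketch below passes a toy mixture magnitude through an embedding stage and then a separation stage that outputs one soft mask per source. Single random linear maps stand in for the recurrent networks, and all sizes and weight names (`w_embed`, `w_sep`) are hypothetical; only the order of operations, the tanh on the embeddings, and the ReLU on the masks follow the descriptions quoted on this page.

```python
import numpy as np

rng = np.random.default_rng(0)
T, F, D, S = 6, 8, 4, 2  # frames, frequency bins, embedding dim, sources (toy sizes)

def embedding_net(mix_mag, weights):
    """Stand-in for the DC network: one D-dim embedding per time-frequency bin."""
    return np.tanh(mix_mag[..., None] * weights)          # (T, F, D)

def separation_net(embeddings, weights):
    """Stand-in for the uPIT network: per-frame embeddings to S soft masks."""
    frames = embeddings.reshape(T, F * D)                  # stack all bins of a frame
    masks = np.maximum(frames @ weights, 0.0)              # ReLU output, (T, S * F)
    return masks.reshape(T, S, F).transpose(1, 0, 2)       # (S, T, F)

mix_mag = np.abs(rng.standard_normal((T, F)))              # toy mixture magnitude
w_embed = rng.standard_normal(D)                           # hypothetical weights
w_sep = rng.standard_normal((F * D, S * F))                # hypothetical weights

embeddings = embedding_net(mix_mag, w_embed)
masks = separation_net(embeddings, w_sep)
estimates = masks * mix_mag[None]                          # masked magnitudes, (S, T, F)
```

Because the masks are produced directly from the embeddings, no clustering step sits between the two stages, which is what allows a separation loss on `estimates` to reach the embedding weights during joint training.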

If this is right

The separation process becomes a single optimizable pipeline focused on actual source recovery.
Embedding features provide better speaker discrimination than standard uPIT inputs.
Joint training avoids the two-step complexity of traditional DC pipelines.
Discriminative fine-tuning increases separation quality by penalizing close permutations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This suggests embedding spaces learned for clustering can be repurposed to guide permutation-invariant training in other audio tasks.
Future work could test if the joint model scales to three or more overlapping speakers without additional changes.
The method implies that maximizing inter-permutation distances is a general way to improve multi-source separation models.

Load-bearing premise

The deep embedding features from the DC network contain enough distinct information about each source to allow uPIT to discriminate and separate the target speakers effectively.

What would settle it

Running the proposed joint model on the WSJ0-2mix dataset and finding no improvement in standard metrics like SDR compared to baseline DC and uPIT implementations would falsify the performance claim.
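For readers who want to run such a check, the following is a simplified energy-ratio version of SDR together with the improvement over the unprocessed mixture. It is an assumption-laden stand-in: published results on this benchmark are normally computed with the BSS Eval toolkit, which decomposes the error further, so numbers from this sketch are not directly comparable.

```python
import numpy as np

def simple_sdr_db(reference, estimate, eps=1e-12):
    """10 * log10(||s||^2 / ||s - s_hat||^2), a simplified SDR in dB."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    signal_power = np.sum(reference ** 2)
    error_power = np.sum((reference - estimate) ** 2)
    # eps keeps the ratio finite for silent references or perfect estimates.
    return 10.0 * np.log10((signal_power + eps) / (error_power + eps))

def sdr_improvement_db(reference, estimate, mixture):
    """Gain of the separated estimate over using the raw mixture as the estimate."""
    return simple_sdr_db(reference, estimate) - simple_sdr_db(reference, mixture)
```

Averaging `sdr_improvement_db` over the test utterances for the joint model and for each baseline gives the comparison described above.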

Figures

Figures reproduced from arXiv: 1907.09884 by Bin Liu, Cunhang Fan, Jiangyan Yi, Jianhua Tao, Zhengqi Wen.

**Figure 1.** Schematic diagram of our proposed DL-DEF speech separation system. DC loss is the loss of deep clustering.

3.2. Speech separation model based on deep embedding features. Different from deep clustering [12] utilizing the K-means clustering to acquire hard masks, we use the embedding vectors as the input of uPIT to directly learn each source’s soft masks. Therefore, the DC and uPIT can be trained end-to-end.…

Original abstract

Deep clustering (DC) and utterance-level permutation invariant training (uPIT) have been demonstrated promising for speaker-independent speech separation. DC is usually formulated as two-step processes: embedding learning and embedding clustering, which results in complex separation pipelines and a huge obstacle in directly optimizing the actual separation objectives. As for uPIT, it only minimizes the chosen permutation with the lowest mean square error, doesn't discriminate it with other permutations. In this paper, we propose a discriminative learning method for speaker-independent speech separation using deep embedding features. Firstly, a DC network is trained to extract deep embedding features, which contain each source's information and have an advantage in discriminating each target speakers. Then these features are used as the input for uPIT to directly separate the different sources. Finally, uPIT and DC are jointly trained, which directly optimizes the actual separation objectives. Moreover, in order to maximize the distance of each permutation, the discriminative learning is applied to fine tuning the whole model. Our experiments are conducted on WSJ0-2mix dataset. Experimental results show that the proposed models achieve better performances than DC and uPIT for speaker-independent speech separation.
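The embedding-learning step is usually trained with the affinity objective from the deep clustering literature, the squared Frobenius norm of V Vᵀ − Y Yᵀ, where V holds one embedding per time-frequency bin and Y the one-hot source assignment of each bin. The abstract does not state the loss, so the sketch below assumes that standard form. It uses the usual low-rank expansion so that the N × N affinity matrices are never built.

```python
import numpy as np

def deep_clustering_loss(embeddings, assignments):
    """Affinity loss ||V V^T - Y Y^T||_F^2 computed with small matrices only.

    embeddings:  (N, D) array V, one D-dim embedding per time-frequency bin.
    assignments: (N, S) one-hot array Y marking the dominant source per bin.
    """
    vtv = embeddings.T @ embeddings     # (D, D)
    vty = embeddings.T @ assignments    # (D, S)
    yty = assignments.T @ assignments   # (S, S)
    # ||VV^T - YY^T||^2 = ||V^T V||^2 - 2 ||V^T Y||^2 + ||Y^T Y||^2
    return float(np.sum(vtv ** 2) - 2.0 * np.sum(vty ** 2) + np.sum(yty ** 2))
```

The loss is zero when the embeddings reproduce the assignment affinities exactly, for instance when `embeddings` equals `assignments`.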

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper combines DC embeddings as input to uPIT with joint training and a discriminative permutation loss to report better results than either baseline alone on WSJ0-2mix.

The main point is that they extract embeddings from a trained DC network, feed those directly into uPIT for separation, train the combined system end-to-end, and add a loss term that enlarges the distance between different permutations. This produces the claimed performance lift over standalone DC and uPIT on the WSJ0-2mix benchmark.

The approach addresses the two-step nature of DC and the single-permutation focus of uPIT in one pipeline. Joint optimization lets the separation objective shape the embeddings, and the discriminative term is a direct response to uPIT's limitation. The method description stays consistent from architecture to claimed outcome. The experiments stay on the standard dataset, which keeps the comparison fair.

The soft spots are modest. The abstract supplies no numbers, so the size of the gain and whether it justifies the extra steps cannot be judged without the full results and ablations. The statement that the embeddings discriminate speakers well follows from DC training but receives little extra validation. If the improvement is small, the added complexity may not move the needle much.

This is for researchers already working on deep monaural separation who want a practical tweak on existing tools. It will not shift the broader field but gives a testable extension on public data. The logic is coherent and the claim is empirical rather than circular, so it deserves a serious referee even if revisions are needed for the numbers and analysis.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes a hybrid approach for speaker-independent monaural speech separation: a DC network is first trained to produce deep embedding features, which are then fed as input to uPIT; the two are jointly trained to optimize the separation objective directly, followed by discriminative fine-tuning that enlarges distances between permutations. Experiments are reported on the WSJ0-2mix corpus, with the claim that the resulting models outperform standalone DC and uPIT.

Significance. If the reported gains hold under rigorous evaluation, the method would usefully combine the representational advantages of learned embeddings with direct optimization of the separation loss and explicit discrimination among permutations, addressing two acknowledged limitations of the baseline techniques.

minor comments (3)

Abstract: the sentence 'have an advantage in discriminating each target speakers' contains a grammatical error ('speakers' should be 'speaker').
Abstract: the claim of 'better performances' is stated without any numerical values, error bars, or reference to the specific metrics (e.g., SDR, PESQ) or tables that appear later in the paper; a brief quantitative summary would improve readability.
The manuscript should clarify whether the joint-training stage and the subsequent discriminative fine-tuning are performed sequentially or with a combined loss, and whether any hyper-parameters control the relative weighting of the two terms.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the accurate summary of our work and the recommendation of minor revision. No major comments were listed in the report.

Circularity Check

0 steps flagged

No significant circularity; empirical performance claim on held-out data

The paper proposes a joint DC+uPIT architecture with discriminative fine-tuning and reports experimental results on WSJ0-2mix. No derivation chain exists that reduces by construction to fitted parameters, self-citations, or renamed inputs. The embedding features are explicitly defined as DC outputs fed into uPIT; joint training directly optimizes the separation objective; performance gains are measured on separate test data. This is a standard empirical ML contribution with no load-bearing self-definitional or fitted-input steps.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on one domain assumption about embedding quality and on neural-network parameters fitted to the WSJ0-2mix training set; no new physical entities are introduced.

free parameters (1)

neural network parameters
All weights in the DC and uPIT networks are fitted to minimize the joint separation loss on the training portion of WSJ0-2mix.

axioms (1)

domain assumption: Deep embedding features extracted by the DC network contain each source's information and discriminate target speakers.
This premise is invoked to justify using the embeddings as input to uPIT.

pith-pipeline@v0.9.0 · 5748 in / 1334 out tokens · 32256 ms · 2026-05-24T17:01:57.286545+00:00 · methodology

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 2 internal anchors

[1] Introduction. Monaural speech separation aims to estimate target sources from mixed signals in a single-channel. It is a very challenging task, which is known as the cocktail party problem [1]. In order to solve the cocktail party problem, many works have been done over the decades. Traditional speech separation methods include computational auditory s...

[2] method is proposed. DANet creates attractor points in a high-dimensional embedding space of the acoustic signals. Then the similarities between the embedded points and each attractor are used to directly estimate a soft separation mask at the training stage. Unfortunately, it enables end-to-end training while still requiring K-means at the testing stage...

[3] is applied to speech separation, which uses a multi-task learning architecture to combine the DC and uPIT. However, it simply employs the DC and uPIT as two outputs rather than fusion with each other. In this paper, in order to address the problems of DC and uPIT, we propose a discriminative learning method for speaker-independent speech separation with ...

[4] Single Channel Speech Separation. The object of single channel speech separation is to separate target sources from a mixed signal. $$y(t) = \sum_{s=1}^{S} x_s(t) \tag{1}$$ where $y(t)$ is the mixed speech, $S$ is the number of source signals and $x_s(t)$, $s = 1, \ldots, S$ are target sources. The corresponding short-time Fourier transformation (STFT) of those signals are $Y(t,f)$ and $X_s$...

[5] The Proposed Speech Separation System. In this section, we present our proposed discriminative learning method for speaker-independent speech separation with deep embedding features, which is shown in Figure 1. From DC network [12] we can know that clusters in the deep embedding space can represent the inferred spectral masking patterns of individual sou...

[6] Experiments and Results. 4.1. Dataset. Our experiments are conducted on the WSJ0-2mix dataset [12], which is derived from WSJ corpus [20]. The WSJ0-2mix dataset consists three sets: training set (20,000 utterances about 30 hours), validation set (5,000 utterances about 10 hours) and test set (3,000 utterances about 5 hours). Specifically, training and valida...

[7] A tanh activation function is followed by the embedding layer. As for the separated network, it has only one BLSTM layer with 896 units. Therefore, there are three BLSTM layers in this work, which keeps the network configuration the same as baseline in [15]. A Rectified Linear Unit (ReLU) activation function is followed by the uPIT network, which is the mask...

[8] Conclusions. In this paper, we propose a speaker-independent speech separation method with discriminative learning based on deep embedding features. We firstly train an DC network to extract deep embedding features. Then these features are used as the input of uPIT system to directly separate the different speaker sources. Moreover, uPIT and DC are join...

[9] Acknowledgements. This work is supported by the National Key Research & Development Plan of China (No.2017YFB1002802), the NSFC (No.61425017, No.61831022, No.61771472, No.61603390), the Strategic Priority Research Program of Chinese Academy of Sciences (No.XDC02050100), and Inria-CAS Joint Research Project (No.173211KYSB20170061). Authors also thank Shua...

[10] J. A. O’Sullivan, A. J. Power, N. Mesgarani, S. Rajaram, J. J. Foxe, B. G. Shinn-Cunningham, M. Slaney, S. A. Shamma, and E. C. Lalor, “Attentional selection in a cocktail party environment can be decoded from single-trial EEG,” Cerebral Cortex, vol. 25, no. 7, p. 1697, 2015.

[11] D. Wang and G. J. Brown, Computational auditory scene analysis: Principles, algorithms, and applications. Wiley-IEEE Press, 2006.

[12] M. N. Schmidt and R. K. Olsson, “Single-channel speech separation using sparse non-negative matrix factorization,” in INTERSPEECH 2006 - ICSLP, Ninth International Conference on Spoken Language Processing, Pittsburgh, PA, USA, September 2006.

[13] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error log-spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 2, pp. 443–445, 1985.

[14] M. Cooke, J. R. Hershey, and S. J. Rennie, “Monaural speech separation and recognition challenge,” Computer Speech & Language, vol. 24, no. 1, pp. 1–15, 2010.

[15] H. Erdogan and T. Yoshioka, “Investigations on data augmentation and loss functions for deep learning based speech-background separation,” Proc. Interspeech 2018, pp. 3499–3503, 2018.

[16] J. Wang, J. Chen, D. Su, L. Chen, M. Yu, Y. Qian, and D. Yu, “Deep extractor network for target speaker recovery from single channel speech mixtures,” Proc. Interspeech 2018, pp. 307–311, 2018.

[17] Y. Luo and N. Mesgarani, “TasNet: time-domain audio separation network for real-time, single-channel speech separation,” in Proc. ICASSP. IEEE, 2018, pp. 696–700.

[18] C. Xu, W. Rao, X. Xiao, E. S. Chng, and H. Li, “Single channel speech separation with constrained utterance level permutation invariant training using grid LSTM,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 6–10.

[19] C. Fan, B. Liu, J. Tao, Z. Wen, J. Yi, and Y. Bai, “Utterance-level permutation invariant training with discriminative learning for single channel speech separation,” in Proc. ISCSLP. IEEE, 2018.

[20] K. Wang, F. Song, and X. Lei, “A pitch-aware approach to single-channel speech separation,” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.

[21] J. R. Hershey, Z. Chen, J. L. Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2016, pp. 31–35.

[22] Z. Chen, Y. Luo, and N. Mesgarani, “Deep attractor network for single-microphone speaker separation,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 246–250.

[23] D. Yu, M. Kolbæk, Z. H. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2017, pp. 241–245.

[24] M. Kolbæk, D. Yu, Z. Tan, and J. Jensen, “Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 25, no. 10, pp. 1901–1913, 2017.

[25] Z.-Q. Wang, J. Le Roux, and J. R. Hershey, “Alternative objective functions for deep clustering,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 686–690.

[26] Y. Luo, Z. Chen, J. R. Hershey, J. Le Roux, and N. Mesgarani, “Deep clustering and conventional networks for music separation: Stronger together,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 61–65.

[27] Y. Wang, A. Narayanan, and D. L. Wang, “On training targets for supervised speech separation,” IEEE/ACM Trans Audio Speech Lang Process, vol. 22, no. 12, pp. 1849–1858, 2014.

[28] H. Erdogan, J. R. Hershey, S. Watanabe, and J. L. Roux, “Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2015, pp. 708–712.

[29] J. Garofalo, D. Graff, D. Paul, and D. Pallett, “CSR-I (WSJ0) complete,” Linguistic Data Consortium, Philadelphia, 2007.

[30] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, and M. Devin, “TensorFlow: Large-scale machine learning on heterogeneous distributed systems,” 2016.

[31] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” Computer Science, 2014.

[32] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.

[33] A. W. Rix, M. P. Hollier, A. P. Hekstra, and J. G. Beerends, “Perceptual evaluation of speech quality (PESQ) the new ITU standard for end-to-end speech quality assessment part I–time-delay compensation,” Journal of the Audio Engineering Society, vol. 50, no. 10, pp. 755–764, 2002.

[34] C. Xu, W. Rao, E. S. Chng, and H. Li, “A shifted delta coefficient objective for monaural speech separation using multi-task learning,” in Proceedings of Interspeech, 2018, pp. 3479–3483.

[35] Y. Isik, J. L. Roux, Z. Chen, S. Watanabe, and J. R. Hershey, “Single-channel multi-speaker separation using deep clustering,” arXiv preprint arXiv:1607.02173, 2016.
