
arxiv: 1907.09884 · v1 · pith:4WI76UG2new · submitted 2019-07-23 · 💻 cs.SD · cs.LG · eess.AS

Discriminative Learning for Monaural Speech Separation Using Deep Embedding Features

Pith reviewed 2026-05-24 17:01 UTC · model grok-4.3

classification 💻 cs.SD · cs.LG · eess.AS
keywords speech separation · deep clustering · permutation invariant training · monaural audio · discriminative learning · speaker independent · embedding features · uPIT

The pith

Combining deep clustering embeddings with uPIT and discriminative learning improves speaker-independent monaural speech separation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains a deep clustering (DC) network to extract embedding features that carry source information, then feeds those features into utterance-level permutation invariant training (uPIT) to estimate each source directly. The two networks are trained jointly, so the system optimizes the actual separation objective rather than embedding quality or permutation choice in isolation. Discriminative learning is then applied during fine-tuning to widen the gap between the chosen permutation and the alternatives. On the WSJ0-2mix dataset, the abstract reports better performance than DC or uPIT alone, without quoting figures. A sympathetic reader would care because the approach replaces the separate K-means clustering step of the DC pipeline with directly learned soft masks for multi-speaker mixtures.

Core claim

By extracting deep embedding features with a DC network and using them as input to uPIT, then jointly training the system while applying discriminative learning to maximize permutation distances, the method directly optimizes separation objectives and outperforms both DC and uPIT on speaker-independent tasks.
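As a rough illustration of the objective described above, the sketch below computes an utterance-level permutation-invariant loss and then subtracts a weighted sum of the losses of the non-selected permutations. It is a minimal pure-Python sketch under stated assumptions, not the authors' implementation: the function names, the use of plain MSE on flat lists in place of masked magnitude spectrograms, and the weight `alpha` are all hypothetical choices made for illustration.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def upit_losses(estimates, targets):
    """Map each output-to-target permutation to its utterance-level loss.

    The loss for a permutation is the MSE averaged over all sources,
    computed once over the whole utterance (hence "utterance-level").
    """
    n = len(targets)
    return {
        perm: sum(mse(estimates[i], targets[p]) for i, p in enumerate(perm)) / n
        for perm in permutations(range(n))
    }


def discriminative_upit_loss(estimates, targets, alpha=0.1):
    """Best-permutation loss minus alpha times the other permutations' losses.

    alpha is a hypothetical weight; the subtraction rewards outputs that are
    far from every wrong assignment, not only close to the right one.
    """
    losses = upit_losses(estimates, targets)
    best = min(losses, key=losses.get)
    others = sum(v for perm, v in losses.items() if perm != best)
    return losses[best] - alpha * others, best
```

With two sources there are only two permutations, so the discriminative term reduces to a single subtracted loss. Setting `alpha=0` recovers the plain uPIT objective in this sketch.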

What carries the argument

Deep embedding features from the DC network serving as input to uPIT, with joint training and a discriminative objective.

If this is right

  • The separation process becomes a single optimizable pipeline focused on actual source recovery.
  • Embedding features provide better speaker discrimination than standard uPIT inputs.
  • Joint training avoids the two-step complexity of traditional DC pipelines.
  • Discriminative fine-tuning increases separation quality by pushing the losses of the non-selected permutations away from that of the selected one.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This suggests embedding spaces learned for clustering can be repurposed to guide permutation-invariant training in other audio tasks.
  • Future work could test if the joint model scales to three or more overlapping speakers without additional changes.
  • The method implies that maximizing inter-permutation distances is a general way to improve multi-source separation models.

Load-bearing premise

The deep embedding features from the DC network contain enough distinct information about each source to allow uPIT to discriminate and separate the target speakers effectively.
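One way to make "distinct information" concrete is the affinity objective from the original deep clustering work (Hershey et al.), which compares pairwise embedding similarities with pairwise source co-membership of time-frequency bins. The sketch below is a toy pure-Python version under that assumption; `gram` and `dc_loss` are hypothetical helper names, and the DC loss used in this paper may differ in details such as normalization or weighting.

```python
def gram(rows):
    """Return the N x N matrix of pairwise dot products between rows."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]


def dc_loss(embeddings, assignments):
    """Squared Frobenius norm of (V V^T - Y Y^T).

    embeddings:  N rows, one D-dimensional embedding per time-frequency bin (V).
    assignments: N rows, one-hot vector of the dominant source per bin (Y).
    The loss is zero when bins of the same source share an embedding direction
    and bins of different sources are orthogonal.
    """
    vv = gram(embeddings)
    yy = gram(assignments)
    n = len(vv)
    return sum((vv[i][j] - yy[i][j]) ** 2 for i in range(n) for j in range(n))


# Three bins: the first two belong to source 0, the third to source 1.
labels = [[1, 0], [1, 0], [0, 1]]
well_separated = [[0, 1], [0, 1], [1, 0]]   # same grouping, axes swapped
confused = [[1, 0], [0, 1], [0, 1]]         # bin 1 grouped with the wrong source
```

Because the loss depends only on pairwise affinities, `well_separated` scores zero even though its axes are swapped relative to `labels`. That invariance is what lets the embeddings be learned without fixing a speaker order.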

What would settle it

Running the proposed joint model on the WSJ0-2mix dataset and finding no improvement in standard metrics like SDR compared to baseline DC and uPIT implementations would falsify the performance claim.
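For a sense of what such a comparison measures, the sketch below computes a simplified signal-to-distortion ratio and its improvement over the unprocessed mixture. This is an assumption-laden toy: the standard BSS-eval SDR (Vincent et al.) allows for a distortion filter and decomposes the error further, whereas `sdr_db` and `sdr_improvement` here are hypothetical helpers that treat every deviation from the reference as distortion.

```python
import math


def sdr_db(reference, estimate):
    """Simplified SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2)."""
    signal = sum(r * r for r in reference)
    if signal == 0:
        raise ValueError("reference has zero energy")
    distortion = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if distortion == 0:
        return math.inf  # perfect reconstruction
    return 10 * math.log10(signal / distortion)


def sdr_improvement(reference, estimate, mixture):
    """SDR of the separated estimate minus SDR of the raw mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)
```

Under this simplified definition, the claim would be undermined if the joint model's average improvement on the test set were no higher than that of the DC and uPIT baselines evaluated the same way.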

Figures

Figures reproduced from arXiv: 1907.09884 by Bin Liu, Cunhang Fan, Jiangyan Yi, Jianhua Tao, Zhengqi Wen.

Figure 1: Schematic diagram of our proposed DL-DEF speech separation system. DC loss is the loss of deep clustering.

3.2. Speech separation model based on deep embedding features. Different from deep clustering [12] utilizing the K-means clustering to acquire hard masks, we use the embedding vectors as the input of uPIT to directly learn each source’s soft masks. Therefore, the DC and uPIT can be trained end-to-end.…
read the original abstract

Deep clustering (DC) and utterance-level permutation invariant training (uPIT) have been demonstrated promising for speaker-independent speech separation. DC is usually formulated as two-step processes: embedding learning and embedding clustering, which results in complex separation pipelines and a huge obstacle in directly optimizing the actual separation objectives. As for uPIT, it only minimizes the chosen permutation with the lowest mean square error, doesn't discriminate it with other permutations. In this paper, we propose a discriminative learning method for speaker-independent speech separation using deep embedding features. Firstly, a DC network is trained to extract deep embedding features, which contain each source's information and have an advantage in discriminating each target speakers. Then these features are used as the input for uPIT to directly separate the different sources. Finally, uPIT and DC are jointly trained, which directly optimizes the actual separation objectives. Moreover, in order to maximize the distance of each permutation, the discriminative learning is applied to fine tuning the whole model. Our experiments are conducted on WSJ0-2mix dataset. Experimental results show that the proposed models achieve better performances than DC and uPIT for speaker-independent speech separation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes a hybrid approach for speaker-independent monaural speech separation: a DC network is first trained to produce deep embedding features, which are then fed as input to uPIT; the two are jointly trained to optimize the separation objective directly, followed by discriminative fine-tuning that enlarges distances between permutations. Experiments are reported on the WSJ0-2mix corpus, with the claim that the resulting models outperform standalone DC and uPIT.

Significance. If the reported gains hold under rigorous evaluation, the method would usefully combine the representational advantages of learned embeddings with direct optimization of the separation loss and explicit discrimination among permutations, addressing two acknowledged limitations of the baseline techniques.

minor comments (3)
  1. Abstract: the sentence 'have an advantage in discriminating each target speakers' contains a grammatical error ('speakers' should be 'speaker').
  2. Abstract: the claim of 'better performances' is stated without any numerical values, error bars, or reference to the specific metrics (e.g., SDR, PESQ) or tables that appear later in the paper; a brief quantitative summary would improve readability.
  3. The manuscript should clarify whether the joint-training stage and the subsequent discriminative fine-tuning are performed sequentially or with a combined loss, and whether any hyper-parameters control the relative weighting of the two terms.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the accurate summary of our work and the recommendation of minor revision. No major comments were listed in the report.

Circularity Check

0 steps flagged

No significant circularity; empirical performance claim on held-out data

full rationale

The paper proposes a joint DC+uPIT architecture with discriminative fine-tuning and reports experimental results on WSJ0-2mix. No derivation chain exists that reduces by construction to fitted parameters, self-citations, or renamed inputs. The embedding features are explicitly defined as DC outputs fed into uPIT; joint training directly optimizes the separation objective; performance gains are measured on separate test data. This is a standard empirical ML contribution with no load-bearing self-definitional or fitted-input steps.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on one domain assumption about embedding quality and on neural-network parameters fitted to the WSJ0-2mix training set; no new physical entities are introduced.

free parameters (1)
  • neural network parameters
    All weights in the DC and uPIT networks are fitted to minimize the joint separation loss on the training portion of WSJ0-2mix.
axioms (1)
  • domain assumption: Deep embedding features extracted by the DC network contain each source's information and discriminate target speakers.
    This premise is invoked to justify using the embeddings as input to uPIT.



Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 2 internal anchors

  1. [1]

    It is a very challenging task, which is known as the cocktail party problem [1]

    Introduction Monaural speech separation aims to estimate target sources from mixed signals in a single-channel. It is a very challenging task, which is known as the cocktail party problem [1]. In order to solve the cocktail party problem, many works have been done over the decades. Traditional speech separation methods include computational auditory s...

  2. [2]

    DANet creates attractor points in a high-dimensional embedding space of the acoustic signals

    method is proposed. DANet creates attractor points in a high-dimensional embedding space of the acoustic signals. Then the similarities between the embedded points and each attractor are used to directly estimate a soft separation mask at the training stage. Unfortunately, it enables end-to-end training while still requiring K-means at the testing stage...

  3. [3]

    Discriminative Learning for Monaural Speech Separation Using Deep Embedding Features

    is applied to speech separation, which uses a multi-task learning architecture to combine the DC and uPIT. However, it simply employs the DC and uPIT as two outputs rather than fusion with each other. In this paper, in order to address the problems of DC and uPIT, we propose a discriminative learning method for speaker-independent speech separation with ...

  4. [4]

    $y(t) = \sum_{s=1}^{S} x_s(t)$ (1), where $y(t)$ is the mixed speech, $S$ is the number of source signals and $x_s(t)$, $s = 1, \ldots, S$ are target sources

    Single Channel Speech Separation The object of single channel speech separation is to separate target sources from a mixed signal. $y(t) = \sum_{s=1}^{S} x_s(t)$ (1), where $y(t)$ is the mixed speech, $S$ is the number of source signals and $x_s(t)$, $s = 1, \ldots, S$ are target sources. The corresponding short-time Fourier transformation (STFT) of those signals are $Y(t,f)$ and Xs...

  5. [5]

    From DC network [12] we can know that clusters in the deep embedding space can represent the inferred spectral masking patterns of individual sources

    The Proposed Speech Separation System In this section, we present our proposed discriminative learning method for speaker-independent speech separation with deep embedding features, which is shown in Figure 1. From DC network [12] we can know that clusters in the deep embedding space can represent the inferred spectral masking patterns of individual sou...

  6. [6]

    Dataset Our experiments are conducted on the WSJ0-2mix dataset [12], which is derived from WSJ corpus [20]

    Experiments and Results 4.1. Dataset Our experiments are conducted on the WSJ0-2mix dataset [12], which is derived from WSJ corpus [20]. The WSJ0-2mix dataset consists three sets: training set (20,000 utterances about 30 hours), validation set (5,000 utterances about 10 hours) and test set (3,000 utterances about 5 hours). Specifically, training and valida...

  7. [7]

    As for the separated network, it has only one BLSTM layer with 896 units

    A tanh activation function is followed by the embedding layer. As for the separated network, it has only one BLSTM layer with 896 units. Therefore, there are three BLSTM layers in this work, which keeps the network configuration the same as baseline in [15]. A Rectified Linear Unit (ReLU) activation function is followed by the uPIT network, which is the mask...

  8. [8]

    We firstly train a DC network to extract deep embedding features

    Conclusions In this paper, we propose a speaker-independent speech separation method with discriminative learning based on deep embedding features. We firstly train a DC network to extract deep embedding features. Then these features are used as the input of uPIT system to directly separate the different speaker sources. Moreover, uPIT and DC are join...

  9. [9]

    Authors also thank Shuai Nie for his helpful comments on this work

    Acknowledgements This work is supported by the National Key Research & Development Plan of China (No.2017YFB1002802), the NSFC (No.61425017, No.61831022, No.61771472, No.61603390), the Strategic Priority Research Program of Chinese Academy of Sciences (No.XDC02050100), and Inria-CAS Joint Research Project (No.173211KYSB20170061). Authors also thank Shua...

  10. [10]

    Attentional selection in a cocktail party environment can be decoded from single-trial EEG,

    J. A. O’Sullivan, A. J. Power, N. Mesgarani, S. Rajaram, J. J. Foxe, B. G. Shinn-Cunningham, M. Slaney, S. A. Shamma, and E. C. Lalor, “Attentional selection in a cocktail party environment can be decoded from single-trial EEG,” Cerebral Cortex, vol. 25, no. 7, p. 1697, 2015

  11. [11]

    Computational auditory scene analysis: Principles, algorithms, and applications

    D. Wang and G. J. Brown, Computational auditory scene analysis: Principles, algorithms, and applications. Wiley-IEEE Press, 2006

  12. [12]

    Single-channel speech separation using sparse non-negative matrix factorization,

    M. N. Schmidt and R. K. Olsson, “Single-channel speech separation using sparse non-negative matrix factorization,” in INTERSPEECH 2006 - ICSLP, Ninth International Conference on Spoken Language Processing, Pittsburgh, PA, USA, September 2006

  13. [13]

    Speech enhancement using a minimum mean-square error log-spectral amplitude estimator,

    Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error log-spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 2, pp. 443–445, 1985

  14. [14]

    Monaural speech separation and recognition challenge,

    M. Cooke, J. R. Hershey, and S. J. Rennie, “Monaural speech separation and recognition challenge,” Computer Speech & Language, vol. 24, no. 1, pp. 1–15, 2010

  15. [15]

    Investigations on data augmentation and loss functions for deep learning based speech-background separation,

    H. Erdogan and T. Yoshioka, “Investigations on data augmentation and loss functions for deep learning based speech-background separation,” Proc. Interspeech 2018, pp. 3499–3503, 2018

  16. [16]

    Deep extractor network for target speaker recovery from single channel speech mixtures,

    J. Wang, J. Chen, D. Su, L. Chen, M. Yu, Y. Qian, and D. Yu, “Deep extractor network for target speaker recovery from single channel speech mixtures,” Proc. Interspeech 2018, pp. 307–311, 2018

  17. [17]

    Tasnet: time-domain audio separation network for real-time, single-channel speech separation,

    Y. Luo and N. Mesgarani, “Tasnet: time-domain audio separation network for real-time, single-channel speech separation,” in Proc. ICASSP. IEEE, 2018, pp. 696–700

  18. [18]

    Single channel speech separation with constrained utterance level permutation invariant training using grid lstm,

    C. Xu, W. Rao, X. Xiao, E. S. Chng, and H. Li, “Single channel speech separation with constrained utterance level permutation invariant training using grid lstm,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 6–10

  19. [19]

    Utterance-level permutation invariant training with discriminative learning for single channel speech separation,

    C. Fan, B. Liu, J. Tao, Z. Wen, J. Yi, and Y. Bai, “Utterance-level permutation invariant training with discriminative learning for single channel speech separation,” in Proc. ISCSLP. IEEE, 2018

  20. [20]

    A pitch-aware approach to single-channel speech separation,

    K. Wang, F. Song, and X. Lei, “A pitch-aware approach to single-channel speech separation,” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019

  21. [21]

    Deep clustering: Discriminative embeddings for segmentation and separation,

    J. R. Hershey, Z. Chen, J. L. Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2016, pp. 31–35

  22. [22]

    Deep attractor network for single-microphone speaker separation,

    Z. Chen, Y. Luo, and N. Mesgarani, “Deep attractor network for single-microphone speaker separation,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 246–250

  23. [23]

    Permutation invariant training of deep models for speaker-independent multi-talker speech separation,

    D. Yu, M. Kolbæk, Z. H. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2017, pp. 241–245

  24. [24]

    Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,

    M. Kolbæk, D. Yu, Z. Tan, and J. Jensen, “Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 25, no. 10, pp. 1901–1913, 2017

  25. [25]

    Alternative objective functions for deep clustering,

    Z.-Q. Wang, J. Le Roux, and J. R. Hershey, “Alternative objective functions for deep clustering,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 686–690

  26. [26]

    Deep clustering and conventional networks for music separation: Stronger together,

    Y. Luo, Z. Chen, J. R. Hershey, J. Le Roux, and N. Mesgarani, “Deep clustering and conventional networks for music separation: Stronger together,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 61–65

  27. [27]

    On training targets for supervised speech separation,

    Y. Wang, A. Narayanan, and D. L. Wang, “On training targets for supervised speech separation,” IEEE/ACM Trans Audio Speech Lang Process, vol. 22, no. 12, pp. 1849–1858, 2014

  28. [28]

    Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks,

    H. Erdogan, J. R. Hershey, S. Watanabe, and J. L. Roux, “Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2015, pp. 708–712

  29. [29]

    CSR-I (WSJ0) complete,

    J. Garofalo, D. Graff, D. Paul, and D. Pallett, “CSR-I (WSJ0) complete,” Linguistic Data Consortium, Philadelphia, 2007

  30. [30]

    Tensorflow: Large-scale machine learning on heterogeneous distributed systems,

    M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, and M. Devin, “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” 2016

  31. [31]

    Adam: A method for stochastic optimization,

    D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” Computer Science, 2014

  32. [32]

    Performance measurement in blind audio source separation,

    E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006

  33. [33]

    Perceptual evaluation of speech quality (PESQ) the new ITU standard for end-to-end speech quality assessment part I–time-delay compensation,

    A. W. Rix, M. P. Hollier, A. P. Hekstra, and J. G. Beerends, “Perceptual evaluation of speech quality (PESQ) the new ITU standard for end-to-end speech quality assessment part I–time-delay compensation,” Journal of the Audio Engineering Society, vol. 50, no. 10, pp. 755–764, 2002

  34. [34]

    A shifted delta coefficient objective for monaural speech separation using multi-task learning,

    C. Xu, W. Rao, E. S. Chng, and H. Li, “A shifted delta coefficient objective for monaural speech separation using multi-task learning,” in Proceedings of Interspeech, 2018, pp. 3479–3483

  35. [35]

    Single-Channel Multi-Speaker Separation using Deep Clustering

    Y. Isik, J. L. Roux, Z. Chen, S. Watanabe, and J. R. Hershey, “Single-channel multi-speaker separation using deep clustering,” arXiv preprint arXiv:1607.02173, 2016