arxiv: 2604.03219 · v1 · submitted 2026-04-03 · 📡 eess.AS · cs.SD

Unmixing the Crowd: Learning Mixture-to-Set Speaker Embeddings for Enrollment-Free Target Speech Extraction

FNU Sidharth, Meysam Asgari, Hao-Wen Dong, Dhruv Jain

Pith reviewed 2026-05-13 17:55 UTC · model grok-4.3

classification 📡 eess.AS cs.SD

keywords: speaker embedding · target speech extraction · enrollment-free · mixture processing · permutation invariant · LibriMix · DNS Challenge

The pith

A model learns to predict speaker embeddings directly from noisy mixtures, enabling target speech extraction without any enrollment recording.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes training a network that takes a noisy multi-speaker mixture as input and outputs a small set of candidate speaker embeddings. These embeddings are supervised to align with those produced by a strong single-speaker embedding model using permutation-invariant loss. The resulting embeddings form a structured space where identities cluster meaningfully and outperform clustering baselines. When used to condition speech extraction networks, they improve separation quality on both simulated and real data. This setup removes the traditional requirement for a clean enrollment utterance of the target speaker.

Core claim

The model maps a noisy mixture directly to a small set of candidate speaker embeddings trained to align with a strong single-speaker speaker-embedding space via permutation-invariant teacher supervision. On noisy LibriMix, the resulting embeddings form a structured and clusterable identity space, outperforming WavLM+K-means and separation-derived embeddings in standard clustering metrics. Conditioning these embeddings into multiple extraction back-ends consistently improves objective quality and intelligibility, and generalizes to real DNS-Challenge recordings.

What carries the argument

Mixture-to-set speaker embedding predictor supervised by permutation-invariant alignment to a pretrained single-speaker embedding model.
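
The mechanism above can be written as a single objective. This is a sketch under assumptions rather than the paper's stated formula: $\hat{e}_1,\dots,\hat{e}_K$ denote the predicted embeddings, $e_1,\dots,e_K$ the teacher embeddings for the speakers in the mixture, and $S_K$ the set of permutations of $\{1,\dots,K\}$. These symbols are introduced here for illustration, and cosine distance is assumed as the per-pair cost, following the passage quoted in the Lean section of this page.

```latex
\mathcal{L}_{\mathrm{PI}}
  = \min_{\pi \in S_K} \sum_{k=1}^{K}
    \left( 1 - \frac{\hat{e}_{\pi(k)}^{\top} e_k}
                    {\lVert \hat{e}_{\pi(k)} \rVert \, \lVert e_k \rVert} \right)
```

Taking the minimum over $\pi$ makes the loss indifferent to the order in which the network emits its embeddings, which is what allows an unordered set to be supervised without a fixed speaker-to-output assignment.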

If this is right

The predicted embeddings create a clusterable identity space that exceeds WavLM with k-means in clustering metrics.
Using the embeddings to condition extraction networks raises objective quality and intelligibility scores.
The method works on simulated noisy LibriMix data and carries over to real-world DNS-Challenge recordings.
Multiple different extraction back-ends benefit from the same set of mixture-derived embeddings.
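
To make "clustering metrics" concrete: this page does not list the metrics the paper reports, although the simulated rebuttal further down mentions ARI, NMI, and Silhouette Score. The following is a minimal pure-Python sketch of the adjusted Rand index (ARI), which scores how well cluster assignments agree with true speaker labels regardless of how the clusters are numbered. The function name and the toy labels are illustrative and not taken from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index between two labelings of the same items."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(true_labels), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree on

    # Count item pairs that share a contingency cell, a true class, a cluster.
    pairs_cells = sum(comb(c, 2)
                      for c in Counter(zip(true_labels, pred_labels)).values())
    pairs_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())

    expected = pairs_true * pairs_pred / total_pairs  # chance-level agreement
    max_index = 0.5 * (pairs_true + pairs_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single group
    return (pairs_cells - expected) / (max_index - expected)


# Toy example: six embeddings from three hypothetical speakers.
speakers = ["A", "A", "B", "B", "C", "C"]
perfect_clusters = [2, 2, 0, 0, 1, 1]    # same grouping, different cluster ids
imperfect_clusters = [0, 0, 1, 2, 2, 2]  # one "B" embedding lands with "C"

print(adjusted_rand_index(speakers, perfect_clusters))    # prints 1.0
print(adjusted_rand_index(speakers, imperfect_clusters))
```

An ARI of 1.0 means the clusters reproduce the speaker grouping exactly, while values near 0 indicate agreement no better than chance.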

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such a system could enable hands-free target extraction in live settings like conferences where asking for enrollment clips is impossible.
If the alignment quality holds across domains, it may reduce dependence on clean single-speaker data for training speaker-aware audio systems.
The structured embedding space suggests potential for unsupervised speaker diarization directly from mixtures without separate embedding extraction steps.

Load-bearing premise

The speaker embeddings predicted solely from the mixture will align closely enough with clean single-speaker embeddings to function as effective control signals for extraction.

What would settle it

The claim would fail if running the extraction back-end with the predicted embeddings produced separation metrics no better than the unconditioned version, or worse than those obtained with randomly chosen embeddings from the same space.

Figures

Figures reproduced from arXiv: 2604.03219 by Dhruv Jain, FNU Sidharth, Hao-Wen Dong, Meysam Asgari.

**Figure 1.** Teacher–student framework for mixture-derived multi-speaker embeddings. The teacher defines a single-speaker identity space; the student predicts an unordered set of embeddings from the mixture and is trained via permutation-invariant distillation to stay aligned to the same manifold, encouraging head-wise speaker disentanglement. … a 3-mode distribution spanning [−5, 25] dB (same scheme for train/test), us…

**Figure 2.** (a) TSE sensitivity to embedding interpolation/drift (DPCCN) and (b) clustering degradation under separation artifacts. … embedding-conditioned enrollment-free systems consistently improve background and overall quality (e.g., DPCCN: ∆BAK = +1.25, ∆OVRL = +0.21) with a small average speech-quality drop (∆SIG = −0.14), reflecting the usual suppression–distortion trade-off. We also include the official DNS b…

Abstract

Personalized or target speech extraction (TSE) typically needs a clean enrollment -- hard to obtain in real-world crowded environments. We remove the essential need for enrollment by predicting, from the mixture itself, a small set of per-speaker embeddings that serve as the control signal for extraction. Our model maps a noisy mixture directly to a small set of candidate speaker embeddings trained to align with a strong single-speaker speaker-embedding space via permutation-invariant teacher supervision. On noisy LibriMix, the resulting embeddings form a structured and clusterable identity space, outperforming WavLM+K-means and separation-derived embeddings in standard clustering metrics. Conditioning these embeddings into multiple extraction back-ends consistently improves objective quality and intelligibility, and generalizes to real DNS-Challenge recordings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper's core move is predicting a small set of speaker embeddings directly from the mixture via permutation-invariant alignment to a fixed teacher space, which removes the enrollment step and shows measurable gains on LibriMix and DNS data.

The new piece is the mixture-to-set prediction trained with permutation-invariant teacher supervision so the output embeddings land in the same space as a strong single-speaker model. That framing is not how prior enrollment-free TSE work was described.

On the positive side, the clustering metrics beat WavLM plus K-means and separation-derived embeddings, and feeding the predicted embeddings into standard extraction back-ends produces consistent lifts in quality and intelligibility on both simulated LibriMix and real DNS-Challenge recordings. The supervision signal appears to be doing real work rather than just copying the teacher. The central assumption, that mixture-only predictions will align well enough to act as drop-in controls, holds up in the reported numbers, with no obvious circularity in the pipeline.

The soft spots are modest. The abstract leaves the exact set-size handling and conditioning details implicit, so the full paper needs to show how performance changes when the number of speakers varies or when the mixture gets very crowded. Error bars and a few more ablations on the teacher model choice would also help readers judge stability. Nothing here looks load-bearing or contradictory.

This is aimed at speech-separation groups that already run TSE pipelines and want to test enrollment-free variants on public benchmarks. It is grounded enough and the experiments are on standard data, so it deserves a serious referee even if revisions will be needed on the robustness sections.

Referee Report

2 major / 2 minor

Summary. The paper proposes a mixture-to-set neural model that predicts a small set of speaker embeddings directly from a noisy input mixture. These embeddings are trained with a permutation-invariant loss to align with embeddings from a fixed pre-trained single-speaker model, allowing them to serve as control signals for downstream target speech extraction (TSE) without any enrollment utterance. On LibriMix the predicted embeddings yield better clustering metrics than WavLM+K-means or separation-derived baselines; when conditioned into multiple TSE back-ends they produce consistent gains in objective quality and intelligibility, and the approach generalizes to real DNS-Challenge recordings.

Significance. If the alignment between mixture-derived and single-speaker embeddings holds at scale, the method removes a major practical barrier to personalized TSE in crowded environments. The permutation-invariant supervision strategy is a clean way to obtain set-level supervision without explicit speaker assignment, and the reported improvements across clustering and extraction metrics on both simulated and real data suggest the approach could be broadly useful once the magnitude and statistical reliability of the gains are fully documented.

major comments (2)

[Abstract and Results] The claim of outperforming WavLM+K-means and separation-derived embeddings in clustering metrics is central to validating the mixture-to-set alignment, yet no numerical values, standard deviations, or comparison tables are supplied. Without these data it is impossible to judge whether the reported improvements are large enough to support the downstream extraction gains.
[Methodology] The manuscript does not detail how the predicted set size is chosen or how the permutation-invariant loss is exactly formulated (e.g., the precise matching criterion between predicted and teacher embeddings). These choices are load-bearing for the central claim that the embeddings can be used as drop-in control signals.

minor comments (2)

[Abstract] The abstract mentions generalization to DNS-Challenge recordings but does not specify which objective metrics were used or whether any domain-adaptation steps were applied; a brief clarification would improve reproducibility.
[Methods] Notation for the set of predicted embeddings (e.g., how the variable set cardinality is handled in the network output) should be defined explicitly in the first methods subsection.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and recommendation for minor revision. We address each major comment below and will update the manuscript to incorporate the requested details and numerical results.

Referee: [Abstract and Results] The claim of outperforming WavLM+K-means and separation-derived embeddings in clustering metrics is central to validating the mixture-to-set alignment, yet no numerical values, standard deviations, or comparison tables are supplied. Without these data it is impossible to judge whether the reported improvements are large enough to support the downstream extraction gains.

Authors: We agree that specific numerical values, standard deviations, and comparison tables are essential to substantiate the central claims. In the revised manuscript we will add a dedicated table in the Results section that reports clustering metrics (ARI, NMI, and Silhouette Score) for the proposed mixture-to-set embeddings versus WavLM+K-means and separation-derived baselines. The table will include mean values and standard deviations computed across multiple random seeds, allowing readers to assess the magnitude of the improvements and their relation to the downstream TSE gains.

revision: yes

Referee: [Methodology] The manuscript does not detail how the predicted set size is chosen or how the permutation-invariant loss is exactly formulated (e.g., the precise matching criterion between predicted and teacher embeddings). These choices are load-bearing for the central claim that the embeddings can be used as drop-in control signals.

Authors: We appreciate the referee pointing out this lack of detail. The set size is fixed at 3 to accommodate the maximum number of speakers present in LibriMix mixtures. The permutation-invariant loss is implemented via the Hungarian algorithm, which finds the optimal bipartite matching that minimizes the sum of cosine distances between the predicted set and the teacher embeddings extracted from the pre-trained single-speaker model. We will expand the Methodology section with a new subsection that provides the exact mathematical formulation, the rationale for the set-size choice, and pseudocode for the matching procedure.

revision: yes
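
The matching described in this response can be sketched in a few lines. The version below is an illustration under assumptions, not the authors' code: it searches all permutations exhaustively, which for a small set (for example the 3 slots mentioned in the response) returns the same optimum the Hungarian algorithm would, and it works on plain Python lists rather than tensors. The function names `cosine_distance` and `permutation_invariant_loss` are hypothetical.

```python
import math
from itertools import permutations


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def permutation_invariant_loss(predicted, teacher):
    """Minimum total cosine distance over all one-to-one assignments.

    predicted, teacher: lists holding the same number of embedding vectors.
    Returns (loss, assignment), where assignment[k] is the index of the
    predicted embedding matched to teacher[k].
    """
    if len(predicted) != len(teacher):
        raise ValueError("this sketch assumes equally sized sets")
    best_loss, best_perm = float("inf"), None
    # Exhaustive search: K! candidates, acceptable only for small K.
    for perm in permutations(range(len(predicted))):
        loss = sum(cosine_distance(predicted[p], teacher[k])
                   for k, p in enumerate(perm))
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example: the predicted set is the teacher set, reordered and rescaled.
teacher = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
predicted = [[0.0, 0.0, 2.0], [3.0, 0.0, 0.0], [0.0, 0.5, 0.0]]

loss, assignment = permutation_invariant_loss(predicted, teacher)
print(loss, assignment)  # prints 0.0 (1, 2, 0)
```

In a training loop the same matching would typically be computed per mixture and the loss backpropagated through the matched pairs only. For larger sets, an assignment solver such as `scipy.optimize.linear_sum_assignment` avoids the factorial cost of the exhaustive search.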

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper trains a mixture-to-set embedding model using permutation-invariant supervision from an external pre-trained single-speaker embedding space (teacher). This supervision signal is independent of the model's own predictions and is not derived from the target outputs. Evaluations rely on standard public datasets (LibriMix, DNS-Challenge) and metrics without reducing any claimed prediction to a fitted input or self-citation chain. No self-definitional, ansatz-smuggling, or uniqueness-imported steps appear in the abstract or described pipeline; the central claim rests on empirical alignment rather than definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are detailed. The method relies on an existing single-speaker embedding space and standard datasets.

axioms (1)

domain assumption: Permutation-invariant teacher supervision can align mixture-derived embeddings with a pre-trained single-speaker embedding space.
Invoked to train the model as described in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Our model maps a noisy mixture directly to a small set of candidate speaker embeddings trained to align with a strong single-speaker speaker-embedding space via permutation-invariant teacher supervision.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · unclear

Permutation-invariant supervision: ... minimize the best assignment between predicted embeddings and teacher embeddings ... using cosine distance.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 1 internal anchor

[1]

embedding-first extraction

Introduction. Despite recent advances in deep learning, current methods still struggle to efficiently isolate a target speaker of interest in crowded acoustic environments. This target speech extraction (TSE) problem [1] is central for hearable devices such as true-wireless buds, hearing aids and cochlear implants. Modern TSE typically conditions an enhan...

[2]

Unmixing the Crowd: Learning Mixture-to-Set Speaker Embeddings for Enrollment-Free Target Speech Extraction

System / Method. We introduce the task setup and architecture for predicting mixture-derived multi-speaker embeddings and conditioning downstream enhancement/separation models. The mixture is mapped to a small unordered set of candidate speaker embeddings in a pretrained speaker space, enabling selection-based extraction without enrollment. 2.1. Task Def...

[3]

Training and Testing Data. We use the Libri2Mix and Libri3Mix pipelines [17] built from LibriSpeech train-clean-360 [18], modified to better match conversational mixtures

Experimental Setup. 3.1. Training and Testing Data. We use the Libri2Mix and Libri3Mix pipelines [17] built from LibriSpeech train-clean-360 [18], modified to better match conversational mixtures. The proposed mixture-to-set encoder is not inherently tied to a fixed speaker count and could be extended with a dynamic-head when variable cardinality is required....

[4]

separate-then-embed

Results. In this section, we evaluate (i) the quality and structure of the proposed mixture-derived embeddings and (ii) their value for conditioning target speech extraction (TSE). We report clustering metrics and ablations (teacher choice, partial WavLM fine-tuning), then assess downstream TSE on LibriMix and DNS Challenge, along with analyses of embed...

[5]

We realize this with a teacher-aligned, permutation-invariant mixture-to-set embedding encoder that predicts one embedding per active speaker directly from noisy mixtures

Conclusion. We propose an embedding-first view of TSE: the mixture itself proposes a small set of candidate speaker identities, and extraction reduces to selecting among these candidates rather than requiring enrollment audio. We realize this with a teacher-aligned, permutation-invariant mixture-to-set embedding encoder that predicts one embedding per...

[6]

All scientific content and results are the authors’ own and were verified by the authors

Generative AI Use Disclosure. Generative AI tools were used only for proofreading and improving writing clarity. All scientific content and results are the authors’ own and were verified by the authors

[7]

Speakerbeam: Speaker aware neural network for target speaker extraction in speech mixtures,

K. Žmolíková, M. Delcroix, K. Kinoshita, T. Ochiai, T. Nakatani, L. Burget, and J. Černocký, “Speakerbeam: Speaker aware neural network for target speaker extraction in speech mixtures,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 4, pp. 800–814, 2019

[8]

Neural target speech extraction: An overview,

K. Zmolikova, M. Delcroix, T. Ochiai, K. Kinoshita, J. Černocký, and D. Yu, “Neural target speech extraction: An overview,” IEEE Signal Processing Magazine, vol. 40, no. 3, pp. 8–29, 2023

[9]

Look once to hear: Target speech hearing with noisy examples,

B. Veluri, M. Itani, T. Chen, T. Yoshioka, and S. Gollakota, “Look once to hear: Target speech hearing with noisy examples,” in Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, ser. CHI ’24. New York, NY, USA: Association for Computing Machinery, 2024. [Online]. Available: https://doi.org/10.1145/3613904.3642057

[10]

Target speaker extraction through comparing noisy positive and negative audio enrollments,

S. Xu, Y. Yang, N. Trigoni, and A. Markham, “Target speaker extraction through comparing noisy positive and negative audio enrollments,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Online]. Available: https://openreview.net/forum?id=awRfy4xAO5

[11]

Recursive attentive pooling for extracting speaker embeddings from multi-speaker recordings,

S. Horiguchi, A. Ando, T. Moriya, T. Ashihara, H. Sato, N. Tawara, and M. Delcroix, “Recursive attentive pooling for extracting speaker embeddings from multi-speaker recordings,” in 2024 IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 1201–1208

[12]

Geodesic interpolation of frame-wise speaker embeddings for the diarization of meeting scenarios,

T. Cord-Landwehr, C. Boeddeker, C. Zorilă, R. Doddipatla, and R. Haeb-Umbach, “Geodesic interpolation of frame-wise speaker embeddings for the diarization of meeting scenarios,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 11 886–11 890

[13]

Neural speaker diarization using memory-aware multi-speaker embedding with sequence-to-sequence architecture,

G. Yang, M. He, S. Niu, R. Wang, Y. Yue, S. Qian, S. Wu, J. Du, and C.-H. Lee, “Neural speaker diarization using memory-aware multi-speaker embedding with sequence-to-sequence architecture,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 11 626–11 630

[14]

End-to-end neural speaker diarization with self-attention,

Y. Fujita, N. Kanda, S. Horiguchi, Y. Xue, K. Nagamatsu, and S. Watanabe, “End-to-end neural speaker diarization with self-attention,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 296–303

[15]

End-to-end speaker diarization for an unknown number of speakers with encoder-decoder based attractors,

S. Horiguchi, Y. Fujita, S. Watanabe, Y. Xue, and K. Nagamatsu, “End-to-end speaker diarization for an unknown number of speakers with encoder-decoder based attractors,” in Proc. Interspeech 2020, 2020, pp. 269–273

[16]

Ansd-ma-mse: Adaptive neural speaker diarization using memory-aware multi-speaker embedding,

M.-K. He, J. Du, Q.-F. Liu, and C.-H. Lee, “Ansd-ma-mse: Adaptive neural speaker diarization using memory-aware multi-speaker embedding,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 1561–1573, 2023

[17]

Guided speaker embedding,

S. Horiguchi, T. Moriya, A. Ando, T. Ashihara, H. Sato, N. Tawara, and M. Delcroix, “Guided speaker embedding,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

[18]

Wavlm: Large-scale self-supervised pre-training for full stack speech processing,

S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

[19]

Attentive statistics pooling for deep speaker embedding,

K. Okabe, T. Koshinaka, and K. Shinoda, “Attentive statistics pooling for deep speaker embedding,” in Proc. Interspeech 2018, 2018, pp. 2252–2256

[20]

Diarization-guided multi-speaker embeddings,

J. Kalda, T. Alumäe, H. Bredin et al., “Diarization-guided multi-speaker embeddings,” in Proc. Interspeech 2025, 2025, pp. 5233–5237

[21]

Arcface: Additive angular margin loss for deep face recognition,

J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 4690–4699

[22]

Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification,

B. Desplanques, J. Thienpondt, and K. Demuynck, “Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification,” in Proc. Interspeech 2020, 2020, pp. 3830–3834

[23]

Librimix: An open-source dataset for generalizable speech separation,

J. Cosentino, M. Pariente, S. Cornell, A. Deleforge, and E. Vincent, “Librimix: An open-source dataset for generalizable speech separation,” 2020

[24]

Librispeech: An asr corpus based on public domain audio books,

V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An asr corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210

[25]

Dccrn: Deep complex convolution recurrent network for phase-aware speech enhancement,

Y. Hu, Y. Liu, S. Lv, M. Xing, S. Zhang, Y. Fu, J. Wu, B. Zhang, and L. Xie, “Dccrn: Deep complex convolution recurrent network for phase-aware speech enhancement,” in Proc. Interspeech 2020, 2020, pp. 2472–2476

[26]

Spex+: A complete time domain speaker extraction network,

M. Ge, C. Xu, L. Wang, E. S. Chng, J. Dang, and H. Li, “Spex+: A complete time domain speaker extraction network,” in Proc. Interspeech 2020, 2020, pp. 1406–1410

[27]

Dpccn: Densely-connected pyramid complex convolutional network for robust speech separation and extraction,

J. Han, Y. Long, L. Burget, and J. Černocký, “Dpccn: Densely-connected pyramid complex convolutional network for robust speech separation and extraction,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 7292–7296

[28]

Wesep: A scalable and flexible toolkit towards generalizable target speaker extraction,

S. Wang, K. Zhang, S. Lin, J. Li, X. Wang, M. Ge, J. Yu, Y. Qian, and H. Li, “Wesep: A scalable and flexible toolkit towards generalizable target speaker extraction,” in Proc. Interspeech 2024, 2024, pp. 4273–4277

[29]

Film: Visual reasoning with a general conditioning layer,

E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville, “Film: Visual reasoning with a general conditioning layer,” in Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018

[30]

One-shot conditional audio filtering of arbitrary sounds,

B. Gfeller, D. Roblek, and M. Tagliasacchi, “One-shot conditional audio filtering of arbitrary sounds,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 501–505

[31]

Neural speech extraction with human feedback,

M. Itani, A. Graves, S. E. Eskimez, and S. Gollakota, “Neural speech extraction with human feedback,” arXiv preprint arXiv:2508.03041, 2025

[32]

The hungarian method for the assignment problem,

H. W. Kuhn, “The hungarian method for the assignment problem,” Naval Research Logistics Quarterly, vol. 2, no. 1–2, pp. 83–97, 1955

[33]

U-vectors: Generating clusterable speaker embedding from unlabeled data,

M. F. Mridha, A. Q. Ohi, M. M. Monowar, M. A. Hamid, M. R. Islam, and Y. Watanobe, “U-vectors: Generating clusterable speaker embedding from unlabeled data,” Applied Sciences, vol. 11, no. 21, p. 10079, 2021

[34]

Self-supervised speaker verification employing a novel clustering algorithm,

A. Fathan and J. Alam, “Self-supervised speaker verification employing a novel clustering algorithm,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 12 597–12 601

[35]

An iterative framework for self-supervised deep speaker representation learning,

D. Cai, W. Wang, and M. Li, “An iterative framework for self-supervised deep speaker representation learning,” in ICASSP 2021-2021 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2021, pp. 6728–6732

[36]

Revealing emotional clusters in speaker embeddings: A contrastive learning strategy for speech emotion recognition,

I. R. Ulgen, Z. Du, C. Busso, and B. Sisman, “Revealing emotional clusters in speaker embeddings: A contrastive learning strategy for speech emotion recognition,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 12 081–12 085

[37]

Sdr – half-baked or well done?

J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “Sdr – half-baked or well done?” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 626–630

[38]

Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs,

A. Rix, J. Beerends, M. Hollier, and A. Hekstra, “Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs,” in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), vol. 2, 2001, pp. 749–752 vol.2

[39]

A short-time objective intelligibility measure for time-frequency weighted noisy speech,

C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “A short-time objective intelligibility measure for time-frequency weighted noisy speech,” in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 4214–4217

[40]

Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,

C. K. Reddy, V. Gopal, and R. Cutler, “Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6493–6497

[41]

Personalized speech enhancement: New models and comprehensive evaluation,

S. E. Eskimez, T. Yoshioka, H. Wang, X. Wang, Z. Chen, and X. Huang, “Personalized speech enhancement: New models and comprehensive evaluation,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 356–360

[42]

Icassp 2023 deep noise suppression challenge,

H. Dubey, A. Aazami, V. Gopal, B. Naderi, S. Braun, R. Cutler, A. Ju, M. Zohourian, M. Tang, M. Golestaneh et al., “Icassp 2023 deep noise suppression challenge,” IEEE Open Journal of Signal Processing, vol. 5, pp. 725–737, 2024

[43]

Dnsmos p.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,

C. K. Reddy, V. Gopal, and R. Cutler, “Dnsmos p.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in ICASSP 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022

[44]

Conv-tasnet: Surpassing ideal time–frequency magnitude masking for speech separation,

Y. Luo and N. Mesgarani, “Conv-tasnet: Surpassing ideal time–frequency magnitude masking for speech separation,” IEEE/ACM transactions on audio, speech, and language processing, vol. 27, no. 8, pp. 1256–1266, 2019
