EvoTSE: Evolving Enrollment for Target Speaker Extraction
Pith reviewed 2026-05-10 17:59 UTC · model grok-4.3
The pith
Continuously updating enrollment with reliability-filtered high-confidence estimates improves target speaker extraction without extra labeled data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EvoTSE is an evolving TSE framework in which the enrollment is continuously updated through reliability-filtered retrieval over high-confidence historical estimates. This mechanism reduces speaker confusion and relaxes the quality requirements for pre-recorded enrollment without relying on additional annotated data. Experiments across multiple benchmarks demonstrate that EvoTSE achieves consistent improvements, especially when evaluated on out-of-domain (OOD) scenarios.
What carries the argument
The reliability-filtered retrieval mechanism that continuously updates the enrollment from high-confidence historical estimates.
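A minimal sketch of how such a loop might look, under stated assumptions: each estimate is represented by a speaker embedding (a plain list of floats), reliability is a scalar confidence compared against a threshold, and retrieval returns the stored embedding most similar to a query by cosine similarity. The names `EnrollmentBank`, `min_confidence`, and `retrieve` are hypothetical, and the paper's actual criteria and update rule are not given in this summary.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


@dataclass
class EnrollmentBank:
    """Hypothetical store of high-confidence historical estimates."""
    initial: list                 # embedding of the pre-recorded enrollment
    min_confidence: float = 0.8   # assumed reliability threshold
    history: list = field(default_factory=list)

    def add(self, embedding, confidence):
        """Keep an estimate only if it passes the reliability filter."""
        if confidence >= self.min_confidence:
            self.history.append(embedding)
            return True
        return False

    def retrieve(self, query):
        """Return the stored embedding closest to the query.

        Falls back to the initial enrollment when nothing has been stored.
        """
        if not self.history:
            return self.initial
        return max(self.history, key=lambda e: cosine(e, query))


bank = EnrollmentBank(initial=[1.0, 0.0])
bank.add([0.9, 0.1], confidence=0.95)   # passes the filter
bank.add([0.0, 1.0], confidence=0.30)   # rejected by the filter
enrollment = bank.retrieve([1.0, 0.05])
```

Whether the retrieved item replaces the enrollment or is combined with it is a design choice this sketch leaves open.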
If this is right
- Reduces speaker confusion during extraction from mixtures.
- Delivers consistent performance gains across benchmarks.
- Shows especially large gains in out-of-domain conditions.
- Relaxes dependence on high-quality initial enrollments.
- Requires no extra annotated data for the adaptation process.
Where Pith is reading between the lines
- The same self-updating loop could be tested on related tasks such as speaker diarization or speech enhancement where models face similar confusion risks.
- Adding an explicit uncertainty score to the reliability filter might further limit error propagation when initial estimates are marginal.
- Applying the method to long-form audio recordings would test whether repeated updates remain stable over extended time scales.
- Comparing the approach against other unsupervised adaptation techniques on the same OOD splits would clarify whether reliability filtering is the decisive ingredient.
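The uncertainty-score idea in the second bullet can be illustrated with a confidence-weighted running average of the enrollment embedding. This is a sketch of the speculation above, not a mechanism described in the paper; `update_enrollment`, `base_rate`, and the confidence values are hypothetical.

```python
def update_enrollment(current, estimate, confidence, base_rate=0.2):
    """Blend a new estimate into the enrollment embedding.

    The step size is base_rate * confidence, so a marginal estimate
    (confidence near 0) barely moves the enrollment.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    step = base_rate * confidence
    return [(1.0 - step) * c + step * e for c, e in zip(current, estimate)]


enrollment = [1.0, 0.0]
# A fully confident estimate shifts the enrollment by the full base rate.
confident = update_enrollment(enrollment, [0.0, 1.0], confidence=1.0)
# A marginal estimate leaves it almost unchanged.
marginal = update_enrollment(enrollment, [0.0, 1.0], confidence=0.05)
```

Scaling the step by confidence bounds how far a single wrong estimate can pull the enrollment, which is one way to slow error accumulation over long sessions.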
Load-bearing premise
The reliability filter consistently selects correct estimates so that updates reduce rather than increase speaker confusion or accumulate errors.
What would settle it
Running EvoTSE on a dataset of highly similar speakers where the update step produces lower extraction accuracy than the static baseline would show the filter is selecting incorrect estimates.
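Such a test reduces to a paired comparison on the same utterances. The sketch below assumes per-utterance quality scores where higher is better (SI-SDR in dB would be one option) are already available for both systems; `compare_to_static` and the numbers are made up for illustration.

```python
from statistics import mean


def compare_to_static(evolving_scores, static_scores):
    """Paired comparison of per-utterance scores (higher is better)."""
    if len(evolving_scores) != len(static_scores):
        raise ValueError("scores must be paired per utterance")
    deltas = [e - s for e, s in zip(evolving_scores, static_scores)]
    return {
        "mean_delta": mean(deltas),
        "fraction_worse": sum(d < 0 for d in deltas) / len(deltas),
    }


# Made-up scores for four utterances with highly similar speakers.
result = compare_to_static([10.0, 4.0, 12.0, 3.0], [11.0, 9.0, 12.5, 8.0])
# A negative mean delta means the evolving update hurt on this set.
filter_suspect = result["mean_delta"] < 0
```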
Original abstract
Target Speaker Extraction (TSE) aims to isolate a specific speaker's voice from a mixture, guided by a pre-recorded enrollment. While TSE bypasses the global permutation ambiguity of blind source separation, it remains vulnerable to speaker confusion, where models mistakenly extract the interfering speaker. Furthermore, conventional TSE relies on a static inference pipeline, where performance is limited by the quality of the fixed enrollment. To overcome these limitations, we propose EvoTSE, an evolving TSE framework in which the enrollment is continuously updated through reliability-filtered retrieval over high-confidence historical estimates. This mechanism reduces speaker confusion and relaxes the quality requirements for pre-recorded enrollment without relying on additional annotated data. Experiments across multiple benchmarks demonstrate that EvoTSE achieves consistent improvements, especially when evaluated on out-of-domain (OOD) scenarios. Our code and checkpoints are available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes EvoTSE, a framework for target speaker extraction (TSE) in which the enrollment embedding is continuously updated via reliability-filtered retrieval of high-confidence historical estimates. This is intended to reduce speaker confusion, relax the quality requirements on the initial enrollment, and improve robustness especially in out-of-domain (OOD) conditions, all without additional annotated data. Experiments on multiple benchmarks are reported to show consistent gains, with larger benefits in OOD settings.
Significance. If the reliability filter reliably selects correct estimates, the approach would address a practical limitation of static-enrollment TSE by enabling online adaptation. The public release of code and checkpoints strengthens the contribution by supporting reproducibility. However, the significance is currently limited by the absence of direct validation that the filter selects acoustically correct extractions rather than high-confidence errors, which is required to substantiate the claimed OOD gains and error-reduction mechanism.
major comments (3)
- [Abstract and §3] Abstract and §3 (method description): The central claim that reliability-filtered retrieval of high-confidence estimates reduces speaker confusion and improves OOD performance rests on the unverified assumption that the filter selects correct extractions; no direct measurement (e.g., precision of selected estimates against ground-truth speaker identity) is provided, so it is unclear whether reported gains originate from the filter or from other pipeline components.
- [§4] §4 (experiments): No ablation studies isolate the contribution of the reliability filter versus the retrieval mechanism or the base TSE model; without these, the load-bearing role of the evolving-enrollment component cannot be assessed, particularly for the OOD improvements highlighted in the abstract.
- [§3] §3: The criteria used to define 'high-confidence' estimates and the exact retrieval procedure are described only at a high level; concrete implementation details (thresholds, similarity metrics, update rule) are needed to evaluate whether the mechanism can avoid confirmation bias and error accumulation.
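The ablations asked for in the §4 comment could be organized as a small run grid. The sketch below is one hypothetical layout; the switch names and domain labels are assumptions, not the paper's configuration.

```python
from itertools import product

# Hypothetical evaluation conditions.
DOMAINS = ("in_domain", "out_of_domain")


def ablation_grid():
    """Enumerate the meaningful switch combinations for each domain."""
    runs = []
    for evolving, use_filter in product((False, True), repeat=2):
        if use_filter and not evolving:
            continue  # nothing to filter when the enrollment never changes
        for domain in DOMAINS:
            runs.append({
                "evolving_enrollment": evolving,
                "reliability_filter": use_filter,
                "domain": domain,
            })
    return runs


runs = ablation_grid()
```

The three remaining switch settings are static enrollment, evolving enrollment without the filter, and evolving enrollment with the filter, each evaluated on both domains.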
minor comments (2)
- [Abstract] The abstract states that code and checkpoints are available; this is a positive feature that should be referenced more explicitly in the experimental section to aid readers.
- [§3] Notation for the enrollment update and reliability score could be introduced earlier and used consistently to improve readability of the method description.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which help clarify how to strengthen the presentation of EvoTSE. We address each major comment below and will revise the manuscript to incorporate the suggested additions and clarifications.
Point-by-point responses
- Referee: [Abstract and §3] Abstract and §3 (method description): The central claim that reliability-filtered retrieval of high-confidence estimates reduces speaker confusion and improves OOD performance rests on the unverified assumption that the filter selects correct extractions; no direct measurement (e.g., precision of selected estimates against ground-truth speaker identity) is provided, so it is unclear whether reported gains originate from the filter or from other pipeline components.
  Authors: We agree that a direct measurement of the filter's selection accuracy would provide stronger evidence for the claimed mechanism. In the revised manuscript we will add a quantitative analysis (new table or figure in §4) that reports the precision of the reliability-filtered estimates against ground-truth speaker identities on the evaluation sets. This will allow readers to verify that the filter predominantly selects correct extractions rather than high-confidence errors and will directly link the observed OOD gains to the evolving-enrollment component.
  Revision: yes
- Referee: [§4] §4 (experiments): No ablation studies isolate the contribution of the reliability filter versus the retrieval mechanism or the base TSE model; without these, the load-bearing role of the evolving-enrollment component cannot be assessed, particularly for the OOD improvements highlighted in the abstract.
  Authors: We acknowledge that the current experiments do not fully isolate the individual contributions. We will add a set of ablation studies in the revised §4 that (i) disable the reliability filter while keeping retrieval, (ii) replace the evolving enrollment with static enrollment, and (iii) compare against the base TSE model alone. These ablations will be reported on both in-domain and OOD conditions to quantify the specific benefit of the reliability-filtered evolving enrollment.
  Revision: yes
- Referee: [§3] §3: The criteria used to define 'high-confidence' estimates and the exact retrieval procedure are described only at a high level; concrete implementation details (thresholds, similarity metrics, update rule) are needed to evaluate whether the mechanism can avoid confirmation bias and error accumulation.
  Authors: We will expand §3 with the missing implementation details: the exact threshold(s) used to classify an estimate as high-confidence, the similarity metric and retrieval procedure (including how historical estimates are stored and queried), and the precise update rule for the enrollment embedding. These additions will enable readers to assess the risk of confirmation bias and error accumulation and will be accompanied by a short discussion of safeguards already present in the design.
  Revision: yes
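The filter-precision analysis promised in the first response could be computed along the following lines. This is a sketch under assumptions: each historical estimate carries a scalar confidence and a ground-truth flag saying whether the extraction matches the target speaker, which is available on simulated benchmarks. The function name and the sample values are hypothetical.

```python
def filter_precision(estimates, threshold):
    """Precision of the reliability filter at a given threshold.

    `estimates` is a list of (confidence, is_target) pairs, where
    is_target says whether the extraction matches the ground-truth
    target speaker. Returns None when the filter accepts nothing.
    """
    accepted = [is_target for confidence, is_target in estimates
                if confidence >= threshold]
    if not accepted:
        return None
    return sum(accepted) / len(accepted)


# Made-up estimates: the third is a high-confidence error.
estimates = [(0.95, True), (0.90, True), (0.85, False),
             (0.40, False), (0.30, True)]
precision = filter_precision(estimates, threshold=0.8)
```

Sweeping `threshold` and reporting precision next to the number of accepted estimates would show how strict the filter must be before high-confidence errors stop entering the history.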
Circularity Check
No circularity: algorithmic framework with experimental claims, not a derivation reducing to inputs
The paper presents EvoTSE as an evolving enrollment framework that updates via reliability-filtered retrieval of high-confidence estimates. No equations, first-principles derivations, or predictions are described that reduce by construction to fitted parameters or self-definitions. The central mechanism (reliability filter) is an explicit algorithmic choice whose correctness is left to empirical validation on benchmarks, including OOD scenarios. No self-citations are invoked as load-bearing uniqueness theorems, no ansatzes are smuggled, and no known results are renamed as novel organization. The approach is self-contained as a proposed pipeline whose gains are asserted via experiments rather than tautology.
Axiom & Free-Parameter Ledger
invented entities (1)
- reliability-filtered retrieval over high-confidence historical estimates: no independent evidence
Reference graph
Works this paper leans on
- [1] EvoTSE: Evolving Enrollment for Target Speaker Extraction (arXiv, 2026). Introduction. Target speaker extraction aims to isolate a desired voice from multi-talker mixtures using a reference enrollment. Despite recent progress, practical deployment is fundamentally limited by two challenges. First, speaker confusion remains a critical failure mode, where models incorrectly track interfering speakers that exhibit similar ...
- [2] Related Work. Target Speaker Extraction: Current TSE research follows two approaches: embedding-based and embedding-free frameworks. The former utilizes a speaker encoder to extract identity-discriminative embeddings, either through pre-trained speaker verification models like TEA-PSE family [20, 21, 22] or through jointly trained encoders as seen in X-...
- [3] Problem Formulation. 3.1. Static TSE. The objective of TSE is to isolate the target signal $s(t)$ from a multi-talker mixture $x(t)$, guided by a reference enrollment $r(t)$ of the target speaker. A generalized acoustic mixture in a reverberant environment is formulated as $x(t) = s(t) * h_s(t) + \sum_i n_i(t) * h_{n,i}(t) + \sum_j v_j(t) * h_{v,j}(t)$ (1), where $h(t)$ denotes ...
- [4] Proposed Method. 4.1. Framework Overview. The EvoTSE framework redefines TSE as a retrieval-augmented task. EvoTSE transforms the conventional static mapping into an evolving, evidence-accumulating system. As illustrated in Fig. 2a, for each incoming mixture segment $x_n$ in a long-duration session, EvoTSE operates through a closed-loop feedback pipeline ...
- [5] Experimental Setup. 5.1. Datasets. Following the configuration of the backbone models, we use the WSJ0-2mix dataset [42] for fundamental training and evaluation. It consists of three subsets: the training set with 20,000 utterances from 101 speakers, the development set with 5,000 utterances from 101 speakers, and the test set with 3,000 utterances fr...
- [6] Experimental Results. 6.1. Main Results. Table 1 compares the performance of our proposed method with two baselines. USEF-TFGridNet (Standard) refers to the conventional inference pipeline. USEF-TFGridNet (Static) uses a grouped inference setup but keeps the enrollment fixed. EvoTSE represents our proposed grouped inference method with evolving updates. Mo...
- [7] Conclusions. This paper presents EvoTSE, a framework that transitions TSE from static mapping to an evolving inference pipeline by evolving the enrollment representation. Our approach effectively mitigates speaker confusion, particularly in complex OOD scenarios, and significantly reduces dependency on the quality of the initial enrollment audio. ...
- [8] Generative AI Use Disclosure. During the preparation of this manuscript, the authors utilized large language models (LLMs) solely for grammatical verification and language refinement. Specifically, these tools were employed to ensure grammatical consistency, improve sentence fluency, and enhance overall readability. The authors emphasize that LLMs played ...
- [9] Z. Zhao, D. Yang, G. Rongzhi, H. Zhang, and Y. Zou, “Target confusion in end-to-end speaker extraction: Analysis and approaches,” in Proc. INTERSPEECH 2022, 09 2022, pp. 5333–5337.
- [10] K. Liu, Z. Du, X. Wan, and H. Zhou, “X-sepformer: End-to-end speaker extraction network with explicit optimization on speaker confusion,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 06 2023, pp. 1–5.
- [11] Y. Zhang, L. Yao, and Q. Yang, “Or-tse: An overlap-robust speaker encoder for target speech extraction,” in Proc. INTERSPEECH 2024, 09 2024, pp. 587–591.
- [12] Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J. Hershey, R. Saurous, R. Weiss, Y. Jia, and I. Moreno, “Voicefilter: Targeted voice separation by speaker-conditioned spectrogram masking,” in Proc. INTERSPEECH 2019, 09 2019, pp. 2728–2732.
- [13] J. Wang, J. Chen, D. Su, L. Chen, M. Yu, Y. Qian, and D. Yu, “Deep extractor network for target speaker recovery from single channel speech mixtures,” in Proc. INTERSPEECH 2018, 09 2018, pp. 307–311.
- [14] K. Žmolíková, M. Delcroix, K. Kinoshita, T. Ochiai, T. Nakatani, L. Burget, and J. Cernocky, “Speakerbeam: Speaker aware neural network for target speaker extraction in speech mixtures,” IEEE Journal of Selected Topics in Signal Processing, vol. PP, pp. 1–1, 06 2019.
- [15] S. He, H. Li, and X. Zhang, “Speakerfilter: Deep learning-based target speaker extraction using anchor speech,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 376–380.
- [16] B. Zeng and M. Li, “Usef-tse: Universal speaker embedding free target speaker extraction,” IEEE Transactions on Audio, Speech and Language Processing, vol. 33, pp. 2110–2124, 2025.
- [17] F. Hao, X. Li, and C. Zheng, “X-tf-gridnet: A time–frequency domain target speaker extraction network with adaptive speaker embedding fusion,” Information Fusion, vol. 112, p. 102550, 2024.
- [18] S. Xu, Y. Yang, N. Trigoni, and A. Markham, “Target speaker extraction through comparing noisy positive and negative audio enrollments,” 2025. [Online]. Available: https://arxiv.org/abs/2502.16611
- [19] J. Li, K. Zhang, S. Wang, H. Li, M.-W. Mak, and K. A. Lee, “On the effectiveness of enrollment speech augmentation for target speaker extraction,” in 2024 IEEE Spoken Language Technology Workshop (SLT), 12 2024, pp. 325–332.
- [20] B. Veluri, M. Itani, T. Chen, T. Yoshioka, and S. Gollakota, “Look once to hear: Target speech hearing with noisy examples,” Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 2024.
- [21] M. Ghane and M. S. Safari, “End-to-end target speaker speech recognition using context-aware attention mechanisms for challenging enrollment scenario,” IEEE Signal Processing Letters, vol. 32, pp. 1940–1944, 2025.
- [22] M. Delcroix, K. Zmolikova, T. Ochiai, K. Kinoshita, and T. Nakatani, “Speaker activity driven neural speech extraction,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6099–6103.
- [23] Y. Zhang, Z. Li, B. Liu, H. Fan, Y. Yang, and Q. Yang, “A region based non-overlapping reference speech estimation method for speaker extraction,” in MultiMedia Modeling. Springer Nature Switzerland, 2024, pp. 437–447.
- [24] D. Chengyun, S. Ma, Y. Sha, Y. Zhang, H. Zhang, H. Song, and F. Wang, “Robust speaker extraction network based on iterative refined adaptation,” in Proc. INTERSPEECH 2021, 08 2021, pp. 3530–3534.
- [25] M. Ge, C. Xu, L. Wang, E. Chng, J. Dang, and H. Li, “Multi-stage speaker extraction with utterance and frame-level reference signals,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 06 2021, pp. 6109–6113.
- [26] L. Yang, W. Liu, L. Tan, J. Yang, and H.-G. Moon, “Target speaker extraction with ultra-short reference speech by ve-ve framework,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
- [27] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, “Retrieval-augmented generation for knowledge-intensive nlp tasks,” in Advances in Neural Information Processing Systems, Red Hook, NY, USA, 2020.
- [28] Y. Ju, W. Rao, X. Yan, Y. Fu, S. Lv, L. Cheng, Y. Wang, L. Xie, and S. Shang, “Tea-pse: Tencent-ethereal-audio-lab personalized speech enhancement system for icassp 2022 dns challenge,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 9291–9295.
- [29] Y. Ju, S. Zhang, W. Rao, Y. Wang, T. Yu, L. Xie, and S. Shang, “Tea-pse 2.0: Sub-band network for real-time personalized speech enhancement,” in 2022 IEEE Spoken Language Technology Workshop (SLT), 2023, pp. 472–479.
- [30] Y. Ju, J. Chen, S. Zhang, S. He, W. Rao, W. Zhu, Y. Wang, T. Yu, and S. Shang, “Tea-pse 3.0: Tencent-ethereal-audio-lab personalized speech enhancement system for icassp 2023 dns-challenge,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–2.
- [31] C. Xu, W. Rao, E. Chng, and H. Li, “Spex: Multi-scale time domain speaker extraction network,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. PP, pp. 1–1, 04 2020.
- [32] M. Ge, C. Xu, L. Wang, E. Chng, J. Dang, and H. Li, “Spex+: A complete time domain speaker extraction network,” in Proc. INTERSPEECH 2020, 10 2020, pp. 1406–1410.
- [33] K. Zhang, J. Li, S. Wang, Y. Wei, Y. Wang, Y. Wang, and H. Li, “Multi-level speaker representation for target speaker extraction,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 04 2025, pp. 1–5.
- [34] S. He, H. Zhang, W. Rao, K. Zhang, Y. Ju, Y. Yang, and X. Zhang, “Hierarchical speaker representation for target speaker extraction,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 10361–10365.
- [35] S. He, W. Xue, Y. Yang, H. Zhang, J. Pan, and X. Zhang, “Enhancing target speaker extraction with hierarchical speaker representation learning,” Neural Networks, vol. 188, 8 2025. [Online]. Available: https://doi.org/10.1016/j.neunet.2025.107388
- [36] Z.-Q. Wang, S. Cornell, S. Choi, Y. Lee, B.-Y. Kim, and S. Watanabe, “Tf-gridnet: Integrating full- and sub-band modeling for speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. PP, pp. 1–15, 01 2023.
- [37] Z. Pan, R. Tao, C. Xu, and H. Li, “Muse: Multi-modal target speaker extraction with visual cues,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6678–6682.
- [38] J. Li, K. Zhang, S. Wang, K. A. Lee, M.-W. Mak, and H. Li, “Momuse: Momentum multi-modal target speaker extraction for real-time scenarios with impaired visual cues,” in Proc. IEEE International Conference on Multimedia and Expo (ICME), 06 2025, pp. 1–6.
- [39] C. Boeddeker, A. Subramanian, G. Wichern, R. Haeb-Umbach, and J. Le Roux, “Ts-sep: Joint diarization and separation conditioned on estimated speaker embeddings,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. PP, pp. 1–13, 01 2024.
- [40] Y. Chen, S. Ji, H. Wang, Z. Wang, S. Chen, J. He, J. Xu, and Z. Zhao, “Wavrag: Audio-integrated retrieval augmented generation for spoken dialogue models,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 12...
- [41] S. Ghosh, S. Kumar, C. Evuru, R. Duraiswami, and D. Manocha, “Recap: Retrieval-augmented audio captioning,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 04 2024, pp. 1161–1165.
- [42] Y. Yuan, H. Liu, X. Liu, Q. Huang, M. D. Plumbley, and W. Wang, “Retrieval-augmented text-to-audio generation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 581–585.
- [43] M. Delcroix, K. Kinoshita, T. Ochiai, K. Žmolíková, H. Sato, and T. Nakatani, “Listen only to me! how well can target speech extraction handle false alarms?” in Proc. INTERSPEECH 2022, 09 2022, pp. 216–220.
- [44] J. Wang, S. Wang, J. Li, K. Zhang, Y. Qian, and H. Li, “Enhancing speaker extraction through rectifying target confusion,” in 2024 IEEE Spoken Language Technology Workshop (SLT), 12 2024, pp. 349–356.
- [45] Z. You, Z. Zhou, L. Li, and D. Wang, “Spkaugtse: A simple and efficient approach to address target confusion in end-to-end speaker extraction,” in Proc. APSIPA ASC, 2025, pp. 583–588.
- [46] Z. Chen, T. Yoshioka, L. Lu, T. Zhou, Z. Meng, Y. Luo, J. Wu, X. Xiao, and J. Li, “Continuous speech separation: Dataset and analysis,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7284–7288.
- [47] B. Desplanques, J. Thienpondt, and K. Demuynck, “Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification,” in Proc. INTERSPEECH 2020, 10 2020.
- [48] H. Wang, C. Liang, S. Wang, Z. Chen, B. Zhang, X. Xiang, Y. Deng, and Y. Qian, “Wespeaker: A research and production oriented speaker embedding learning toolkit,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
- [49] Z. Ma, Z. Zheng, J. Ye, J. Li, Z. Gao, S. Zhang, and X. Chen, “emotion2vec: Self-supervised pre-training for speech emotion representation,” in Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 2024, pp. 15747–15760.
- [50] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 31–35.
- [51] J. Garofolo, D. Graff, D. Paul, and D. Pallett, “Csr-i (wsj0) complete ldc93s6a,” Web Download, Philadelphia, 1993, LDC93S6A.
- [52] J. Cosentino, M. Pariente, S. Cornell, A. Deleforge, and E. Vincent, “Librimix: An open-source dataset for generalizable speech separation,” arXiv: Audio and Speech Processing, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:218862876
- [53] K. Zhou, B. Sisman, R. Liu, and H. Li, “Emotional voice conversion: Theory, databases and esd,” Speech Communication, vol. 137, pp. 1–18, 2022.
- [54] K. Zhou, B. Sisman, R. Liu, and H. Li, “Seen and unseen emotional style transfer for voice conversion with a new emotional speech dataset,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 920–924.
- [55] Z. Zhang, B. He, and Z. Zhang, “X-tasnet: Robust and accurate time-domain speaker extraction network,” in Proc. INTERSPEECH 2020, 10 2020, pp. 1421–1425.