EvoTSE: Evolving Enrollment for Target Speaker Extraction

Lei Xie; Longshuai Xiao; Shuai Wang; Xingchen Li; Yike Zhu; Zikai Liu; Ziqian Wang

arxiv: 2604.06810 · v2 · submitted 2026-04-08 · 📡 eess.AS

EvoTSE: Evolving Enrollment for Target Speaker Extraction

Zikai Liu , Ziqian Wang , Xingchen Li , Yike Zhu , Shuai Wang , Longshuai Xiao , Lei Xie This is my paper

Pith reviewed 2026-05-10 17:59 UTC · model grok-4.3

classification 📡 eess.AS

keywords target speaker extractionevolving enrollmentreliability filterspeaker confusionout-of-domainspeech separationself-adaptation

0 comments

The pith

Continuously updating enrollment with reliability-filtered high-confidence estimates improves target speaker extraction without extra labeled data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EvoTSE, a framework that evolves the enrollment for target speaker extraction by continuously retrieving and incorporating high-confidence historical estimates selected by a reliability filter. This addresses speaker confusion in mixtures and the limits of static fixed enrollments that plague conventional TSE. The approach operates without additional annotated data and yields consistent gains, especially on out-of-domain test conditions. A sympathetic reader cares because real-world voice isolation often starts with imperfect enrollments and encounters unseen acoustics, so an online self-correction loop could make extraction more practical.

Core claim

EvoTSE is an evolving TSE framework in which the enrollment is continuously updated through reliability-filtered retrieval over high-confidence historical estimates. This mechanism reduces speaker confusion and relaxes the quality requirements for pre-recorded enrollment without relying on additional annotated data. Experiments across multiple benchmarks demonstrate that EvoTSE achieves consistent improvements, especially when evaluated on out-of-domain (OOD) scenarios.

What carries the argument

The reliability-filtered retrieval mechanism that continuously updates the enrollment from high-confidence historical estimates.

If this is right

- Reduces speaker confusion during extraction from mixtures.
- Delivers consistent performance gains across benchmarks.
- Shows especially large gains in out-of-domain conditions.
- Relaxes dependence on high-quality initial enrollments.
- Requires no extra annotated data for the adaptation process.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

- The same self-updating loop could be tested on related tasks such as speaker diarization or speech enhancement where models face similar confusion risks.
- Adding an explicit uncertainty score to the reliability filter might further limit error propagation when initial estimates are marginal.
- Applying the method to long-form audio recordings would test whether repeated updates remain stable over extended time scales.
- Comparing the approach against other unsupervised adaptation techniques on the same OOD splits would clarify whether reliability filtering is the decisive ingredient.

Load-bearing premise

The reliability filter consistently selects correct estimates so that updates reduce rather than increase speaker confusion or accumulate errors.

What would settle it

Running EvoTSE on a dataset of highly similar speakers where the update step produces lower extraction accuracy than the static baseline would show the filter is selecting incorrect estimates.

Figures

Figures reproduced from arXiv: 2604.06810 by Lei Xie, Longshuai Xiao, Shuai Wang, Xingchen Li, Yike Zhu, Zikai Liu, Ziqian Wang.

**Figure 1.** Figure 1: System architecture: (a) Static enrollment. (b) Evolving enrollment with a right-side memory bank for state updates. 3. Problem Formulation 3.1. Static TSE The objective of TSE is to isolate the target signal s(t) from a multi-talker mixture x(t), guided by a reference enrollment r(t) of the target speaker. A generalized acoustic mixture in a reverberant environment is formulated as: x(t) = s(t) ∗ hs(t) +… view at source ↗

**Figure 2.** Figure 2: (a) Overview: The overall architecture where speaker extraction is enhanced by historical cues. (b) Retrieval: The process of querying the memory bank using mixture embeddings to construct an acoustically matched enrollment. (c) Evolution: The reliabilitygated evolution, where new estimates are validated via a threshold τ and integrated into the memory bank only segments with a high degree of identity con… view at source ↗

**Figure 3.** Figure 3: Conceptual illustration of speaker identity evolution on the manifold. tively enriches the memory bank’s diversity. 4.6. Training Strategy: Artifact-aware Learning We propose an Artifact-aware Learning strategy, which is divided into two progressive stages and shifts from static feature mapping to evolving identity alignment. In the initial stage, the TSE extractor is trained in a conventional static pip… view at source ↗

**Figure 4.** Figure 4: Performance metrics as a function of similarity threshold τ on ESD-test (k = 3, |M|max = 64). Dashed lines represent USEF-TFGridnet results. Red triangle lines and blue square lines distinguish models trained on WSJ+ESD and WSJ, respectively. 1 3 12 24 48 64 10 15 10.09 10.73 11.63 11.34 9.78 9.09 15.14 16.23 16.63 16.67 16.36 16.32 SISDRi(dB) (a) SISDRi across k 1 3 12 24 48 64 0 5 10 15 9.6 8.1 5.9 6.4 1… view at source ↗

**Figure 5.** Figure 5: Performance metrics as a function of retrieval quantity k on ESD-test (τ = 0.5, |M|max = 64). Red triangle lines indicate models trained on WSJ + ESD, while blue square lines represent models trained on WSJ only [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

read the original abstract

Target Speaker Extraction (TSE) aims to isolate a specific speaker's voice from a mixture, guided by a pre-recorded enrollment. While TSE bypasses the global permutation ambiguity of blind source separation, it remains vulnerable to speaker confusion, where models mistakenly extract the interfering speaker. Furthermore, conventional TSE relies on static inference pipeline, where performance is limited by the quality of the fixed enrollment. To overcome these limitations, we propose EvoTSE, an evolving TSE framework in which the enrollment is continuously updated through reliability-filtered retrieval over high-confidence historical estimates. This mechanism reduces speaker confusion and relaxes the quality requirements for pre-recorded enrollment without relying on additional annotated data. Experiments across multiple benchmarks demonstrate that EvoTSE achieves consistent improvements, especially when evaluated on out-of-domain (OOD) scenarios. Our code and checkpoints are available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

EvoTSE's evolving enrollment via reliability-filtered retrieval is a reasonable practical tweak for TSE robustness but rests on an unverified assumption that the filter picks correct estimates rather than confident errors.

read the letter

The main point to know is that EvoTSE proposes updating the enrollment continuously in target speaker extraction by pulling in reliability-filtered high-confidence estimates from history, which they say cuts speaker confusion and helps with weak starting enrollments especially out of domain. This evolving mechanism is the novel element here, as it moves away from the usual fixed enrollment in TSE without requiring extra labeled data. The paper reports improvements on several benchmarks, stronger in OOD tests, and makes code and checkpoints public for others to use. That setup does a decent job addressing the practical problem of enrollment quality limiting performance in real audio mixtures. Where it gets shaky is the reliability filter's role. The gains depend on it selecting estimates that are actually correct rather than just confident but wrong ones that might come from similar interfering speakers. Without checks on how often the filter gets it right compared to ground truth or tests showing error doesn't accumulate, the OOD benefits could be overstated or come from other parts of the system. The abstract doesn't provide those details, so the central assumption remains unverified as noted in the stress test. For readers, this is relevant to people building TSE systems for applications like hearing aids or meeting transcription where conditions vary and initial recordings might not be ideal. Someone in the field could get value from the idea and the released resources. I would send it for peer review to get proper evaluation of the filter and experiments, as the idea has potential but needs that validation to be convincing.

Referee Report

3 major / 2 minor

Summary. The paper proposes EvoTSE, a framework for target speaker extraction (TSE) in which the enrollment embedding is continuously updated via reliability-filtered retrieval of high-confidence historical estimates. This is intended to reduce speaker confusion, relax the quality requirements on the initial enrollment, and improve robustness especially in out-of-domain (OOD) conditions, all without additional annotated data. Experiments on multiple benchmarks are reported to show consistent gains, with larger benefits in OOD settings.

Significance. If the reliability filter reliably selects correct estimates, the approach would address a practical limitation of static-enrollment TSE by enabling online adaptation. The public release of code and checkpoints strengthens the contribution by supporting reproducibility. However, the significance is currently limited by the absence of direct validation that the filter selects acoustically correct extractions rather than high-confidence errors, which is required to substantiate the claimed OOD gains and error-reduction mechanism.

major comments (3)

[Abstract and §3] Abstract and §3 (method description): The central claim that reliability-filtered retrieval of high-confidence estimates reduces speaker confusion and improves OOD performance rests on the unverified assumption that the filter selects correct extractions; no direct measurement (e.g., precision of selected estimates against ground-truth speaker identity) is provided, so it is unclear whether reported gains originate from the filter or from other pipeline components.
[§4] §4 (experiments): No ablation studies isolate the contribution of the reliability filter versus the retrieval mechanism or the base TSE model; without these, the load-bearing role of the evolving-enrollment component cannot be assessed, particularly for the OOD improvements highlighted in the abstract.
[§3] §3: The criteria used to define 'high-confidence' estimates and the exact retrieval procedure are described only at a high level; concrete implementation details (thresholds, similarity metrics, update rule) are needed to evaluate whether the mechanism can avoid confirmation bias and error accumulation.

minor comments (2)

[Abstract] The abstract states that code and checkpoints are available; this is a positive feature that should be referenced more explicitly in the experimental section to aid readers.
[§3] Notation for the enrollment update and reliability score could be introduced earlier and used consistently to improve readability of the method description.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify how to strengthen the presentation of EvoTSE. We address each major comment below and will revise the manuscript to incorporate the suggested additions and clarifications.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3 (method description): The central claim that reliability-filtered retrieval of high-confidence estimates reduces speaker confusion and improves OOD performance rests on the unverified assumption that the filter selects correct extractions; no direct measurement (e.g., precision of selected estimates against ground-truth speaker identity) is provided, so it is unclear whether reported gains originate from the filter or from other pipeline components.

Authors: We agree that a direct measurement of the filter's selection accuracy would provide stronger evidence for the claimed mechanism. In the revised manuscript we will add a quantitative analysis (new table or figure in §4) that reports the precision of the reliability-filtered estimates against ground-truth speaker identities on the evaluation sets. This will allow readers to verify that the filter predominantly selects correct extractions rather than high-confidence errors and will directly link the observed OOD gains to the evolving-enrollment component. revision: yes
Referee: [§4] §4 (experiments): No ablation studies isolate the contribution of the reliability filter versus the retrieval mechanism or the base TSE model; without these, the load-bearing role of the evolving-enrollment component cannot be assessed, particularly for the OOD improvements highlighted in the abstract.

Authors: We acknowledge that the current experiments do not fully isolate the individual contributions. We will add a set of ablation studies in the revised §4 that (i) disable the reliability filter while keeping retrieval, (ii) replace the evolving enrollment with static enrollment, and (iii) compare against the base TSE model alone. These ablations will be reported on both in-domain and OOD conditions to quantify the specific benefit of the reliability-filtered evolving enrollment. revision: yes
Referee: [§3] §3: The criteria used to define 'high-confidence' estimates and the exact retrieval procedure are described only at a high level; concrete implementation details (thresholds, similarity metrics, update rule) are needed to evaluate whether the mechanism can avoid confirmation bias and error accumulation.

Authors: We will expand §3 with the missing implementation details: the exact threshold(s) used to classify an estimate as high-confidence, the similarity metric and retrieval procedure (including how historical estimates are stored and queried), and the precise update rule for the enrollment embedding. These additions will enable readers to assess the risk of confirmation bias and error accumulation and will be accompanied by a short discussion of safeguards already present in the design. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic framework with experimental claims, not a derivation reducing to inputs

full rationale

The paper presents EvoTSE as an evolving enrollment framework that updates via reliability-filtered retrieval of high-confidence estimates. No equations, first-principles derivations, or predictions are described that reduce by construction to fitted parameters or self-definitions. The central mechanism (reliability filter) is an explicit algorithmic choice whose correctness is left to empirical validation on benchmarks, including OOD scenarios. No self-citations are invoked as load-bearing uniqueness theorems, no ansatzes are smuggled, and no known results are renamed as novel organization. The approach is self-contained as a proposed pipeline whose gains are asserted via experiments rather than tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Only the abstract is available, so free parameters such as any reliability threshold or retrieval hyperparameters, background assumptions about estimate confidence, and the precise definition of the retrieval mechanism cannot be audited in detail.

invented entities (1)

reliability-filtered retrieval over high-confidence historical estimates no independent evidence
purpose: To continuously update the enrollment reference during inference
Introduced as the core mechanism of EvoTSE to reduce speaker confusion

pith-pipeline@v0.9.0 · 5451 in / 1132 out tokens · 33405 ms · 2026-05-10T17:59:26.981870+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 1 internal anchor

[1]

EvoTSE: Evolving Enrollment for Target Speaker Extraction

Introduction Target speaker extraction aims to isolate a desired voice fr om multi-talker mixtures using a reference enrollment. Despi te re- cent progress, practical deployment is fundamentally limi ted by two challenges. First, speaker confusion remains a crit- ical failure mode, where models incorrectly track interfer ing speakers that exhibit similar ...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

Related Work Target Speaker Extraction: Current TSE research follows two approaches: embedding-based and embedding-free frame - works. The former utilizes a speaker encoder to extract identity-discriminative embeddings, either through pre- trained speaker veriﬁcation models like TEA-PSE family [20, 21, 22] or through jointly trained encoders as seen in X-...

work page
[3]

Static TSE The objective of TSE is to isolate the target signal s(t) from a multi-talker mixture x(t), guided by a reference enrollment r(t) of the target speaker

Problem Formulation 3.1. Static TSE The objective of TSE is to isolate the target signal s(t) from a multi-talker mixture x(t), guided by a reference enrollment r(t) of the target speaker. A generalized acoustic mixture in a reverberant environment is formulated as: x(t) = s(t) ∗ hs(t) + ∑ i ni(t) ∗ hn,i (t) + ∑ j vj (t) ∗ hv,j (t) (1) where h(t) denotes ...

work page
[4]

Framework Overview The EvoTSE framework redeﬁnes TSE as a retrieval-augmented task

Proposed Method 4.1. Framework Overview The EvoTSE framework redeﬁnes TSE as a retrieval-augmented task. EvoTSE transforms the conventional static mapping in to an evolving, evidence-accumulating system. As illustrate d in Fig. 2a, for each incoming mixture segment xn in a long- duration session, EvoTSE operates through a closed-loop fe ed- back pipeline ...

work page
[5]

Datasets Following the conﬁguration of the backbone models, we use the WSJ0-2mix dataset [42] for fundamental training and eva l- uation

Experimental Setup 5.1. Datasets Following the conﬁguration of the backbone models, we use the WSJ0-2mix dataset [42] for fundamental training and eva l- uation. It consists of three subsets: the training set with 2 0,000 utterances from 101 speakers, the development set with 5,00 0 utterances from 101 speakers, and the test set with 3,000 ut- terances fr...

work page
[6]

Main Results Table 1 compares the performance of our proposed method with two baselines

Experimental Results 6.1. Main Results Table 1 compares the performance of our proposed method with two baselines. USEF-TFGridNet (Standard) refers to the conventional inference pipeline. USEF-TFGridNet (Sta tic) uses a grouped inference setup but keeps the enrollment ﬁxed . EvoTSE represents our proposed grouped inference method with evolving updates. Mo...

work page
[7]

Conclusions This paper presents EvoTSE, a framework that transitions TS E from static mapping to an evolving inference pipeline by evo lv- ing the enrollment representation. Our approach effectively mit- igates speaker confusion, particularly in complex OOD scen ar- ios, and signiﬁcantly reduces dependency on the quality of t he initial enrollment audio. ...

work page
[8]

Speciﬁcally, these tools we re employed to ensure grammatical consistency, improve sentence ﬂuency, and enhance overall readability

Generative AI Use Disclosure During the preparation of this manuscript, the authors util ized large language models (LLMs) solely for grammatical veriﬁ- cation and language reﬁnement. Speciﬁcally, these tools we re employed to ensure grammatical consistency, improve sentence ﬂuency, and enhance overall readability. The authors empha size that LLMs played ...

work page
[9]

Tar- get confusion in end-to-end speaker extraction: Analysis a nd ap- proaches,

Z. Zhao, D. Y ang, G. Rongzhi, H. Zhang, and Y . Zou, “Tar- get confusion in end-to-end speaker extraction: Analysis a nd ap- proaches,” in Proc. INTERSPEECH 2022 , 09. 2022, pp. 5333– 5337

work page 2022
[10]

X-sepformer: End- to-end speaker extraction network with explicit optimizat ion on speaker confusion,

K. Liu, Z. Du, X. Wan, and H. Zhou, “X-sepformer: End- to-end speaker extraction network with explicit optimizat ion on speaker confusion,” in IEEE International Conference on Acous- tics, Speech and Signal Processing (ICASSP) , 06 2023, pp. 1–5

work page 2023
[11]

Or-tse: An overlap-robust speaker encoder for target speech extraction,

Y . Zhang, L. Y ao, and Q. Y ang, “Or-tse: An overlap-robust speaker encoder for target speech extraction,” in Proc. INTER- SPEECH 2024, 09 2024, pp. 587–591

work page 2024
[12]

V oiceﬁlte r: Targeted voice separation by speaker-conditioned spectro gram masking,

Q. Wang, H. Muckenhirn, K. Wilson, P . Sridhar, Z. Wu, J. He r- shey, R. Saurous, R. Weiss, Y . Jia, and I. Moreno, “V oiceﬁlte r: Targeted voice separation by speaker-conditioned spectro gram masking,” in Proc. INTERSPEECH 2019 , 09 2019, pp. 2728– 2732

work page 2019
[13]

Deep extractor network for target speaker recovery from si n- gle channel speech mixtures,

J. Wang, J. Chen, D. Su, L. Chen, M. Y u, Y . Qian, and D. Y u, “Deep extractor network for target speaker recovery from si n- gle channel speech mixtures,” in Proc. INTERSPEECH 2018 , 09 2018, pp. 307–311

work page 2018
[14]

Speakerbeam: Speaker aware neu ral network for target speaker extraction in speech mixtures,

K. ˇZmol´ ıkov´ a, M. Delcroix, K. Kinoshita, T. Ochiai, T. Nakatani, L. Burget, and J. Cernocky, “Speakerbeam: Speaker aware neu ral network for target speaker extraction in speech mixtures,” IEEE Journal of Selected Topics in Signal Processing , vol. PP , pp. 1–1, 06 2019

work page 2019
[15]

Speakerﬁlter: Deep learning- based target speaker extraction using anchor speech,

S. He, H. Li, and X. Zhang, “Speakerﬁlter: Deep learning- based target speaker extraction using anchor speech,” in IEEE Interna- tional Conference on Acoustics, Speech and Signal Processi ng (ICASSP), 2020, pp. 376–380

work page 2020
[16]

Usef-tse: Universal speaker embeddin g free target speaker extraction,

B. Zeng and M. Li, “Usef-tse: Universal speaker embeddin g free target speaker extraction,” IEEE Transactions on Audio, Speech and Language Processing, vol. 33, pp. 2110–2124, 2025

work page 2025
[17]

X-tf-gridnet: A time–freque ncy do- main target speaker extraction network with adaptive speak er em- bedding fusion,

F. Hao, X. Li, and C. Zheng, “X-tf-gridnet: A time–freque ncy do- main target speaker extraction network with adaptive speak er em- bedding fusion,” Information Fusion, vol. 112, p. 102550, 2024

work page 2024
[18]

Tar- get speaker extraction through comparing noisy positive and negative audio enrollments,

S. Xu, Y . Y ang, N. Trigoni, and A. Markham, “Tar- get speaker extraction through comparing noisy positive and negative audio enrollments,” 2025. [Online]. Availabl e: https://arxiv.org/abs/2502.16611

work page arXiv 2025
[19]

On the effectiveness of enrollment speech augmentation fo r tar- get speaker extraction,

J. Li, K. Zhang, S. Wang, H. Li, M.-W. Mak, and K. A. Lee, “On the effectiveness of enrollment speech augmentation fo r tar- get speaker extraction,” in 2024 IEEE Spoken Language Technol- ogy W orkshop (SLT), 12 2024, pp. 325–332

work page 2024
[20]

Look once to hear: Target speech hearing with noisy examples,

B. V eluri, M. Itani, T. Chen, T. Y oshioka, and S. Gollakota, “Look once to hear: Target speech hearing with noisy examples,” Pro- ceedings of the 2024 CHI Conference on Human Factors in Com- puting Systems, 2024

work page 2024
[21]

End-to-end target speaker sp eech recognition using context-aware attention mechanisms for chal- lenging enrollment scenario,

M. Ghane and M. S. Safari, “End-to-end target speaker sp eech recognition using context-aware attention mechanisms for chal- lenging enrollment scenario,” IEEE Signal Processing Letters , vol. 32, pp. 1940–1944, 2025

work page 1940
[22]

Speaker activity driven neural speech extrac tion,

M. Delcroix, K. Zmolikova, T. Ochiai, K. Kinoshita, and T. Nakatani, “Speaker activity driven neural speech extrac tion,” in IEEE International Conference on Acoustics, Speech and Sig - nal Processing (ICASSP), 2021, pp. 6099–6103

work page 2021
[23]

A re- gion based non-overlapping reference speech estimation me thod for speaker extraction,

Y . Zhang, Z. Li, B. Liu, H. Fan, Y . Y ang, and Q. Y ang, “A re- gion based non-overlapping reference speech estimation me thod for speaker extraction,” in MultiMedia Modeling. Springer Na- ture Switzerland, 2024, pp. 437–447

work page 2024
[24]

Robust speaker extraction network based on iterat ive reﬁned adaptation,

D. Chengyun, S. Ma, Y . Sha, Y . Zhang, H. Zhang, H. Song, an d F. Wang, “Robust speaker extraction network based on iterat ive reﬁned adaptation,” in Proc. INTERSPEECH 2021 , 08 2021, pp. 3530–3534

work page 2021
[25]

Multi- stage speaker extraction with utterance and frame-level re ference signals,

M. Ge, C. Xu, L. Wang, E. Chng, J. Dang, and H. Li, “Multi- stage speaker extraction with utterance and frame-level re ference signals,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 06 2021, pp. 6109–6113

work page 2021
[26]

Target speaker extraction with ultra-short reference speech by ve-ve fram ework,

L. Y ang, W. Liu, L. Tan, J. Y ang, and H.-G. Moon, “Target speaker extraction with ultra-short reference speech by ve-ve fram ework,” in IEEE International Conference on Acoustics, Speech and Sig - nal Processing (ICASSP), 2023, pp. 1–5

work page 2023
[27]

Retrieval-augmented generation for knowled ge- intensive nlp tasks,

P . Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. K¨ uttler, M. Lewis, W.-t. Yih, T. Rockt¨ aschel, S. Riedel , and D. Kiela, “Retrieval-augmented generation for knowled ge- intensive nlp tasks,” in Advances in Neural Information Process- ing Systems, Red Hook, NY , USA, 2020

work page 2020
[28]

Tea-pse: Tencent-ethereal-audio-lab perso nalized speech enhancement system for icassp 2022 dns challenge,

Y . Ju, W. Rao, X. Y an, Y . Fu, S. Lv, L. Cheng, Y . Wang, L. Xie , and S. Shang, “Tea-pse: Tencent-ethereal-audio-lab perso nalized speech enhancement system for icassp 2022 dns challenge,” i n IEEE International Conference on Acoustics, Speech and Sig nal Processing (ICASSP), 2022, pp. 9291–9295

work page 2022
[29]

Tea-pse 2.0: Sub-band network for real-time personalized speech enhancement,

Y . Ju, S. Zhang, W. Rao, Y . Wang, T. Y u, L. Xie, and S. Shang , “Tea-pse 2.0: Sub-band network for real-time personalized speech enhancement,” in 2022 IEEE Spoken Language Technology W ork- shop (SLT), 2023, pp. 472–479

work page 2022
[30]

Tea-pse 3.0: Tencent-ethereal-audio-lab pe rsonal- ized speech enhancement system for icassp 2023 dns-challen ge,

Y . Ju, J. Chen, S. Zhang, S. He, W. Rao, W. Zhu, Y . Wang, T. Y u, and S. Shang, “Tea-pse 3.0: Tencent-ethereal-audio-lab pe rsonal- ized speech enhancement system for icassp 2023 dns-challen ge,” in IEEE International Conference on Acoustics, Speech and Sig - nal Processing (ICASSP), 2023, pp. 1–2

work page 2023
[31]

Spex: Multi-scale time do- main speaker extraction network,

C. Xu, W. Rao, E. Chng, and H. Li, “Spex: Multi-scale time do- main speaker extraction network,” IEEE/ACM Transactions on Audio, Speech, and Language Processing , vol. PP , pp. 1–1, 04 2020

work page 2020
[32]

Spex+: A complete time domain speaker extraction network,

M. Ge, C. Xu, L. Wang, E. Chng, J. Dang, and H. Li, “Spex+: A complete time domain speaker extraction network,” in Proc. INTERSPEECH 2020, 10 2020, pp. 1406–1410

work page 2020
[33]

Multi-level speaker representation for target speaker ex traction,

K. Zhang, J. Li, S. Wang, Y . Wei, Y . Wang, Y . Wang, and H. Li , “Multi-level speaker representation for target speaker ex traction,” in IEEE International Conference on Acoustics, Speech and Sig - nal Processing (ICASSP), 04 2025, pp. 1–5

work page 2025
[34]

Hierarchical speaker representation for target speaker e xtrac- tion,

S. He, H. Zhang, W. Rao, K. Zhang, Y . Ju, Y . Y ang, and X. Zhang, “Hierarchical speaker representation for target speaker e xtrac- tion,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 10 361–10 365

work page 2024
[35]

Enhancing target speaker extraction with hierarchical sp eaker representation learning,

S. He, W. Xue, Y . Y ang, H. Zhang, J. Pan, and X. Zhang, “Enhancing target speaker extraction with hierarchical sp eaker representation learning,” Neural Networks, vol. 188, 8 2025. [On- line]. Available: https://doi.org/10.1016/j.neunet.2025.107388

work page doi:10.1016/j.neunet.2025.107388 2025
[36]

Tf-gridnet: Integrating full- and sub-band modeling for speech separation,

Z.-Q. Wang, S. Cornell, S. Choi, Y . Lee, B.-Y . Kim, and S.Watan- abe, “Tf-gridnet: Integrating full- and sub-band modeling for speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. PP , pp. 1–15, 01 2023

work page 2023
[37]

Muse: Multi-modal targe t speaker extraction with visual cues,

Z. Pan, R. Tao, C. Xu, and H. Li, “Muse: Multi-modal targe t speaker extraction with visual cues,” in IEEE International Con- ference on Acoustics, Speech and Signal Processing (ICASSP ), 2021, pp. 6678–6682

work page 2021
[38]

M o- muse: Momentum multi-modal target speaker extraction for r eal- time scenarios with impaired visual cues,

J. Li, K. Zhang, S. Wang, K. A. Lee, M.-W. Mak, and H. Li, “M o- muse: Momentum multi-modal target speaker extraction for r eal- time scenarios with impaired visual cues,” in Proc. IEEE Interna- tional Conference on Multimedia and Expo (ICME) , 06 2025, pp. 1–6

work page 2025
[39]

Ts-sep: Joint diarization and separation co ndi- tioned on estimated speaker embeddings,

C. Boeddeker, A. Subramanian, G. Wichern, R. Haeb-Umba ch, and J. Le Roux, “Ts-sep: Joint diarization and separation co ndi- tioned on estimated speaker embeddings,” IEEE/ACM Transac- tions on Audio, Speech, and Language Processing , vol. PP , pp. 1–13, 01 2024

work page 2024
[40]

Wavrag: Audio-integrated retrieval augmented ge ner- ation for spoken dialogue models,

Y . Chen, S. Ji, H. Wang, Z. Wang, S. Chen, J. He, J. Xu, and Z. Zhao, “Wavrag: Audio-integrated retrieval augmented ge ner- ation for spoken dialogue models,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Lingui stics (V olume 1: Long Papers). Vienna, Austria: Association for Com- putational Linguistics, Jul. 2025, pp. 12...

work page 2025
[41]

Recap: Retrieval-augmented audio captioning,

S. Ghosh, S. Kumar, C. Evuru, R. Duraiswami, and D. Manoc ha, “Recap: Retrieval-augmented audio captioning,” in IEEE Inter- national Conference on Acoustics, Speech and Signal Proces sing (ICASSP), 04 2024, pp. 1161–1165

work page 2024
[42]

Retrieval-augmented text-to-audio generation,

Y . Y uan, H. Liu, X. Liu, Q. Huang, M. D. Plumbley, and W. Wang, “Retrieval-augmented text-to-audio generation,” in IEEE Interna- tional Conference on Acoustics, Speech and Signal Processi ng (ICASSP), 2024, pp. 581–585

work page 2024
[43]

Listen only to me! how well can target speech ex - traction handle false alarms?

M. Delcroix, K. Kinoshita, T. Ochiai, K. ˇZmol´ ıkov´ a, H. Sato, and T. Nakatani, “Listen only to me! how well can target speech ex - traction handle false alarms?” in Proc. INTERSPEECH 2022, 09 2022, pp. 216–220

work page 2022
[44]

Enha ncing speaker extraction through rectifying target confusion,

J. Wang, S. Wang, J. Li, K. Zhang, Y . Qian, and H. Li, “Enha ncing speaker extraction through rectifying target confusion,” in 2024 IEEE Spoken Language Technology W orkshop (SLT), 12 2024, pp. 349–356

work page 2024
[45]

Spkaugtse: A simple and efﬁcient approach to address target confusion in end-to -end speaker extraction,

Z. Y ou, Z. Zhou, L. Li, and D. Wang, “Spkaugtse: A simple and efﬁcient approach to address target confusion in end-to -end speaker extraction,” in Proc. APSIPA ASC, 2025, pp. 583–588

work page 2025
[46]

Continuous speech separation: Dataset a nd analysis,

Z. Chen, T. Y oshioka, L. Lu, T. Zhou, Z. Meng, Y . Luo, J. Wu , X. Xiao, and J. Li, “Continuous speech separation: Dataset a nd analysis,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 2020, pp. 7284–7288

work page 2020
[47]

Ecapa -tdnn: Emphasized channel attention, propagation and aggregatio n in tdnn based speaker veriﬁcation,

B. Desplanques, J. Thienpondt, and K. Demuynck, “Ecapa -tdnn: Emphasized channel attention, propagation and aggregatio n in tdnn based speaker veriﬁcation,” in Proc. INTERSPEECH 2020 , 10 2020

work page 2020
[48]

Wespeaker: A research and production oriented speaker embedding learning toolkit,

H. Wang, C. Liang, S. Wang, Z. Chen, B. Zhang, X. Xiang, Y . Deng, and Y . Qian, “Wespeaker: A research and production oriented speaker embedding learning toolkit,” in IEEE Interna- tional Conference on Acoustics, Speech and Signal Processi ng (ICASSP). IEEE, 2023, pp. 1–5

work page 2023
[49]

emotion2vec: Self-supervised pre-training for speech em otion representation,

Z. Ma, Z. Zheng, J. Y e, J. Li, Z. Gao, S. Zhang, and X. Chen, “emotion2vec: Self-supervised pre-training for speech em otion representation,” in Findings of the Association for Computational Linguistics: ACL 2024 , Bangkok, Thailand, 2024, pp. 15 747– 15 760

work page 2024
[50]

Deep clus- tering: Discriminative embeddings for segmentation and se para- tion,

J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clus- tering: Discriminative embeddings for segmentation and se para- tion,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 31–35

work page 2016
[51]

Csr-i (w sj0) com- plete ldc93s6a,

J. Garofolo, D. Graff, D. Paul, and D. Pallett, “Csr-i (w sj0) com- plete ldc93s6a,” Web Download, Philadelphia, 1993, lDC93S 6A

work page 1993
[52]

Librimix: An open-source dataset for generalizable speech separation,

J. Cosentino, M. Pariente, S. Cornell, A. Deleforge, an d E. Vin- cent, “Librimix: An open-source dataset for generalizable speech separation,” arXiv: Audio and Speech Processing, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:218862876

work page 2020
[53]

Emotional voice co n- version: Theory, databases and esd,

K. Zhou, B. Sisman, R. Liu, and H. Li, “Emotional voice co n- version: Theory, databases and esd,” Speech Communication, vol. 137, pp. 1–18, 2022

work page 2022
[54]

Seen and unseen emotional style transfer for voice con- version with a new emotional speech dataset,

——, “Seen and unseen emotional style transfer for voice con- version with a new emotional speech dataset,” in IEEE Interna- tional Conference on Acoustics, Speech and Signal Processi ng (ICASSP). IEEE, 2021, pp. 920–924

work page 2021
[55]

X-tasnet: Robust and accu - rate time-domain speaker extraction network,

Z. Zhang, B. He, and Z. Zhang, “X-tasnet: Robust and accu - rate time-domain speaker extraction network,” in Proc. INTER- SPEECH 2020, 10 2020, pp. 1421–1425

work page 2020

[1] [1]

EvoTSE: Evolving Enrollment for Target Speaker Extraction

Introduction Target speaker extraction aims to isolate a desired voice fr om multi-talker mixtures using a reference enrollment. Despi te re- cent progress, practical deployment is fundamentally limi ted by two challenges. First, speaker confusion remains a crit- ical failure mode, where models incorrectly track interfer ing speakers that exhibit similar ...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [2]

Related Work Target Speaker Extraction: Current TSE research follows two approaches: embedding-based and embedding-free frame - works. The former utilizes a speaker encoder to extract identity-discriminative embeddings, either through pre- trained speaker veriﬁcation models like TEA-PSE family [20, 21, 22] or through jointly trained encoders as seen in X-...

work page

[3] [3]

Static TSE The objective of TSE is to isolate the target signal s(t) from a multi-talker mixture x(t), guided by a reference enrollment r(t) of the target speaker

Problem Formulation 3.1. Static TSE The objective of TSE is to isolate the target signal s(t) from a multi-talker mixture x(t), guided by a reference enrollment r(t) of the target speaker. A generalized acoustic mixture in a reverberant environment is formulated as: x(t) = s(t) ∗ hs(t) + ∑ i ni(t) ∗ hn,i (t) + ∑ j vj (t) ∗ hv,j (t) (1) where h(t) denotes ...

work page

[4] [4]

Framework Overview The EvoTSE framework redeﬁnes TSE as a retrieval-augmented task

Proposed Method 4.1. Framework Overview The EvoTSE framework redeﬁnes TSE as a retrieval-augmented task. EvoTSE transforms the conventional static mapping in to an evolving, evidence-accumulating system. As illustrate d in Fig. 2a, for each incoming mixture segment xn in a long- duration session, EvoTSE operates through a closed-loop fe ed- back pipeline ...

work page

[5] [5]

Datasets Following the conﬁguration of the backbone models, we use the WSJ0-2mix dataset [42] for fundamental training and eva l- uation

Experimental Setup 5.1. Datasets Following the conﬁguration of the backbone models, we use the WSJ0-2mix dataset [42] for fundamental training and eva l- uation. It consists of three subsets: the training set with 2 0,000 utterances from 101 speakers, the development set with 5,00 0 utterances from 101 speakers, and the test set with 3,000 ut- terances fr...

work page

[6] [6]

Main Results Table 1 compares the performance of our proposed method with two baselines

Experimental Results 6.1. Main Results Table 1 compares the performance of our proposed method with two baselines. USEF-TFGridNet (Standard) refers to the conventional inference pipeline. USEF-TFGridNet (Sta tic) uses a grouped inference setup but keeps the enrollment ﬁxed . EvoTSE represents our proposed grouped inference method with evolving updates. Mo...

work page

[7] [7]

Conclusions This paper presents EvoTSE, a framework that transitions TS E from static mapping to an evolving inference pipeline by evo lv- ing the enrollment representation. Our approach effectively mit- igates speaker confusion, particularly in complex OOD scen ar- ios, and signiﬁcantly reduces dependency on the quality of t he initial enrollment audio. ...

work page

[8] [8]

Speciﬁcally, these tools we re employed to ensure grammatical consistency, improve sentence ﬂuency, and enhance overall readability

Generative AI Use Disclosure During the preparation of this manuscript, the authors util ized large language models (LLMs) solely for grammatical veriﬁ- cation and language reﬁnement. Speciﬁcally, these tools we re employed to ensure grammatical consistency, improve sentence ﬂuency, and enhance overall readability. The authors empha size that LLMs played ...

work page

[9] [9]

Tar- get confusion in end-to-end speaker extraction: Analysis a nd ap- proaches,

Z. Zhao, D. Y ang, G. Rongzhi, H. Zhang, and Y . Zou, “Tar- get confusion in end-to-end speaker extraction: Analysis a nd ap- proaches,” in Proc. INTERSPEECH 2022 , 09. 2022, pp. 5333– 5337

work page 2022

[10] [10]

X-sepformer: End- to-end speaker extraction network with explicit optimizat ion on speaker confusion,

K. Liu, Z. Du, X. Wan, and H. Zhou, “X-sepformer: End- to-end speaker extraction network with explicit optimizat ion on speaker confusion,” in IEEE International Conference on Acous- tics, Speech and Signal Processing (ICASSP) , 06 2023, pp. 1–5

work page 2023

[11] [11]

Or-tse: An overlap-robust speaker encoder for target speech extraction,

Y . Zhang, L. Y ao, and Q. Y ang, “Or-tse: An overlap-robust speaker encoder for target speech extraction,” in Proc. INTER- SPEECH 2024, 09 2024, pp. 587–591

work page 2024

[12] [12]

V oiceﬁlte r: Targeted voice separation by speaker-conditioned spectro gram masking,

Q. Wang, H. Muckenhirn, K. Wilson, P . Sridhar, Z. Wu, J. He r- shey, R. Saurous, R. Weiss, Y . Jia, and I. Moreno, “V oiceﬁlte r: Targeted voice separation by speaker-conditioned spectro gram masking,” in Proc. INTERSPEECH 2019 , 09 2019, pp. 2728– 2732

work page 2019

[13] [13]

Deep extractor network for target speaker recovery from si n- gle channel speech mixtures,

J. Wang, J. Chen, D. Su, L. Chen, M. Y u, Y . Qian, and D. Y u, “Deep extractor network for target speaker recovery from si n- gle channel speech mixtures,” in Proc. INTERSPEECH 2018 , 09 2018, pp. 307–311

work page 2018

[14] [14]

Speakerbeam: Speaker aware neu ral network for target speaker extraction in speech mixtures,

K. ˇZmol´ ıkov´ a, M. Delcroix, K. Kinoshita, T. Ochiai, T. Nakatani, L. Burget, and J. Cernocky, “Speakerbeam: Speaker aware neu ral network for target speaker extraction in speech mixtures,” IEEE Journal of Selected Topics in Signal Processing , vol. PP , pp. 1–1, 06 2019

work page 2019

[15] [15]

Speakerﬁlter: Deep learning- based target speaker extraction using anchor speech,

S. He, H. Li, and X. Zhang, “Speakerﬁlter: Deep learning- based target speaker extraction using anchor speech,” in IEEE Interna- tional Conference on Acoustics, Speech and Signal Processi ng (ICASSP), 2020, pp. 376–380

work page 2020

[16] [16]

Usef-tse: Universal speaker embeddin g free target speaker extraction,

B. Zeng and M. Li, “Usef-tse: Universal speaker embeddin g free target speaker extraction,” IEEE Transactions on Audio, Speech and Language Processing, vol. 33, pp. 2110–2124, 2025

work page 2025

[17] [17]

X-tf-gridnet: A time–freque ncy do- main target speaker extraction network with adaptive speak er em- bedding fusion,

F. Hao, X. Li, and C. Zheng, “X-tf-gridnet: A time–freque ncy do- main target speaker extraction network with adaptive speak er em- bedding fusion,” Information Fusion, vol. 112, p. 102550, 2024

work page 2024

[18] [18]

Tar- get speaker extraction through comparing noisy positive and negative audio enrollments,

S. Xu, Y . Y ang, N. Trigoni, and A. Markham, “Tar- get speaker extraction through comparing noisy positive and negative audio enrollments,” 2025. [Online]. Availabl e: https://arxiv.org/abs/2502.16611

work page arXiv 2025

[19] [19]

On the effectiveness of enrollment speech augmentation fo r tar- get speaker extraction,

J. Li, K. Zhang, S. Wang, H. Li, M.-W. Mak, and K. A. Lee, “On the effectiveness of enrollment speech augmentation fo r tar- get speaker extraction,” in 2024 IEEE Spoken Language Technol- ogy W orkshop (SLT), 12 2024, pp. 325–332

work page 2024

[20] [20]

Look once to hear: Target speech hearing with noisy examples,

B. V eluri, M. Itani, T. Chen, T. Y oshioka, and S. Gollakota, “Look once to hear: Target speech hearing with noisy examples,” Pro- ceedings of the 2024 CHI Conference on Human Factors in Com- puting Systems, 2024

work page 2024

[21] [21]

End-to-end target speaker sp eech recognition using context-aware attention mechanisms for chal- lenging enrollment scenario,

M. Ghane and M. S. Safari, “End-to-end target speaker sp eech recognition using context-aware attention mechanisms for chal- lenging enrollment scenario,” IEEE Signal Processing Letters , vol. 32, pp. 1940–1944, 2025

work page 1940

[22] [22]

Speaker activity driven neural speech extrac tion,

M. Delcroix, K. Zmolikova, T. Ochiai, K. Kinoshita, and T. Nakatani, “Speaker activity driven neural speech extrac tion,” in IEEE International Conference on Acoustics, Speech and Sig - nal Processing (ICASSP), 2021, pp. 6099–6103

work page 2021

[23] [23]

A re- gion based non-overlapping reference speech estimation me thod for speaker extraction,

Y . Zhang, Z. Li, B. Liu, H. Fan, Y . Y ang, and Q. Y ang, “A re- gion based non-overlapping reference speech estimation me thod for speaker extraction,” in MultiMedia Modeling. Springer Na- ture Switzerland, 2024, pp. 437–447

work page 2024

[24] [24]

Robust speaker extraction network based on iterat ive reﬁned adaptation,

D. Chengyun, S. Ma, Y . Sha, Y . Zhang, H. Zhang, H. Song, an d F. Wang, “Robust speaker extraction network based on iterat ive reﬁned adaptation,” in Proc. INTERSPEECH 2021 , 08 2021, pp. 3530–3534

work page 2021

[25] [25]

Multi- stage speaker extraction with utterance and frame-level re ference signals,

M. Ge, C. Xu, L. Wang, E. Chng, J. Dang, and H. Li, “Multi- stage speaker extraction with utterance and frame-level re ference signals,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 06 2021, pp. 6109–6113

work page 2021

[26] [26]

Target speaker extraction with ultra-short reference speech by ve-ve fram ework,

L. Y ang, W. Liu, L. Tan, J. Y ang, and H.-G. Moon, “Target speaker extraction with ultra-short reference speech by ve-ve fram ework,” in IEEE International Conference on Acoustics, Speech and Sig - nal Processing (ICASSP), 2023, pp. 1–5

work page 2023

[27] [27]

Retrieval-augmented generation for knowled ge- intensive nlp tasks,

P . Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. K¨ uttler, M. Lewis, W.-t. Yih, T. Rockt¨ aschel, S. Riedel , and D. Kiela, “Retrieval-augmented generation for knowled ge- intensive nlp tasks,” in Advances in Neural Information Process- ing Systems, Red Hook, NY , USA, 2020

work page 2020

[28] [28]

Tea-pse: Tencent-ethereal-audio-lab perso nalized speech enhancement system for icassp 2022 dns challenge,

Y . Ju, W. Rao, X. Y an, Y . Fu, S. Lv, L. Cheng, Y . Wang, L. Xie , and S. Shang, “Tea-pse: Tencent-ethereal-audio-lab perso nalized speech enhancement system for icassp 2022 dns challenge,” i n IEEE International Conference on Acoustics, Speech and Sig nal Processing (ICASSP), 2022, pp. 9291–9295

work page 2022

[29] [29]

Tea-pse 2.0: Sub-band network for real-time personalized speech enhancement,

Y . Ju, S. Zhang, W. Rao, Y . Wang, T. Y u, L. Xie, and S. Shang , “Tea-pse 2.0: Sub-band network for real-time personalized speech enhancement,” in 2022 IEEE Spoken Language Technology W ork- shop (SLT), 2023, pp. 472–479

work page 2022

[30] [30]

Tea-pse 3.0: Tencent-ethereal-audio-lab pe rsonal- ized speech enhancement system for icassp 2023 dns-challen ge,

Y . Ju, J. Chen, S. Zhang, S. He, W. Rao, W. Zhu, Y . Wang, T. Y u, and S. Shang, “Tea-pse 3.0: Tencent-ethereal-audio-lab pe rsonal- ized speech enhancement system for icassp 2023 dns-challen ge,” in IEEE International Conference on Acoustics, Speech and Sig - nal Processing (ICASSP), 2023, pp. 1–2

work page 2023

[31] [31]

Spex: Multi-scale time do- main speaker extraction network,

C. Xu, W. Rao, E. Chng, and H. Li, “Spex: Multi-scale time do- main speaker extraction network,” IEEE/ACM Transactions on Audio, Speech, and Language Processing , vol. PP , pp. 1–1, 04 2020

work page 2020

[32] [32]

Spex+: A complete time domain speaker extraction network,

M. Ge, C. Xu, L. Wang, E. Chng, J. Dang, and H. Li, “Spex+: A complete time domain speaker extraction network,” in Proc. INTERSPEECH 2020, 10 2020, pp. 1406–1410

work page 2020

[33] [33]

Multi-level speaker representation for target speaker ex traction,

K. Zhang, J. Li, S. Wang, Y . Wei, Y . Wang, Y . Wang, and H. Li , “Multi-level speaker representation for target speaker ex traction,” in IEEE International Conference on Acoustics, Speech and Sig - nal Processing (ICASSP), 04 2025, pp. 1–5

work page 2025

[34] [34]

Hierarchical speaker representation for target speaker e xtrac- tion,

S. He, H. Zhang, W. Rao, K. Zhang, Y . Ju, Y . Y ang, and X. Zhang, “Hierarchical speaker representation for target speaker e xtrac- tion,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 10 361–10 365

work page 2024

[35] [35]

Enhancing target speaker extraction with hierarchical sp eaker representation learning,

S. He, W. Xue, Y . Y ang, H. Zhang, J. Pan, and X. Zhang, “Enhancing target speaker extraction with hierarchical sp eaker representation learning,” Neural Networks, vol. 188, 8 2025. [On- line]. Available: https://doi.org/10.1016/j.neunet.2025.107388

work page doi:10.1016/j.neunet.2025.107388 2025

[36] [36]

Tf-gridnet: Integrating full- and sub-band modeling for speech separation,

Z.-Q. Wang, S. Cornell, S. Choi, Y . Lee, B.-Y . Kim, and S.Watan- abe, “Tf-gridnet: Integrating full- and sub-band modeling for speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. PP , pp. 1–15, 01 2023

work page 2023

[37] [37]

Muse: Multi-modal targe t speaker extraction with visual cues,

Z. Pan, R. Tao, C. Xu, and H. Li, “Muse: Multi-modal targe t speaker extraction with visual cues,” in IEEE International Con- ference on Acoustics, Speech and Signal Processing (ICASSP ), 2021, pp. 6678–6682

work page 2021

[38] [38]

M o- muse: Momentum multi-modal target speaker extraction for r eal- time scenarios with impaired visual cues,

J. Li, K. Zhang, S. Wang, K. A. Lee, M.-W. Mak, and H. Li, “M o- muse: Momentum multi-modal target speaker extraction for r eal- time scenarios with impaired visual cues,” in Proc. IEEE Interna- tional Conference on Multimedia and Expo (ICME) , 06 2025, pp. 1–6

work page 2025

[39] [39]

Ts-sep: Joint diarization and separation co ndi- tioned on estimated speaker embeddings,

C. Boeddeker, A. Subramanian, G. Wichern, R. Haeb-Umba ch, and J. Le Roux, “Ts-sep: Joint diarization and separation co ndi- tioned on estimated speaker embeddings,” IEEE/ACM Transac- tions on Audio, Speech, and Language Processing , vol. PP , pp. 1–13, 01 2024

work page 2024

[40] [40]

Wavrag: Audio-integrated retrieval augmented ge ner- ation for spoken dialogue models,

Y . Chen, S. Ji, H. Wang, Z. Wang, S. Chen, J. He, J. Xu, and Z. Zhao, “Wavrag: Audio-integrated retrieval augmented ge ner- ation for spoken dialogue models,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Lingui stics (V olume 1: Long Papers). Vienna, Austria: Association for Com- putational Linguistics, Jul. 2025, pp. 12...

work page 2025

[41] [41]

Recap: Retrieval-augmented audio captioning,

S. Ghosh, S. Kumar, C. Evuru, R. Duraiswami, and D. Manoc ha, “Recap: Retrieval-augmented audio captioning,” in IEEE Inter- national Conference on Acoustics, Speech and Signal Proces sing (ICASSP), 04 2024, pp. 1161–1165

work page 2024

[42] [42]

Retrieval-augmented text-to-audio generation,

Y . Y uan, H. Liu, X. Liu, Q. Huang, M. D. Plumbley, and W. Wang, “Retrieval-augmented text-to-audio generation,” in IEEE Interna- tional Conference on Acoustics, Speech and Signal Processi ng (ICASSP), 2024, pp. 581–585

work page 2024

[43] [43]

Listen only to me! how well can target speech ex - traction handle false alarms?

M. Delcroix, K. Kinoshita, T. Ochiai, K. ˇZmol´ ıkov´ a, H. Sato, and T. Nakatani, “Listen only to me! how well can target speech ex - traction handle false alarms?” in Proc. INTERSPEECH 2022, 09 2022, pp. 216–220

work page 2022

[44] [44]

Enha ncing speaker extraction through rectifying target confusion,

J. Wang, S. Wang, J. Li, K. Zhang, Y . Qian, and H. Li, “Enha ncing speaker extraction through rectifying target confusion,” in 2024 IEEE Spoken Language Technology W orkshop (SLT), 12 2024, pp. 349–356

work page 2024

[45] [45]

Spkaugtse: A simple and efﬁcient approach to address target confusion in end-to -end speaker extraction,

Z. Y ou, Z. Zhou, L. Li, and D. Wang, “Spkaugtse: A simple and efﬁcient approach to address target confusion in end-to -end speaker extraction,” in Proc. APSIPA ASC, 2025, pp. 583–588

work page 2025

[46] [46]

Continuous speech separation: Dataset a nd analysis,

Z. Chen, T. Y oshioka, L. Lu, T. Zhou, Z. Meng, Y . Luo, J. Wu , X. Xiao, and J. Li, “Continuous speech separation: Dataset a nd analysis,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 2020, pp. 7284–7288

work page 2020

[47] [47]

Ecapa -tdnn: Emphasized channel attention, propagation and aggregatio n in tdnn based speaker veriﬁcation,

B. Desplanques, J. Thienpondt, and K. Demuynck, “Ecapa -tdnn: Emphasized channel attention, propagation and aggregatio n in tdnn based speaker veriﬁcation,” in Proc. INTERSPEECH 2020 , 10 2020

work page 2020

[48] [48]

Wespeaker: A research and production oriented speaker embedding learning toolkit,

H. Wang, C. Liang, S. Wang, Z. Chen, B. Zhang, X. Xiang, Y . Deng, and Y . Qian, “Wespeaker: A research and production oriented speaker embedding learning toolkit,” in IEEE Interna- tional Conference on Acoustics, Speech and Signal Processi ng (ICASSP). IEEE, 2023, pp. 1–5

work page 2023

[49] [49]

emotion2vec: Self-supervised pre-training for speech em otion representation,

Z. Ma, Z. Zheng, J. Y e, J. Li, Z. Gao, S. Zhang, and X. Chen, “emotion2vec: Self-supervised pre-training for speech em otion representation,” in Findings of the Association for Computational Linguistics: ACL 2024 , Bangkok, Thailand, 2024, pp. 15 747– 15 760

work page 2024

[50] [50]

Deep clus- tering: Discriminative embeddings for segmentation and se para- tion,

J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clus- tering: Discriminative embeddings for segmentation and se para- tion,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 31–35

work page 2016

[51] [51]

Csr-i (w sj0) com- plete ldc93s6a,

J. Garofolo, D. Graff, D. Paul, and D. Pallett, “Csr-i (w sj0) com- plete ldc93s6a,” Web Download, Philadelphia, 1993, lDC93S 6A

work page 1993

[52] [52]

Librimix: An open-source dataset for generalizable speech separation,

J. Cosentino, M. Pariente, S. Cornell, A. Deleforge, an d E. Vin- cent, “Librimix: An open-source dataset for generalizable speech separation,” arXiv: Audio and Speech Processing, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:218862876

work page 2020

[53] [53]

Emotional voice co n- version: Theory, databases and esd,

K. Zhou, B. Sisman, R. Liu, and H. Li, “Emotional voice co n- version: Theory, databases and esd,” Speech Communication, vol. 137, pp. 1–18, 2022

work page 2022

[54] [54]

Seen and unseen emotional style transfer for voice con- version with a new emotional speech dataset,

——, “Seen and unseen emotional style transfer for voice con- version with a new emotional speech dataset,” in IEEE Interna- tional Conference on Acoustics, Speech and Signal Processi ng (ICASSP). IEEE, 2021, pp. 920–924

work page 2021

[55] [55]

X-tasnet: Robust and accu - rate time-domain speaker extraction network,

Z. Zhang, B. He, and Z. Zhang, “X-tasnet: Robust and accu - rate time-domain speaker extraction network,” in Proc. INTER- SPEECH 2020, 10 2020, pp. 1421–1425

work page 2020