pith. sign in

arxiv: 2601.06006 · v2 · pith:DXKRG35Lnew · submitted 2026-01-09 · 📡 eess.AS · cs.SD

Discriminative-Generative Target Speaker Extraction with Decoder-Only Language Models

Pith reviewed 2026-05-21 15:38 UTC · model grok-4.3

classification 📡 eess.AS cs.SD
keywords target speaker extractionspeech enhancementdiscriminative-generative frameworkdecoder-only language modelneural audio codecperceptual qualityspeaker consistencyinterference suppression
0
0 comments X

The pith

A two-stage framework pairs a discriminative front-end for interference suppression with a generative decoder-only language model back-end for speech reconstruction to improve perceptual quality and speaker consistency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that a hybrid discriminative-generative approach for target speaker extraction and speech enhancement can combine strong separation ability with more natural output than either type of model achieves by itself. Purely discriminative systems remove unwanted sounds effectively but often produce artificial speech, while purely generative systems built on language models can introduce invented content or lose the target speaker's identity. The proposed design first creates target-focused representations that suppress interference, then feeds them to an autoregressive decoder-only language model that rebuilds the speech waveform inside a neural audio codec space. This integration is intended to deliver results that score well on listening quality, word accuracy, and voice matching at the same time.

Core claim

The central claim is that the discriminative-generative two-stage framework, in which a discriminative front-end first produces target-related representations with strong interference suppression and a generative back-end then reconstructs high-quality speech in neural audio codec representation space, achieves a better balance among perceptual quality, intelligibility, and speaker consistency than purely discriminative or purely generative baselines on TSE and SE benchmarks. The work also examines several ways the two stages can work together, including front-end freezing, joint fine-tuning, SI-SDR regularization, and choices between autoregressive and non-autoregressive inference.

What carries the argument

The two-stage architecture in which a discriminative front-end supplies target-related representations for interference suppression and a generative back-end based on an autoregressive decoder-only language model reconstructs the final speech inside neural audio codec space.

If this is right

  • The framework keeps the interference suppression strength of discriminative models while adding the natural reconstruction strength of generative models.
  • Strategies such as freezing the front-end, joint fine-tuning, and adding SI-SDR regularization can be used to adjust how the stages interact.
  • Gains appear on both dedicated target speaker extraction tasks and broader speech enhancement under noise.
  • Conditioning the generative stage on discriminative representations reduces the hallucination and content drift that appear in purely generative systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same intermediate representations could let researchers swap in different front-end or back-end models for other audio separation problems.
  • Switching to non-autoregressive inference in the back-end might support lower-latency uses while retaining most of the quality gain.
  • Scaling the size of the decoder-only language model in the back-end could produce further lifts in naturalness for the same front-end output.

Load-bearing premise

The generative back-end can reconstruct accurate high-quality speech from the discriminative front-end's representations without adding hallucinations or shifting the spoken content even under complex noisy or overlapping conditions.

What would settle it

A direct comparison on overlapping multi-speaker mixtures showing higher word error rates or lower speaker similarity scores for the two-stage outputs than for a strong discriminative baseline alone would disprove the claimed balance.

Figures

Figures reproduced from arXiv: 2601.06006 by Bang Zeng, Beilong Tang, Ming Li, Wang Xiang.

Figure 1
Figure 1. Figure 1: The diagram of a typical target speaker extraction method. The speaker embedding extractor is typically a pre-trained speaker recognition model. ’C’ [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The diagram of LauraTSE network. ‘m’ and ‘r’ denote the mixed speech and reference speech, respectively. We use two weight sharing conformer [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The diagram of discriminative-generative target speaker extraction framework. ‘m’ and ‘r’ denote the mixed speech and reference speech, respectively. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The diagram of USEF-Laura-TSE. ‘m’ and ‘r’ denote the mixed speech and reference speech, respectively. [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: dWER versus training data scale across models. Annotations ”(- [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
read the original abstract

Target speaker extraction (TSE) aims to recover the speech of a desired speaker from a mixture given a short enrollment utterance, while speech enhancement (SE) focuses on improving speech quality under noisy conditions. Most existing TSE and SE systems are based on discriminative modeling and have shown strong interference suppression ability, but they often remain limited in perceptual quality and naturalness. To address this issue, we first introduce LauraTSE, a generative TSE model built on an autoregressive decoder-only language model. Although generative modeling is promising for quality enhancement, purely generative TSE may suffer from hallucination, content drift, and limited controllability in complex acoustic conditions. We therefore propose a discriminative-generative two-stage framework, where a discriminative front-end first produces target-related representations with strong interference suppression, and a generative back-end then reconstructs high-quality speech in the neural audio codec representation space. This design combines the controllability of discriminative extraction with the reconstruction capability of generative modeling. We further investigate several collaboration strategies for the two-stage framework, including front-end freezing, joint fine-tuning, SI-SDR regularization, and autoregressive/non-autoregressive inference. Experimental results on both TSE and SE benchmarks show that the proposed framework achieves a better balance among perceptual quality, intelligibility, and speaker consistency than purely discriminative or purely generative baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces LauraTSE, a generative TSE model based on an autoregressive decoder-only language model, and proposes a discriminative-generative two-stage framework for target speaker extraction (TSE) and speech enhancement (SE). A discriminative front-end first produces target-related representations for strong interference suppression; a generative back-end then reconstructs high-quality speech in neural audio codec space. The authors explore collaboration strategies including front-end freezing, joint fine-tuning, SI-SDR regularization, and AR/NAR inference, claiming that the hybrid framework achieves a better balance among perceptual quality, intelligibility, and speaker consistency than purely discriminative or purely generative baselines on TSE and SE benchmarks.

Significance. If the central claim holds with rigorous verification that content is preserved, the hybrid approach could meaningfully advance audio and speech processing by combining the controllability of discriminative models with the naturalness of generative reconstruction, potentially improving real-world robustness in noisy or multi-speaker conditions.

major comments (2)
  1. [§3] §3 (two-stage framework description): the claim that the generative back-end reliably reconstructs without hallucination or content drift rests on an unverified transfer property of the interface; no ablation isolates semantic fidelity via ASR-WER on held-out transcripts or embedding similarity to enrollment utterances, even though standard metrics (PESQ, STOI, speaker similarity) can improve while content drifts under reverberation or mismatch.
  2. [§4] §4 (experimental results): the reported balance on TSE/SE benchmarks lacks error bars, dataset split details, and direct comparisons that would confirm the generative stage does not introduce drift; without these, it is unclear whether the improvement over baselines is robust or dependent on post-hoc choices in collaboration strategies.
minor comments (1)
  1. [Abstract] Abstract: quantitative metrics and dataset names are omitted, making it harder for readers to immediately gauge the strength of the balance claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We have carefully considered the concerns regarding verification of content preservation and experimental rigor in the two-stage framework. Below we provide point-by-point responses and indicate the revisions made to the manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (two-stage framework description): the claim that the generative back-end reliably reconstructs without hallucination or content drift rests on an unverified transfer property of the interface; no ablation isolates semantic fidelity via ASR-WER on held-out transcripts or embedding similarity to enrollment utterances, even though standard metrics (PESQ, STOI, speaker similarity) can improve while content drifts under reverberation or mismatch.

    Authors: We agree that ASR-WER and embedding similarity provide valuable additional evidence for semantic fidelity. The original manuscript relied on speaker similarity and STOI as proxies for content preservation, which are standard in the TSE literature and showed consistent gains for the hybrid approach. To strengthen the claim, we have added an ablation study in the revised Section 3 (and Appendix) that reports ASR-WER on held-out transcripts from a pre-trained ASR model as well as cosine similarity between speaker embeddings extracted from the enrollment utterance and the reconstructed output. These new results confirm that the hybrid framework exhibits lower content drift than the purely generative baseline, particularly under reverberant conditions. We have also expanded the description of the interface to better articulate why the discriminative front-end representations support stable transfer to the generative back-end. revision: yes

  2. Referee: [§4] §4 (experimental results): the reported balance on TSE/SE benchmarks lacks error bars, dataset split details, and direct comparisons that would confirm the generative stage does not introduce drift; without these, it is unclear whether the improvement over baselines is robust or dependent on post-hoc choices in collaboration strategies.

    Authors: We acknowledge the importance of these details for assessing robustness. In the revised manuscript we now report mean and standard deviation across three random seeds as error bars in all result tables. A new subsection in Section 4 explicitly describes the train/validation/test splits for each benchmark dataset. We have also added direct comparisons of the front-end alone, the generative back-end alone, and the full two-stage system (with and without SI-SDR regularization) on both clean and reverberant test conditions. These comparisons demonstrate that the generative stage improves perceptual quality without measurable content drift relative to the front-end output, and that the reported gains are not sensitive to the specific collaboration strategy chosen. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework with independent experimental validation

full rationale

The paper proposes a two-stage discriminative-generative TSE framework and evaluates it on standard benchmarks using perceptual metrics. No equations, derivations, or fitted-parameter predictions appear in the abstract or described content. Claims rest on comparative results rather than any self-definitional reduction, self-citation chain, or ansatz smuggled via prior work. The collaboration strategies (freezing, joint fine-tuning, SI-SDR regularization) are presented as design choices and tested directly, without reducing to inputs by construction. This is a standard empirical architecture paper whose central claims remain falsifiable outside any internal fit.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based solely on abstract; no explicit free parameters, axioms, or invented entities are stated. Model training likely involves standard neural network hyperparameters not detailed here.

pith-pipeline@v0.9.0 · 5769 in / 1085 out tokens · 34474 ms · 2026-05-21T15:38:06.087070+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    We propose a discriminative–generative two-stage TSE framework, where a discriminative front-end first produces target-related representations... and a generative back-end then reconstructs high-quality speech in the neural audio codec representation space.

  • IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat_equiv_Nat unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    LauraTSE comprises a compact AR decoder-only LM that predicts coarse-grained target speech representations... together with a lightweight encoder-only LM designed to recover fine-grained acoustic details.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

93 extracted references · 93 canonical work pages · 2 internal anchors

  1. [1]

    Some experiments on the recognition of speech, with one and with two ears,

    E. C. Cherry, “Some experiments on the recognition of speech, with one and with two ears,”Journal of the Acoustical Society of America, vol. 25, pp. 975–979, 1953

  2. [2]

    The cocktail party phenomenon: A review of research on speech intelligibility in multiple-talker conditions,

    A. W. Bronkhorst, “The cocktail party phenomenon: A review of research on speech intelligibility in multiple-talker conditions,”Acta acustica united with acustica, vol. 86, no. 1, pp. 117–128, 2000

  3. [3]

    Single-channel speech separation using sparse non-negative matrix factorization

    M. N. Schmidt and R. K. Olsson, “Single-channel speech separation using sparse non-negative matrix factorization.” inProc. of Interspeech, vol. 2. Citeseer, 2006, pp. 2–5

  4. [4]

    New algorithms for non- negative matrix factorization in applications to blind source separation,

    A. Cichocki, R. Zdunek, and S.-i. Amari, “New algorithms for non- negative matrix factorization in applications to blind source separation,” in2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, vol. 5, 2006, pp. V–V

  5. [5]

    A computational model of binaural localization and separa- tion,

    R. Lyon, “A computational model of binaural localization and separa- tion,” inProc. of ICASSP, vol. 8, 1983, pp. 1148–1151

  6. [6]

    Wang and G

    D. Wang and G. J. Brown,Computational auditory scene analysis: Principles, algorithms, and applications. Wiley-IEEE press, 2006

  7. [7]

    Auditory segmentation based on onset and offset analysis,

    G. Hu and D. Wang, “Auditory segmentation based on onset and offset analysis,”IEEE/ACM transactions on audio, speech, and language processing, vol. 15, no. 2, pp. 396–405, 2007

  8. [8]

    Deep clustering: Discriminative embeddings for segmentation and separation,

    J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” inProc. of ICASSP, 2016, pp. 31–35

  9. [9]

    Single-Channel Multi-Speaker Separation using Deep Clustering

    Y . Isik, J. L. Roux, Z. Chen, S. Watanabe, and J. R. Hershey, “Single- channel multi-speaker separation using deep clustering,”arXiv preprint arXiv:1607.02173, 2016

  10. [10]

    Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker- independent speech separation,

    Z.-Q. Wang, J. Le Roux, and J. R. Hershey, “Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker- independent speech separation,” inProc. of ICASSP, 2018, pp. 1–5

  11. [11]

    Deep attractor network for single- microphone speaker separation,

    Z. Chen, Y . Luo, and N. Mesgarani, “Deep attractor network for single- microphone speaker separation,” inProc. of ICASSP, 2017, pp. 246–250

  12. [12]

    Supervised speech separation based on deep learning: An overview,

    D. Wang and J. Chen, “Supervised speech separation based on deep learning: An overview,”IEEE/ACM transactions on audio, speech, and language processing, vol. 26, no. 10, pp. 1702–1726, 2018

  13. [13]

    Speaker-independent speech separation with deep attractor network,

    Y . Luo, Z. Chen, and N. Mesgarani, “Speaker-independent speech separation with deep attractor network,”IEEE/ACM transactions on Audio, Speech, and Language Processing, vol. 26, no. 4, pp. 787–796, 2018

  14. [14]

    Permutation invariant training of deep models for speaker-independent multi-talker speech separation,

    D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” inProc. of ICASSP, 2017, pp. 241–245

  15. [15]

    Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,

    M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, “Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, pp. 1901–1913, 2017

  16. [16]

    Tasnet: time-domain audio separation network for real-time, single-channel speech separation,

    Y . Luo and N. Mesgarani, “Tasnet: time-domain audio separation network for real-time, single-channel speech separation,” inProc. of ICASSP, 2018, pp. 696–700

  17. [17]

    Dual-path rnn: efficient long sequence modeling for time-domain single-channel speech separation,

    Y . Luo, Z. Chen, and T. Yoshioka, “Dual-path rnn: efficient long sequence modeling for time-domain single-channel speech separation,” inProc. of ICASSP, 2020, pp. 46–50

  18. [18]

    Sudo rm-rf: Efficient networks for universal audio source separation,

    E. Tzinis, Z. Wang, and P. Smaragdis, “Sudo rm-rf: Efficient networks for universal audio source separation,” inProc. of MLSP, 2020, pp. 1–6

  19. [19]

    Conv-tasnet: Surpassing ideal time– frequency magnitude masking for speech separation,

    Y . Luo and N. Mesgarani, “Conv-tasnet: Surpassing ideal time– frequency magnitude masking for speech separation,”IEEE/ACM trans- actions on audio, speech, and language processing, vol. 27, no. 8, pp. 1256–1266, 2019

  20. [20]

    Wavesplit: End-to-end speech separation by speaker clustering,

    N. Zeghidour and D. Grangier, “Wavesplit: End-to-end speech separation by speaker clustering,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 2840–2849, 2021

  21. [21]

    Dual-path rnn for long recording speech separation,

    C. Li, Y . Luo, C. Han, J. Li, T. Yoshioka, T. Zhou, M. Delcroix, K. Kinoshita, C. Boeddeker, Y . Qianet al., “Dual-path rnn for long recording speech separation,” inProc. of SLT, 2021, pp. 865–872

  22. [22]

    Dual-path transformer network: Direct context-aware modeling for end-to-end monaural speech separation,

    J. Chen, Q. Mao, and D. Liu, “Dual-path transformer network: Direct context-aware modeling for end-to-end monaural speech separation,” arXiv preprint arXiv:2007.13975, 2020

  23. [23]

    Attention is all you need in speech separation,

    C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi, and J. Zhong, “Attention is all you need in speech separation,” inProc. of ICASSP, 2021, pp. 21–25

  24. [24]

    An efficient encoder-decoder architec- ture with top-down attention for speech separation,

    K. Li, R. Yang, and X. Hu, “An efficient encoder-decoder architec- ture with top-down attention for speech separation,”arXiv preprint arXiv:2209.15200, 2022

  25. [25]

    Qdpn-quasi-dual-path network for single- channel speech separation

    J. Rixen and M. Renz, “Qdpn-quasi-dual-path network for single- channel speech separation.” inProc. of Interspeech, 2022, pp. 5353– 5357

  26. [26]

    Spgm: Prioritizing local features for enhanced speech separation performance,

    J. Q. Yip, S. Zhao, Y . Ma, C. Ni, C. Zhang, H. Wang, T. H. Nguyen, K. Zhou, D. Ng, E. S. Chnget al., “Spgm: Prioritizing local features for enhanced speech separation performance,” inProc. of ICASSP, 2024, pp. 326–330

  27. [27]

    Mossformer: Pushing the performance limit of monaural speech separation using gated single-head transformer with convolution-augmented joint self-attentions,

    S. Zhao and B. Ma, “Mossformer: Pushing the performance limit of monaural speech separation using gated single-head transformer with convolution-augmented joint self-attentions,” inProc. of ICASSP, 2023, pp. 1–5. 15

  28. [28]

    Mossformer2: Combining transformer and rnn-free recurrent network for enhanced time-domain monaural speech separation,

    S. Zhao, Y . Ma, C. Ni, C. Zhang, H. Wang, T. H. Nguyen, K. Zhou, J. Q. Yip, D. Ng, and B. Ma, “Mossformer2: Combining transformer and rnn-free recurrent network for enhanced time-domain monaural speech separation,” inProc. of ICASSP, 2024, pp. 10 356–10 360

  29. [29]

    V oiceFilter: Tar- geted V oice Separation by Speaker-Conditioned Spectrogram Masking,

    Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J. R. Hershey, R. A. Saurous, R. J. Weiss, Y . Jia, and I. L. Moreno, “V oiceFilter: Tar- geted V oice Separation by Speaker-Conditioned Spectrogram Masking,” inProc. of Interspeech, 2019, pp. 2728–2732

  30. [30]

    Speakerbeam: Speaker aware neural network for target speaker extraction in speech mixtures,

    K. ˇZmol´ıkov´a, M. Delcroix, K. Kinoshita, T. Ochiai, T. Nakatani, L. Bur- get, and J. ˇCernock`y, “Speakerbeam: Speaker aware neural network for target speaker extraction in speech mixtures,”IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 4, pp. 800–814, 2019

  31. [31]

    Improving speaker discrimination of target speech extraction with time-domain speakerbeam,

    M. Delcroix, T. Ochiai, K. Zmolikova, K. Kinoshita, N. Tawara, T. Nakatani, and S. Araki, “Improving speaker discrimination of target speech extraction with time-domain speakerbeam,” inProc. of ICASSP, 2020, pp. 691–695

  32. [32]

    A unified framework for low-latency speaker extraction in cocktail party environments

    Y . Hao, J. Xu, J. Shi, P. Zhang, L. Qin, and B. Xu, “A unified framework for low-latency speaker extraction in cocktail party environments.” in Proc. of Interspeech, 2020, pp. 1431–1435

  33. [33]

    Atss-Net: Target Speaker Separation via Attention-Based Neural Network,

    T. Li, Q. Lin, Y . Bao, and M. Li, “Atss-Net: Target Speaker Separation via Attention-Based Neural Network,” inProc. of Interspeech, 2020, pp. 1411–1415

  34. [34]

    X-tasnet: Robust and accurate time- domain speaker extraction network,

    Z. Zhang, B. He, and Z. Zhang, “X-tasnet: Robust and accurate time- domain speaker extraction network,” inProc. of Interspeech, 2020

  35. [35]

    Spex: Multi-scale time domain speaker extraction network,

    C. Xu, W. Rao, E. S. Chng, and H. Li, “Spex: Multi-scale time domain speaker extraction network,”IEEE/ACM transactions on audio, speech, and language processing, vol. 28, pp. 1370–1384, 2020

  36. [36]

    Spex+: A complete time domain speaker extraction network,

    M. Ge, C. Xu, L. Wang, E. S. Chng, J. Dang, and H. Li, “Spex+: A complete time domain speaker extraction network,” inProc. of Interspeech, 2020, pp. 1406–1410

  37. [37]

    Neural speaker extraction with speaker-speech cross-attention network

    W. Wang, C. Xu, M. Ge, and H. Li, “Neural speaker extraction with speaker-speech cross-attention network.” inProc. of Interspeech, 2021, pp. 3535–3539

  38. [38]

    Multi-stage speaker extraction with utterance and frame-level reference signals,

    M. Ge, C. Xu, L. Wang, E. S. Chng, J. Dang, and H. Li, “Multi-stage speaker extraction with utterance and frame-level reference signals,” in Proc. of ICASSP, 2021, pp. 6109–6113

  39. [39]

    X-sepformer: End-to-end speaker extraction network with explicit optimization on speaker confusion,

    K. Liu, Z. Du, X. Wan, and H. Zhou, “X-sepformer: End-to-end speaker extraction network with explicit optimization on speaker confusion,” in Proc. of ICASSP, 2023, pp. 1–5

  40. [40]

    X-tf-gridnet: A time–frequency domain target speaker extraction network with adaptive speaker embedding fusion,

    F. Hao, X. Li, and C. Zheng, “X-tf-gridnet: A time–frequency domain target speaker extraction network with adaptive speaker embedding fusion,”Information Fusion, vol. 112, p. 102550, 2024

  41. [41]

    New insights on target speaker extraction,

    M. Elminshawi, W. Mack, S. R. Chetupalli, S. Chakrabarty, and E. A. Habets, “New insights on target speaker extraction,”arXiv preprint arXiv:2202.00733, 2022

  42. [42]

    Sef-net: Speaker embedding free target speaker extraction network,

    B. Zeng, H. Suo, Y . Wan, and M. Li, “Sef-net: Speaker embedding free target speaker extraction network,” inProc. of Interspeech, 2023, pp. 3452–3456

  43. [43]

    Smma-net: An audio clue- based target speaker extraction network with spectrogram matching and mutual attention,

    Y . Hu, H. Xu, Z. Guo, H. Huang, and L. He, “Smma-net: An audio clue- based target speaker extraction network with spectrogram matching and mutual attention,” inProc. of ICASSP, 2024, pp. 1496–1500

  44. [44]

    Target speaker extraction by directly exploiting contextual information in the time-frequency domain,

    X. Yang, C. Bao, J. Zhou, and X. Chen, “Target speaker extraction by directly exploiting contextual information in the time-frequency domain,” inProc. of ICASSP, 2024, pp. 10 476–10 480

  45. [45]

    Usef-tse: Universal speaker embedding free target speaker extraction,

    B. Zeng and M. Li, “Usef-tse: Universal speaker embedding free target speaker extraction,”IEEE Transactions on Audio, Speech and Language Processing, vol. 33, pp. 2110–2124, 2025

  46. [46]

    Bridging the gap between monaural speech enhancement and recognition with distortion-independent acoustic mod- eling,

    P. Wang, K. Tanet al., “Bridging the gap between monaural speech enhancement and recognition with distortion-independent acoustic mod- eling,”IEEE/ACM Transactions on Audio, Speech, and Language Pro- cessing, vol. 28, pp. 39–48, 2019

  47. [47]

    Target speech extraction with conditional diffusion model,

    N. Kamo, M. Delcroix, and T. Nakatani, “Target speech extraction with conditional diffusion model,” inProceeding of Interspeech, 2023, pp. 176–180

  48. [48]

    Tokensplit: Using discrete speech representations for direct, refined, and transcript-conditioned speech separation and recognition,

    H. Erdogan, S. Wisdom, X. Chang, Z. Borsos, M. Tagliasacchi, N. Zeghidour, and J. R. Hershey, “Tokensplit: Using discrete speech representations for direct, refined, and transcript-conditioned speech separation and recognition,” inProceeding of Interspeech, 2023, pp. 3462–3466

  49. [49]

    Anyenhance: A unified generative model with prompt-guidance and self-critic for voice enhancement,

    J. Zhang, J. Yang, Z. Fang, Y . Wang, Z. Zhang, Z. Wang, F. Fan, and Z. Wu, “Anyenhance: A unified generative model with prompt-guidance and self-critic for voice enhancement,”arXiv preprint arXiv:2501.15417, 2025

  50. [50]

    Llase-g1: Incentivizing generalization capability for llama-based speech enhancement,

    B. Kang, X. Zhu, Z. Zhang, Z. Ye, M. Liu, Z. Wang, Y . Zhu, G. Ma, J. Chen, L. Xiaoet al., “Llase-g1: Incentivizing generalization capability for llama-based speech enhancement,”arXiv preprint arXiv:2503.00493, 2025

  51. [51]

    Dual-channel target speaker extraction based on conditional variational autoencoder and directional infor- mation,

    R. Wang, L. Li, and T. Toda, “Dual-channel target speaker extraction based on conditional variational autoencoder and directional infor- mation,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 1968–1979, 2024

  52. [52]

    Tselm: Target speaker extraction using discrete tokens and language models,

    B. Tang, B. Zeng, and M. Li, “Tselm: Target speaker extraction using discrete tokens and language models,”arXiv preprint arXiv:2409.07841, 2024

  53. [53]

    Wavlm: Large-scale self-supervised pre- training for full stack speech processing,

    S. Chen, C. Wang, Z. Chen, Y . Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y . Qian, Y . Qian, J. Wu, M. Zeng, X. Yu, and F. Wei, “Wavlm: Large-scale self-supervised pre- training for full stack speech processing,”IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

  54. [54]

    Speechx: Neural codec lan- guage model as a versatile speech transformer,

    X. Wang, M. Thakker, Z. Chen, N. Kanda, S. E. Eskimez, S. Chen, M. Tang, S. Liu, J. Li, and T. Yoshioka, “Speechx: Neural codec lan- guage model as a versatile speech transformer,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024

  55. [55]

    Lauratse: Target speaker extraction using auto-regressive decoder-only language models,

    B. Tang, B. Zeng, and M. Li, “Lauratse: Target speaker extraction using auto-regressive decoder-only language models,”arXiv preprint arXiv:2504.07402, 2025

  56. [56]

    V oicefilter-lite: Streaming targeted voice separation for on-device speech recognition,

    Q. Wang, I. L. Moreno, M. Saglam, K. Wilson, A. Chiao, R. Liu, Y . He, W. Li, J. Pelecanos, M. Nikaet al., “V oicefilter-lite: Streaming targeted voice separation for on-device speech recognition,”arXiv preprint arXiv:2009.04323, 2020

  57. [57]

    Deep residual learning for image recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” inProc. of ICPR, 2016, pp. 770–778

  58. [58]

    In defence of metric learning for speaker recognition,

    J. S. Chung, J. Huh, S. Mun, M. Lee, H. S. Heo, S. Choe, C. Ham, S. Jung, B.-J. Lee, and I. Han, “In defence of metric learning for speaker recognition,” inProc. of Interspeech, 2020

  59. [59]

    Generalized end-to-end loss for speaker verification,

    L. Wan, Q. Wang, A. Papir, and I. L. Moreno, “Generalized end-to-end loss for speaker verification,” inProc. of ICASSP, 2018, pp. 4879–4883

  60. [60]

    Arcface: Additive angular margin loss for deep face recognition,

    J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” inProc. of ICPR, 2019, pp. 4690–4699

  61. [61]

    Single-channel speech extraction using speaker inventory and attention network,

    X. Xiao, Z. Chen, T. Yoshioka, H. Erdogan, C. Liu, D. Dimitriadis, J. Droppo, and Y . Gong, “Single-channel speech extraction using speaker inventory and attention network,” inProc. of ICASSP, 2019, pp. 86–90

  62. [62]

    Target speaker extraction with ultra-short reference speech by ve-ve framework,

    L. Yang, W. Liu, L. Tan, J. Yang, and H.-G. Moon, “Target speaker extraction with ultra-short reference speech by ve-ve framework,” in Proc. of ICASSP, 2023, pp. 1–5

  63. [63]

    X-vectors: Robust dnn embeddings for speaker recognition,

    D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust dnn embeddings for speaker recognition,” inProc. of ICASSP, 2018, pp. 5329–5333

  64. [64]

    Tabe: Decoupling spatial and spectral processing with taylor’s unfolding method in the beamspace domain for multi-channel speech enhancement,

    A. Li, G. Yu, Z. Xu, C. Fan, X. Li, and C. Zheng, “Tabe: Decoupling spatial and spectral processing with taylor’s unfolding method in the beamspace domain for multi-channel speech enhancement,”Information Fusion, vol. 101, p. 101976, 2024

  65. [65]

    Diffusion-based generative speech source separation,

    R. Scheibler, Y . Ji, S.-W. Chung, J. Byun, S. Choe, and M.-S. Choi, “Diffusion-based generative speech source separation,” inICASSP 2023- 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

  66. [66]

    Speech enhancement and dereverberation with diffusion-based genera- tive models,

    J. Richter, S. Welker, J.-M. Lemercier, B. Lay, and T. Gerkmann, “Speech enhancement and dereverberation with diffusion-based genera- tive models,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2351–2364, 2023

  67. [67]

    Conditional diffusion probabilistic model for speech enhancement,

    Y .-J. Lu, Z.-Q. Wang, S. Watanabe, A. Richard, C. Yu, and Y . Tsao, “Conditional diffusion probabilistic model for speech enhancement,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Ieee, 2022, pp. 7402–7406

  68. [68]

    Storm: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation,

    J.-M. Lemercier, J. Richter, S. Welker, and T. Gerkmann, “Storm: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2724–2737, 2023

  69. [69]

    Target speech extraction with conditional diffusion model,

    N. Kamo, M. Delcroix, and T. Nakatani, “Target speech extraction with conditional diffusion model,” inInterspeech 2023, 2023, pp. 176–180

  70. [70]

    Ddtse: Discriminative diffusion model for target speech extraction,

    L. Zhang, Y . Qian, L. Yu, H. Wang, H. Yang, S. Liu, L. Zhou, and Y . Qian, “Ddtse: Discriminative diffusion model for target speech extraction,” in2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 294–301

  71. [71]

    Target speaker extraction based on conditional variational autoencoder and directional information in un- derdetermined condition,

    R. Wang, L. Li, and T. Toda, “Target speaker extraction based on conditional variational autoencoder and directional information in un- derdetermined condition,”IEICE Technical Report; IEICE Tech. Rep., vol. 121, no. 383, pp. 76–81, 2022

  72. [72]

    Dual-channel target speaker extraction based on conditional variational autoencoder and directional information,

    ——, “Dual-channel target speaker extraction based on conditional variational autoencoder and directional information,”IEEE/ACM Trans- actions on Audio, Speech, and Language Processing, vol. 32, pp. 1968– 1979, 2024. 16

  73. [73]

    Bert: Pre-training of deep bidirectional transformers for language understanding,

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” inPro- ceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019, pp. 4171–4186

  74. [74]

    Language mod- els are few-shot learners,

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askellet al., “Language mod- els are few-shot learners,”Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020

  75. [75]

    Exploring the limits of transfer learning with a unified text-to-text transformer,

    C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,”Journal of machine learning research, vol. 21, no. 140, pp. 1–67, 2020

  76. [76]

    Funcodec: A fundamental, reproducible and integrable open-source toolkit for neural speech codec,

    Z. Du, S. Zhang, K. Hu, and S. Zheng, “Funcodec: A fundamental, reproducible and integrable open-source toolkit for neural speech codec,” inProceeding of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 591–595

  77. [77]

    LauraGPT: Listen, attend, understand, and regenerate audio with GPT

    Z. Du, J. Wang, Q. Chen, Y . Chu, Z. Gao, Z. Li, K. Hu, X. Zhou, J. Xu, Z. Maet al., “Lauragpt: Listen, attend, understand, and regenerate audio with gpt,”arXiv preprint arXiv:2310.04673, 2023

  78. [78]

    Conformer: Convolution- augmented transformer for speech recognition,

    A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y . Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y . Wu, and R. Pang, “Conformer: Convolution- augmented transformer for speech recognition,” inProceeding of Inter- speech, 2020, pp. 5036–5040

  79. [79]

    Librispeech: an asr corpus based on public domain audio books,

    V . Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” inProceeding of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210

  80. [80]

    LibriMix: An open-source dataset for generalizable speech separation

    J. Cosentino, M. Pariente, S. Cornell, A. Deleforge, and E. Vincent, “Librimix: An open-source dataset for generalizable speech separation,” arXiv preprint arXiv:2005.11262, 2020

Showing first 80 references.