Discriminative-Generative Target Speaker Extraction with Decoder-Only Language Models
Pith reviewed 2026-05-21 15:38 UTC · model grok-4.3
The pith
A two-stage framework pairs a discriminative front-end for interference suppression with a generative decoder-only language model back-end for speech reconstruction to improve perceptual quality and speaker consistency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the discriminative-generative two-stage framework, in which a discriminative front-end first produces target-related representations with strong interference suppression and a generative back-end then reconstructs high-quality speech in neural audio codec representation space, achieves a better balance among perceptual quality, intelligibility, and speaker consistency than purely discriminative or purely generative baselines on TSE and SE benchmarks. The work also examines several ways the two stages can work together, including front-end freezing, joint fine-tuning, SI-SDR regularization, and choices between autoregressive and non-autoregressive inference.
What carries the argument
The two-stage architecture in which a discriminative front-end supplies target-related representations for interference suppression and a generative back-end based on an autoregressive decoder-only language model reconstructs the final speech inside neural audio codec space.
If this is right
- The framework keeps the interference suppression strength of discriminative models while adding the natural reconstruction strength of generative models.
- Strategies such as freezing the front-end, joint fine-tuning, and adding SI-SDR regularization can be used to adjust how the stages interact.
- Gains appear on both dedicated target speaker extraction tasks and broader speech enhancement under noise.
- Conditioning the generative stage on discriminative representations reduces the hallucination and content drift that appear in purely generative systems.
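SI-SDR regularization appears among the collaboration strategies, so a minimal sketch of the metric itself may be useful. It follows the standard scale-invariant SDR definition, not anything specified in the paper; the function name `si_sdr` and the `eps` stabilizer are illustrative assumptions, and how the paper weights this term during training is not described here.

```python
import math


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB for two equal-length sample sequences.

    Illustrative helper (hypothetical name), not the paper's implementation.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")

    # Remove the mean so a constant offset is not counted as distortion.
    mean_e = sum(estimate) / len(estimate)
    mean_r = sum(reference) / len(reference)
    e = [x - mean_e for x in estimate]
    r = [x - mean_r for x in reference]

    # Project the estimate onto the reference to find the optimal scaling.
    dot = sum(a * b for a, b in zip(e, r))
    ref_energy = sum(b * b for b in r)
    alpha = dot / (ref_energy + eps)

    # Scaled reference is the "target"; the remainder is the distortion.
    target = [alpha * b for b in r]
    noise = [a - t for a, t in zip(e, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)

    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))
```

Because the reference is rescaled before the comparison, multiplying the estimate by a constant gain leaves the score essentially unchanged, which is why the metric is usually preferred over plain SDR for time-domain training.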
Where Pith is reading between the lines
- The same intermediate representations could let researchers swap in different front-end or back-end models for other audio separation problems.
- Switching to non-autoregressive inference in the back-end might support lower-latency uses while retaining most of the quality gain.
- Scaling the size of the decoder-only language model in the back-end could produce further lifts in naturalness for the same front-end output.
Load-bearing premise
The generative back-end can reconstruct accurate high-quality speech from the discriminative front-end's representations without adding hallucinations or shifting the spoken content even under complex noisy or overlapping conditions.
What would settle it
A direct comparison on overlapping multi-speaker mixtures showing higher word error rates or lower speaker similarity scores for the two-stage outputs than for a strong discriminative baseline alone would disprove the claimed balance.
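The two quantities named in this test can be sketched as follows, assuming transcripts are plain whitespace-separated strings and speaker embeddings are plain lists of floats. The function names `word_error_rate` and `cosine_similarity` are hypothetical; the ASR system and the speaker-embedding extractor that would produce the inputs are not specified by the review and are left out.

```python
import math


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
```

Under this sketch, the claimed balance would be in doubt if the two-stage outputs showed a higher `word_error_rate` against ground-truth transcripts, or a lower `cosine_similarity` to the enrollment embedding, than a discriminative baseline on the same mixtures.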
Original abstract
Target speaker extraction (TSE) aims to recover the speech of a desired speaker from a mixture given a short enrollment utterance, while speech enhancement (SE) focuses on improving speech quality under noisy conditions. Most existing TSE and SE systems are based on discriminative modeling and have shown strong interference suppression ability, but they often remain limited in perceptual quality and naturalness. To address this issue, we first introduce LauraTSE, a generative TSE model built on an autoregressive decoder-only language model. Although generative modeling is promising for quality enhancement, purely generative TSE may suffer from hallucination, content drift, and limited controllability in complex acoustic conditions. We therefore propose a discriminative-generative two-stage framework, where a discriminative front-end first produces target-related representations with strong interference suppression, and a generative back-end then reconstructs high-quality speech in the neural audio codec representation space. This design combines the controllability of discriminative extraction with the reconstruction capability of generative modeling. We further investigate several collaboration strategies for the two-stage framework, including front-end freezing, joint fine-tuning, SI-SDR regularization, and autoregressive/non-autoregressive inference. Experimental results on both TSE and SE benchmarks show that the proposed framework achieves a better balance among perceptual quality, intelligibility, and speaker consistency than purely discriminative or purely generative baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LauraTSE, a generative TSE model based on an autoregressive decoder-only language model, and proposes a discriminative-generative two-stage framework for target speaker extraction (TSE) and speech enhancement (SE). A discriminative front-end first produces target-related representations for strong interference suppression; a generative back-end then reconstructs high-quality speech in neural audio codec space. The authors explore collaboration strategies including front-end freezing, joint fine-tuning, SI-SDR regularization, and AR/NAR inference, claiming that the hybrid framework achieves a better balance among perceptual quality, intelligibility, and speaker consistency than purely discriminative or purely generative baselines on TSE and SE benchmarks.
Significance. If the central claim holds with rigorous verification that content is preserved, the hybrid approach could meaningfully advance audio and speech processing by combining the controllability of discriminative models with the naturalness of generative reconstruction, potentially improving real-world robustness in noisy or multi-speaker conditions.
major comments (2)
- §3 (two-stage framework description): the claim that the generative back-end reliably reconstructs without hallucination or content drift rests on an unverified transfer property of the interface; no ablation isolates semantic fidelity via ASR-WER on held-out transcripts or embedding similarity to enrollment utterances, even though standard metrics (PESQ, STOI, speaker similarity) can improve while content drifts under reverberation or mismatch.
- §4 (experimental results): the reported balance on TSE/SE benchmarks lacks error bars, dataset split details, and direct comparisons that would confirm the generative stage does not introduce drift; without these, it is unclear whether the improvement over baselines is robust or dependent on post-hoc choices in collaboration strategies.
minor comments (1)
- Abstract: quantitative metrics and dataset names are omitted, making it harder for readers to immediately gauge the strength of the balance claim.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We have carefully considered the concerns regarding verification of content preservation and experimental rigor in the two-stage framework. Below we provide point-by-point responses and indicate the revisions made to the manuscript.
- Referee: §3 (two-stage framework description): the claim that the generative back-end reliably reconstructs without hallucination or content drift rests on an unverified transfer property of the interface; no ablation isolates semantic fidelity via ASR-WER on held-out transcripts or embedding similarity to enrollment utterances, even though standard metrics (PESQ, STOI, speaker similarity) can improve while content drifts under reverberation or mismatch.
  Authors: We agree that ASR-WER and embedding similarity provide valuable additional evidence for semantic fidelity. The original manuscript relied on speaker similarity and STOI as proxies for content preservation, which are standard in the TSE literature and showed consistent gains for the hybrid approach. To strengthen the claim, we have added an ablation study in the revised Section 3 (and Appendix) that reports ASR-WER on held-out transcripts from a pre-trained ASR model as well as cosine similarity between speaker embeddings extracted from the enrollment utterance and the reconstructed output. These new results confirm that the hybrid framework exhibits lower content drift than the purely generative baseline, particularly under reverberant conditions. We have also expanded the description of the interface to better articulate why the discriminative front-end representations support stable transfer to the generative back-end.
  Revision: yes
- Referee: §4 (experimental results): the reported balance on TSE/SE benchmarks lacks error bars, dataset split details, and direct comparisons that would confirm the generative stage does not introduce drift; without these, it is unclear whether the improvement over baselines is robust or dependent on post-hoc choices in collaboration strategies.
  Authors: We acknowledge the importance of these details for assessing robustness. In the revised manuscript we now report mean and standard deviation across three random seeds as error bars in all result tables. A new subsection in Section 4 explicitly describes the train/validation/test splits for each benchmark dataset. We have also added direct comparisons of the front-end alone, the generative back-end alone, and the full two-stage system (with and without SI-SDR regularization) on both clean and reverberant test conditions. These comparisons demonstrate that the generative stage improves perceptual quality without measurable content drift relative to the front-end output, and that the reported gains are not sensitive to the specific collaboration strategy chosen.
  Revision: yes
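The error-bar reporting described in the second response (mean and standard deviation across random seeds) can be sketched with the Python standard library. The helper names `summarize_runs` and `format_mean_std` are hypothetical, and the sample standard deviation is an assumption, since the rebuttal does not say which estimator is used.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(scores), statistics.stdev(scores)


def format_mean_std(scores, digits=2):
    """Format per-seed scores as 'mean ± std' for a results table cell."""
    mean, std = summarize_runs(scores)
    return f"{mean:.{digits}f} ± {std:.{digits}f}"
```

For instance, `format_mean_std([1.0, 2.0, 3.0])` yields `2.00 ± 1.00`; these inputs are placeholders for illustration and are not results from the paper.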
Circularity Check
No circularity: empirical framework with independent experimental validation
The paper proposes a two-stage discriminative-generative TSE framework and evaluates it on standard benchmarks using perceptual metrics. No equations, derivations, or fitted-parameter predictions appear in the abstract or described content. Claims rest on comparative results rather than any self-definitional reduction, self-citation chain, or ansatz smuggled via prior work. The collaboration strategies (freezing, joint fine-tuning, SI-SDR regularization) are presented as design choices and tested directly, without reducing to inputs by construction. This is a standard empirical architecture paper whose central claims remain falsifiable outside any internal fit.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We propose a discriminative–generative two-stage TSE framework, where a discriminative front-end first produces target-related representations... and a generative back-end then reconstructs high-quality speech in the neural audio codec representation space."
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, LogicNat_equiv_Nat (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "LauraTSE comprises a compact AR decoder-only LM that predicts coarse-grained target speech representations... together with a lightweight encoder-only LM designed to recover fine-grained acoustic details."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.