pTSE-T: Presentation Target Speaker Extraction using Unaligned Text Cues

Jiahe Lei; Wei Xue; Xinyuan Qian; Xueyan Chen; Yifan Zhang; Zexu Pan; Ziyang Jiang

arXiv: 2411.03109 · v3 · submitted 2024-11-05 · cs.SD · cs.MM · eess.AS


Pith reviewed 2026-05-23 17:54 UTC · model grok-4.3

classification: cs.SD · cs.MM · eess.AS

keywords: target speaker extraction · unaligned text cues · semantic cues · presentation slides · audio mixtures · time-frequency masking · TPE network


The pith

Semantic cues from unaligned presentation text enable target speaker extraction from audio mixtures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that target speaker extraction can be conditioned on semantic cues derived from limited and unaligned text, such as condensed points from presentation slides. This is useful in practical scenarios like meetings and lectures where acquiring stronger cues like visual or spatial information is challenging. The proposed Text Prompt Extractor Network fuses these text cues with audio features to generate time-frequency masks that isolate the target speech. Results indicate effective extraction with specific metric improvements. This approach reduces reliance on aligned or pre-recorded auxiliary signals.

Core claim

The authors claim that by using semantic cues from limited and unaligned text contents to condition the TSE algorithm, their Text Prompt Extractor Network can facilitate accurate time-frequency mask generation, leading to effective extraction of the target speaker's speech in audio mixtures, as shown by the reported performance metrics.

What carries the argument

The Text Prompt Extractor Network (TPE) that fuses audio features with content-based semantic cues from text for time-frequency mask generation.

Load-bearing premise

Semantic cues extracted from limited and unaligned text are sufficient to guide accurate time-frequency mask generation for target speaker extraction.

What would settle it

Observing no improvement in extraction quality when text cues are replaced with random or unrelated text would indicate that the semantic information is not driving the mask generation.
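One way to run this control, sketched in Python under stated assumptions: pair every mixture with another utterance's slide text, score both conditions, and test whether the matched-text advantage could arise by chance. The helper names `mismatched_prompts` and `paired_sign_flip_pvalue` are hypothetical, the sign-flip permutation test is an illustrative choice rather than anything the paper reports, and the numbers in the usage lines are made-up placeholders, not results.

```python
import random


def mismatched_prompts(prompts):
    """Rotate the prompt list by one so each mixture gets another utterance's text.

    Assumes the prompts differ across utterances; otherwise a rotated
    prompt may still match its mixture.
    """
    if len(prompts) < 2:
        raise ValueError("need at least two prompts to build mismatched pairs")
    return prompts[1:] + prompts[:1]


def paired_sign_flip_pvalue(matched, mismatched, n_perm=10000, seed=0):
    """One-sided paired permutation test on per-utterance score differences.

    matched / mismatched: per-utterance scores (e.g. SI-SDRi in dB) obtained
    with the correct text cue and with an unrelated text cue.
    Returns (mean difference, p-value).
    """
    if len(matched) != len(mismatched) or not matched:
        raise ValueError("score lists must be non-empty and equally long")
    diffs = [m - u for m, u in zip(matched, mismatched)]
    observed = sum(diffs) / len(diffs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if permuted >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Placeholder scores for illustration only; NOT taken from the paper.
matched_scores = [9.0, 11.5, 10.2, 12.1, 8.7, 10.9, 11.3, 9.8]
mismatched_scores = [8.1, 10.0, 9.9, 10.5, 8.0, 9.2, 10.1, 9.0]
mean_gain, p_value = paired_sign_flip_pvalue(matched_scores, mismatched_scores)
print(f"mean gain from matched text: {mean_gain:.2f} dB, p = {p_value:.4f}")
```

A mean gain near zero, or a large p-value, would be the outcome that undercuts the claim that the semantic content of the text drives the mask.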

Figures

Figures reproduced from arXiv: 2411.03109 by Jiahe Lei, Wei Xue, Xinyuan Qian, Xueyan Chen, Yifan Zhang, Zexu Pan, Ziyang Jiang.

**Figure 2.** 1) The speech encoder that transforms mixed speech waveforms x(τ) into mixed speech embedding X(t). 2) A text encoder that encodes the text prompts of the slides into text embedding P_text. 3) A fusion layer that fuses X(t) and P_text to obtain the fused features F(t). 4) A mask estimator that takes the fused features F(t) as input and estimates a mask, which is then multiplied element-wise by X(t) to gene…

**Figure 2.** The structure of our proposed TPE network. The input consists of mixed speech waveform …

**Figure 3.** The structure of our proposed TSR. The wave …

**Figure 4.** The data processing pipeline for constructing our …

**Figure 5.** Comparison between our proposed TPE and DPRNN-TSR in terms of (a) SI-SDRi histogram; (b) average SI-SDRi …
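The four stages listed in the Figure 2 caption (speech encoder, text encoder, fusion layer, mask estimator) can be sketched as a toy forward pass. Everything below is an assumption made for illustration: the weights are random and untrained, the text encoder is a bag-of-words stand-in rather than a real language model, and the fusion uses a FiLM-style scale and shift only because it is a common conditioning choice; this page does not state which fusion the authors use. The sketch stops at the masked embedding because the caption is truncated at that point.

```python
import zlib

import numpy as np

DIM, FRAME, HOP = 32, 16, 8  # hypothetical sizes, chosen only for the demo
_rng = np.random.default_rng(0)
W_enc = _rng.standard_normal((FRAME, DIM)) * 0.1   # untrained stand-in weights
W_gamma = _rng.standard_normal((DIM, DIM)) * 0.1
W_beta = _rng.standard_normal((DIM, DIM)) * 0.1
W_mask = _rng.standard_normal((DIM, DIM)) * 0.1


def speech_encoder(x):
    """Mixed waveform x(tau) -> mixed speech embedding X(t), shape (T, DIM)."""
    n_frames = 1 + (len(x) - FRAME) // HOP
    frames = np.stack([x[i * HOP:i * HOP + FRAME] for i in range(n_frames)])
    return np.maximum(frames @ W_enc, 0.0)  # linear projection + ReLU


def text_encoder(prompt):
    """Slide text -> text embedding, shape (DIM,). Toy bag-of-words encoder."""
    vecs = [
        np.random.default_rng(zlib.crc32(tok.encode())).standard_normal(DIM)
        for tok in prompt.lower().split()
    ]
    # Mean pooling ignores word order and timing, so no alignment is needed.
    return np.mean(vecs, axis=0)


def fuse(X, p_text):
    """Fused features F(t): assumed FiLM-style scale and shift of X(t)."""
    gamma = 1.0 + p_text @ W_gamma
    beta = p_text @ W_beta
    return gamma * X + beta  # (DIM,) vectors broadcast over the T frames


def mask_estimator(F):
    """Fused features F(t) -> mask with values in (0, 1), same shape as X(t)."""
    return 1.0 / (1.0 + np.exp(-(F @ W_mask)))


def tpe_forward(x, prompt):
    X = speech_encoder(x)
    F = fuse(X, text_encoder(prompt))
    M = mask_estimator(F)
    return X, M, M * X  # element-wise masking of the mixture embedding


x = _rng.standard_normal(16000)  # random stand-in for a mixed waveform
X, M, masked = tpe_forward(x, "target speaker extraction unaligned text cues")
print(X.shape, M.shape, masked.shape)
```

Because the mask is applied to the mixture embedding rather than produced from the text alone, the text can only reweight what is already present in the audio, which is why the random-text control matters.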

Original abstract

Target Speaker Extraction (TSE) aims to extract the clean speech of the target speaker in an audio mixture, eliminating irrelevant background noise and speech. While prior work has explored various auxiliary cues including pre-recorded speech, visual information, and spatial information, the acquisition and selection of such strong cues are infeasible in many practical scenarios. Differently, in this paper, we condition the TSE algorithm on semantic cues extracted from limited and unaligned text contents, such as condensed points from a presentation slide. This method is particularly useful in scenarios like meetings, poster sessions, or lecture presentations, where acquiring other cues in real time may be challenging. To this end, we design two different networks. Specifically, our proposed Text Prompt Extractor Network (TPE) fuses audio features with content-based semantic cues to facilitate time-frequency mask generation to filter out extraneous noise. The experimental results show the efficacy in accurately extracting the target speaker's speech by utilizing semantic cues derived from limited and unaligned text, resulting in SI-SDRi of 12.16 dB, SDRi of 12.66 dB, PESQi of 0.830 and STOIi of 0.150.
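For readers unfamiliar with the headline metric, the sketch below computes SI-SDR and its improvement (SI-SDRi) using the standard scale-invariant definition: project the estimate onto the reference, treat the residual as distortion, and report the energy ratio in dB. SI-SDRi is the estimate's SI-SDR minus that of the unprocessed mixture. The function names and the `eps` stabilizer are assumptions for illustration; the paper's own evaluation code is not shown on this page.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB for 1-D signals."""
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    # Remove DC offsets so the projection is not biased by a constant shift.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scaling of the reference that best explains the estimate.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    distortion = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(distortion, distortion) + eps)
    return 10.0 * np.log10(ratio)


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: gain of the extracted signal over the raw mixture, in dB."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


# Synthetic demo signals, not data from the paper.
rng = np.random.default_rng(0)
reference = rng.standard_normal(16000)
interference = rng.standard_normal(16000)
mixture = reference + interference
estimate = reference + 0.1 * interference  # interference attenuated tenfold
print(f"SI-SDRi: {si_sdr_improvement(estimate, mixture, reference):.2f} dB")
```

Rescaling the estimate leaves the score unchanged, which is the point of the scale-invariant form: a system cannot raise its score simply by changing output gain.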

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper proposes using unaligned text from slides as a cue for target speaker extraction in presentations, but the abstract gives no way to check if the reported gains are real.


The main point is that this work conditions target speaker extraction on semantic cues pulled from limited unaligned text like slide points, instead of needing a reference recording or video. That targets a real gap in meetings or lectures where those stronger cues are impractical to get in real time. They introduce a TPE network that fuses audio features with the text-derived semantics to produce time-frequency masks for separation. The reported numbers are SI-SDRi of 12.16 dB, SDRi of 12.66 dB, PESQi of 0.830 and STOIi of 0.150, which on the surface look like usable gains for the setting. The idea itself is a straightforward extension of auxiliary-cue TSE to a new signal type, and it makes sense as a practical alternative when other cues are unavailable.

The soft spots are substantial, though. The abstract supplies no dataset description, no baseline comparisons, no training details, no ablations on the text cue itself, and no statistical tests, so there is no evidence that the text actually drives the improvement or that the numbers hold up under normal conditions. The core assumption that limited unaligned text is enough to guide accurate mask generation remains untested in the provided text.

This is for researchers working on speech separation who care about constrained real-world scenarios like presentations. A reader could pick up the cue idea if the full experiments are sound, but the current write-up is too thin to evaluate. It deserves a serious referee because the application is useful and the cue choice is distinct from prior work, even if heavy revision for methods and controls would be needed.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes pTSE-T, a target speaker extraction (TSE) method conditioned on semantic cues extracted from limited, unaligned text (e.g., condensed points from presentation slides). It introduces a Text Prompt Extractor Network (TPE) that fuses audio features with these content-based semantic cues to generate time-frequency masks for isolating the target speaker. The central claim is that this approach is effective in practical scenarios such as meetings or lectures where stronger cues are unavailable, supported by reported metrics of SI-SDRi 12.16 dB, SDRi 12.66 dB, PESQi 0.830, and STOIi 0.150.

Significance. If the experimental claims hold after full details are supplied, the work could address a practical gap in TSE by demonstrating that weak, unaligned textual semantic cues suffice for mask generation in constrained settings. This would be a modest but useful extension beyond prior auxiliary-cue methods. However, the absence of any dataset, baseline, architecture, or ablation information in the provided text prevents assessment of whether the result is novel, reproducible, or superior to existing approaches.

major comments (2)

[Abstract] The reported metric values (SI-SDRi 12.16 dB, SDRi 12.66 dB, PESQi 0.830, STOIi 0.150) are presented as evidence of efficacy, yet the text supplies no information whatsoever on datasets, baselines, training details, statistical tests, or potential confounds. These omissions are load-bearing for the central empirical claim.

[Abstract] The TPE network is described only at the level of 'fuses audio features with content-based semantic cues to facilitate time-frequency mask generation,' with no equations, architecture diagram, fusion mechanism, or implementation details. Without these, the technical contribution cannot be evaluated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their comments. We address each point below and will revise the manuscript to supply the requested details.


Referee: [Abstract] The reported metric values (SI-SDRi 12.16 dB, SDRi 12.66 dB, PESQi 0.830, STOIi 0.150) are presented as evidence of efficacy, yet the text supplies no information whatsoever on datasets, baselines, training details, statistical tests, or potential confounds. These omissions are load-bearing for the central empirical claim.

Authors: We agree that the abstract as written does not contain this information. The full manuscript contains dedicated experimental and implementation sections describing the datasets, baselines, training procedure, and evaluation protocol. In the revision we will expand the abstract with a brief reference to the evaluation setup and ensure all supporting details are clearly signposted from the opening paragraphs. Revision: yes.

Referee: [Abstract] The TPE network is described only at the level of 'fuses audio features with content-based semantic cues to facilitate time-frequency mask generation,' with no equations, architecture diagram, fusion mechanism, or implementation details. Without these, the technical contribution cannot be evaluated.

Authors: We acknowledge that the abstract provides only a high-level summary. The body of the manuscript includes the TPE architecture, fusion mechanism, and associated diagram. To improve self-containment we will revise the abstract to include a concise statement of the fusion approach and will ensure the methods section supplies the equations and implementation details required for reproducibility. Revision: yes.

Circularity Check

0 steps flagged

No significant circularity detected


The paper describes an empirical TSE method conditioned on semantic cues from unaligned text via a proposed TPE network for mask generation. No equations, derivations, fitted parameters renamed as predictions, or mathematical claims appear in the provided content. Reported metrics (SI-SDRi 12.16 dB etc.) are framed as experimental outcomes from the network architecture rather than reductions by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The work is self-contained as an empirical contribution with no derivation chain that reduces to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract.


