
arxiv: 2204.04811 · v2 · pith:WCQZXBMWnew · submitted 2022-04-11 · 📡 eess.AS · cs.SD

Listen only to me! How well can target speech extraction handle false alarms?

keywords: speaker, target, extraction, speech, alarms, cases, false, handle

Target speech extraction (TSE) extracts the speech of a target speaker from a mixture, given auxiliary clues characterizing the speaker, such as an enrollment utterance. TSE thus addresses the challenging problem of simultaneously performing separation and speaker identification. There has been much progress in extraction performance following the recent development of neural networks for speech enhancement and separation. Most studies have focused on processing mixtures in which the target speaker is actively speaking. In practice, however, the target speaker is sometimes silent, a situation referred to as the inactive speaker (IS) case. A typical TSE system tends to output a signal in IS cases, causing false alarms, which is a severe problem for the practical deployment of TSE systems. This paper aims at better understanding how well TSE systems can handle IS cases. We consider two approaches to dealing with IS: (1) training a system to directly output zero signals, or (2) detecting IS with an extra speaker verification module. We perform an extensive experimental comparison of these schemes in terms of extraction performance and IS detection using the LibriMix dataset and reveal their pros and cons.
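
As a rough illustration of the two approaches named in the abstract, the sketch below shows how an IS decision might be made after extraction. It is not the paper's implementation: the function names, the -40 dB energy threshold, and the 0.5 cosine-similarity threshold are all hypothetical, and the speaker embeddings are assumed to come from some speaker verification model that is not shown here.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def energy_db(signal: Sequence[float], eps: float = 1e-10) -> float:
    """Mean power of a signal in dB; eps avoids log(0) for all-zero output."""
    mean_power = sum(s * s for s in signal) / max(len(signal), 1)
    return 10.0 * math.log10(mean_power + eps)


def is_inactive_by_energy(extracted: Sequence[float],
                          threshold_db: float = -40.0) -> bool:
    """Approach (1), as assumed here: a system trained to output zeros in IS
    cases should produce near-silent output, so low energy signals IS.
    The threshold value is hypothetical."""
    return energy_db(extracted) < threshold_db


def is_inactive_by_verification(extracted_emb: Sequence[float],
                                enrollment_emb: Sequence[float],
                                threshold: float = 0.5) -> bool:
    """Approach (2), as assumed here: compare a speaker embedding of the
    extracted signal with the enrollment embedding; a low score signals IS.
    The threshold value is hypothetical."""
    return cosine_similarity(extracted_emb, enrollment_emb) < threshold


def gate_output(extracted: Sequence[float], inactive: bool) -> List[float]:
    """Replace the extracted signal with zeros when IS is detected."""
    return [0.0] * len(extracted) if inactive else list(extracted)
```

In such a setup, a false alarm would correspond to a non-silent output being passed through when the target speaker is in fact inactive; the thresholds trade false alarms against wrongly muting an active target.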


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Don't Listen to Me: A Lightweight, Low-Latency Model for Own-Voice Cancellation in Far-Field Speech Enhancement

    eess.AS 2026-06 unverdicted novelty 6.0

    Introduces own-voice cancellation as a complement to target speaker extraction and benchmarks lightweight 2 ms latency models for far-field speech enhancement.