VisualVoice: Audio-visual speech separation with cross-modal consistency
1 Pith paper cites this work. Polarity classification is still indexing.
fields: cs.SD (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)
Representative citing papers
IsoNet: Spatially-aware audio-visual target speech extraction in complex acoustic environments
IsoNet combines multi-channel STFT, GCC-PHAT cues, face embeddings and DOA supervision in a U-Net to deliver 9.31 dB SI-SDR on simulated -1 to 10 dB SNR mixtures, outperforming oracle beamformers.