The Conversation: Deep Audio-Visual Speech Enhancement

Andrew Zisserman; Joon Son Chung; Triantafyllos Afouras

arxiv: 1804.04121 · v2 · pith:5G4ORXGGnew · submitted 2018-04-11 · 💻 cs.CV · cs.SD

Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman

keywords speakers, speech, audio-visual, deep, enhancement, environments, separate, able

Our goal is to isolate individual speakers from multi-talker simultaneous speech in videos. Existing works in this area have focussed on trying to separate utterances from known speakers in controlled environments. In this paper, we propose a deep audio-visual speech enhancement network that is able to separate a speaker's voice given lip regions in the corresponding video, by predicting both the magnitude and the phase of the target signal. The method is applicable to speakers unheard and unseen during training, and for unconstrained environments. We demonstrate strong quantitative and qualitative results, isolating extremely challenging real-world examples.
