My lips are concealed: Audio-visual speech enhancement through obstructions

Andrew Zisserman; Joon Son Chung; Triantafyllos Afouras

arXiv: 1907.04975 · v1 · pith:ZDE3DCZ6new · submitted 2019-07-11 · 💻 cs.CV · cs.SD · eess.AS

Pith reviewed 2026-05-24 23:36 UTC · model grok-4.3

classification 💻 cs.CV · cs.SD · eess.AS

keywords audio-visual speech enhancement · speech separation · occlusion handling · lip movements · voice representation · self-enrollment · speaker-independent

The pith

An audio-visual network separates a speaker's voice from mixtures even when lips are occluded by conditioning on lip movements or a learned voice representation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a method to isolate one person's speech from a noisy mixture that may include other voices or background sounds. The key is a neural network that can condition on the visible movements of the speaker's lips, on a stored representation of their voice, or on both; the voice representation can be built automatically if enough unobstructed video is available. Training mixes audio from different sources and artificially covers the mouth region, so that the system learns to use both modalities and copes when the lips are hidden. Because it does not need to know the speaker beforehand, it works on new people. If successful, this would make speech enhancement more reliable in everyday situations where faces are partly blocked.

Core claim

The central claim is that a deep audio-visual speech enhancement network can separate a speaker's voice by conditioning on lip movements and/or a voice representation obtained by enrollment or self-enrollment from unobstructed visual input. Training uses audio blending and artificial occlusions to ensure generalization to real occlusions and to avoid visual dominance. The method is speaker-independent and shows improvements on real examples of unheard and unseen speakers, particularly when visuals are occluded.

What carries the argument

The deep audio-visual speech enhancement network that conditions on the speaker's lip movements and a voice representation from enrollment or self-enrollment.
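As a rough illustration of what conditioning on both cues can look like, the sketch below tiles a single speaker embedding across time and concatenates it with per-frame audio and visual features. The function name `fuse_embeddings`, the list-of-lists layout, and concatenation as the fusion rule are all assumptions made for illustration; the Figure 2 caption says only that the three embeddings are combined. The sketch also assumes the audio and visual features have already been aligned to the same number of time steps.

```python
def fuse_embeddings(audio_feats, visual_feats, speaker_emb):
    """Concatenate per-frame audio and visual features with a tiled speaker embedding.

    audio_feats, visual_feats: lists of per-time-step feature vectors (hypothetical layout).
    speaker_emb: one vector for the whole utterance, repeated at every time step.
    """
    if len(audio_feats) != len(visual_feats):
        raise ValueError("audio and visual streams must have the same number of time steps")
    fused = []
    for audio_vec, visual_vec in zip(audio_feats, visual_feats):
        # The speaker embedding does not vary over time, so it is still
        # available at time steps where the visual features carry no lip cue.
        fused.append(list(audio_vec) + list(visual_vec) + list(speaker_emb))
    return fused


# Two time steps, 2-d audio features, 1-d visual features, 2-d speaker embedding.
fused = fuse_embeddings([[0.1, 0.2], [0.3, 0.4]], [[0.5], [0.6]], [0.9, 0.8])
```

Because the speaker vector is present at every time step, a downstream separator has a cue to fall back on during occluded frames, which is the behaviour the paper's claim depends on.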

If this is right

The approach works for speakers not encountered in training.
It handles cases where visual cues are temporarily absent.
It outperforms previous models especially under occlusion.
Self-enrollment enables voice representation learning without prior enrollment data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The self-enrollment mechanism could allow the system to adapt to new speakers in ongoing conversations without separate training sessions.
This method might be useful for enhancing audio in scenarios like masked speakers or video with temporary obstructions.
Extending the artificial occlusion training to other visual blocks could broaden applicability.

Load-bearing premise

Training by blending audios and introducing artificial occlusions around the mouth will enable generalization to real occlusions without the visual modality dominating the model on unseen data.
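A minimal sketch of the two augmentations this premise depends on, written in plain Python so it runs without dependencies. `blend_audio`, `occlude_mouth`, the SNR-based gain, the rectangular mask and the constant fill value are hypothetical choices for illustration, not the authors' implementation; the source says only that audios are blended and that artificial occlusions are placed around the mouth region.

```python
import math


def rms(signal):
    """Root-mean-square level of a waveform given as a list of floats."""
    return math.sqrt(sum(v * v for v in signal) / len(signal))


def blend_audio(target, interferer, snr_db=0.0):
    """Mix an interfering waveform into the target at the requested SNR (in dB)."""
    n = min(len(target), len(interferer))
    if n == 0:
        raise ValueError("both waveforms must be non-empty")
    target, interferer = target[:n], interferer[:n]
    # Scale the interferer so that rms(target) / rms(scaled interferer) matches the SNR.
    gain = rms(target) / (max(rms(interferer), 1e-12) * 10 ** (snr_db / 20.0))
    return [t + gain * i for t, i in zip(target, interferer)]


def occlude_mouth(frames, start, end, box, fill=0.0):
    """Return a copy of `frames` with a rectangle overwritten in frames [start, end).

    frames: list of 2-D grayscale frames (lists of rows).
    box: (top, left, height, width) of the assumed mouth region.
    """
    top, left, height, width = box
    out = [[list(row) for row in frame] for frame in frames]
    for frame in out[max(start, 0):end]:
        for row in frame[max(top, 0):top + height]:
            for c in range(max(left, 0), min(left + width, len(row))):
                row[c] = fill
    return out


# Build one synthetic training pair: mixed audio plus partially occluded video.
clean = [1.0, -1.0, 1.0, -1.0]
other = [0.5, 0.5, -0.5, -0.5]
video = [[[1.0] * 3 for _ in range(3)] for _ in range(2)]
mixture = blend_audio(clean, other, snr_db=0.0)
occluded = occlude_mouth(video, start=1, end=2, box=(1, 1, 2, 2))
```

The training target stays the clean waveform, so on occluded frames the only way to reduce the loss is to use the audio and any voice representation. That is the mechanism by which such masking could keep the visual stream from dominating.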

What would settle it

Measuring whether the network achieves better speech separation on real occluded videos of unseen speakers than audio-only or prior audio-visual baselines would test the claim.

Figures

Figures reproduced from arXiv: 1907.04975 by Andrew Zisserman, Joon Son Chung, Triantafyllos Afouras.

**Figure 1.** An audio-visual speech enhancement model may fail when the lip region is occluded by e.g. a microphone. In such cases the input audio is often entirely filtered out and the result is silent output over the occluded frames. The aim of our method is to be robust to this kind of occlusion.

**Figure 2.** The architecture of the audio-visual speech enhancement network: There are 2 audio streams. One processes the incoming noisy audio, while the other takes as input an enrollment audio sample and creates a speaker embedding that captures the speaker’s voice characteristics. A visual stream extracts frame-wise representations from the input video. The visual, speaker and audio embeddings are combined an…

**Figure 3.** Example frames of occluded videos used during training and evaluation.

**Figure 4.** Enhancement performance when occluding varying amounts of the visual input for the 2 Speakers and 3 Speakers scenarios. Model notations are explained in the caption of …

Original abstract

Our objective is an audio-visual model for separating a single speaker from a mixture of sounds such as other speakers and background noise. Moreover, we wish to hear the speaker even when the visual cues are temporarily absent due to occlusion. To this end we introduce a deep audio-visual speech enhancement network that is able to separate a speaker's voice by conditioning on both the speaker's lip movements and/or a representation of their voice. The voice representation can be obtained by either (i) enrollment, or (ii) by self-enrollment -- learning the representation on-the-fly given sufficient unobstructed visual input. The model is trained by blending audios, and by introducing artificial occlusions around the mouth region that prevent the visual modality from dominating. The method is speaker-independent, and we demonstrate it on real examples of speakers unheard (and unseen) during training. The method also improves over previous models in particular for cases of occlusion in the visual modality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Self-enrollment and synthetic occlusion training give a workable way to make AV speech separation robust to missing visuals, but transfer to real obstructions lacks quantitative checks.

The paper's main contribution is a training approach that uses artificial occlusions around the mouth and self-enrollment to learn voice representations on the fly, allowing the audio-visual model to separate speech even when lip movements are temporarily blocked. This addresses the issue of visual cues being absent due to obstructions. They do a good job making the system speaker-independent and testing it on real examples with speakers not seen in training. The method improves on previous models specifically in occlusion scenarios, and the idea of blending audios during training is a direct way to build robustness.

The soft spot is the generalization from those artificial occlusions to actual real-world ones. The stress test note is right that there's no controlled quantitative comparison shown for performance on held-out real obstructions versus the synthetic masks. The demonstrations are on real examples but appear to be qualitative only. If the paper has more detailed results in the full version, that would change the picture, but based on the description, this is the area that needs more evidence.

This kind of work is useful for people building multimodal speech systems that need to work in messy real conditions. A reader focused on practical robustness in AV models would find the self-enrollment and occlusion training strategy worth looking at. I think it deserves peer review because the core idea is sensible and targets a genuine problem in the field.

Referee Report

2 major / 1 minor

Summary. The paper introduces a deep audio-visual speech enhancement network for separating a single speaker's voice from mixtures of other speakers and noise. The network conditions on lip movements and/or a voice representation obtained either by enrollment or by self-enrollment (learning the representation on-the-fly from unobstructed visual input). Training uses audio blending together with artificial occlusions around the mouth region; the method is claimed to be speaker-independent and to improve over prior models particularly when visual input is occluded, with demonstrations on real examples of unseen speakers.

Significance. If the generalization from synthetic to real occlusions holds and quantitative gains are confirmed, the work would advance practical speaker-independent audio-visual speech separation by providing a mechanism to handle temporary visual obstructions without visual dominance, extending enrollment-based conditioning to on-the-fly self-enrollment.

major comments (2)

[Abstract] Abstract and evaluation: the central claim that artificial mouth occlusions during training enable generalization to real obstructions (and prevent visual dominance) is load-bearing, yet no quantitative comparison is reported between performance on held-out real occlusions versus the synthetic masks; the real-example demonstrations are described only qualitatively.
[Method / Training] Training strategy: the description of how blending audios plus artificial occlusions produces fallback to audio when visual cues are absent lacks a controlled ablation or metric (e.g., SI-SDR or PESQ on occluded vs. unoccluded test conditions) that would verify the claimed robustness.
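For readers who want to see what the metric named in these comments computes, here is a minimal sketch of scale-invariant SDR (SI-SDR) in plain Python. The function name `si_sdr`, the list-of-floats waveform representation and the `eps` stabiliser are assumptions for illustration. Note that the paper excerpt in the reference graph reports SDR computed with the BSS EVAL toolbox, which is a related but different metric.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    n = min(len(estimate), len(reference))
    if n == 0:
        raise ValueError("both waveforms must be non-empty")
    est, ref = estimate[:n], reference[:n]
    # Remove the mean of each signal so a DC offset does not affect the score.
    est_mean, ref_mean = sum(est) / n, sum(ref) / n
    est = [e - est_mean for e in est]
    ref = [r - ref_mean for r in ref]
    # Project the estimate onto the reference to get the scaled target component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    scale = dot / (ref_energy + eps)
    target = [scale * r for r in ref]
    # Whatever is left after removing the target component counts as distortion.
    residual = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


reference = [1.0, -1.0, 1.0, -1.0]
estimate = [1.1, -0.9, 0.9, -1.1]
score = si_sdr(estimate, reference)
```

A controlled ablation of the kind requested would compute this score per utterance under matched occluded and unoccluded conditions and compare the averages, rather than relying on listening examples.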

minor comments (1)

[Abstract] The abstract states improvement 'over previous models' without naming the baselines or reporting the quantitative metrics used for that comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly to include the requested quantitative evaluations.

Referee: [Abstract] Abstract and evaluation: the central claim that artificial mouth occlusions during training enable generalization to real obstructions (and prevent visual dominance) is load-bearing, yet no quantitative comparison is reported between performance on held-out real occlusions versus the synthetic masks; the real-example demonstrations are described only qualitatively.

Authors: We agree that a quantitative comparison would strengthen the central claim regarding generalization from synthetic to real occlusions. The manuscript currently supports this via qualitative demonstrations on real examples of unseen speakers. In revision we will add a quantitative evaluation (using SI-SDR and PESQ) on held-out real occlusion examples versus synthetic masks to directly verify the generalization and the avoidance of visual dominance.

revision: yes

Referee: [Method / Training] Training strategy: the description of how blending audios plus artificial occlusions produces fallback to audio when visual cues are absent lacks a controlled ablation or metric (e.g., SI-SDR or PESQ on occluded vs. unoccluded test conditions) that would verify the claimed robustness.

Authors: We will add a controlled ablation study to the revised manuscript. This will report SI-SDR and PESQ metrics on occluded versus unoccluded test conditions, confirming that the training strategy (audio blending combined with artificial occlusions) enables the model to fall back to audio when visual input is absent.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained empirical training

The paper describes an audio-visual speech enhancement network trained via audio blending and artificial mouth occlusions to handle temporary visual absence. No load-bearing step reduces by construction to its own inputs: there are no self-definitional equations, no fitted parameters renamed as predictions, and no self-citation chains invoked as uniqueness theorems. The central claims rest on the external training procedure and qualitative real-example demonstrations rather than any internal reduction or renaming of known results. This matches the default case of a non-circular empirical ML paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available, so ledger is incomplete. Training relies on audio blending and artificial occlusions as key strategies; these may involve unstated modeling choices or hyperparameters typical in deep learning but not detailed here.

pith-pipeline@v0.9.0 · 5701 in / 1240 out tokens · 52927 ms · 2026-05-24T23:36:45.686610+00:00 · methodology

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 3 internal anchors

[1] My lips are concealed: Audio-visual speech enhancement through obstructions. Introduction: While there has been great progress in the field of automatic speech recognition (ASR) in recent years, some key challenges remain, particularly the understanding of speech in very noisy environments or in cases where multiple people speak simultaneously. In this direction, isolating voices in multi-speaker scenarios, increasing the signal-t...
[2] Method: This section describes the architecture of the audio-visual speech enhancement network, which is given in Figure 2. The network receives three inputs: (i) the noisy audio to be enhanced; (ii) the corresponding video frames; (iii) a reference audio containing speech from the target speaker. We summarize the principal modules below. Details of th...
[3] Experimental Setup. Datasets: The network is trained on the MV-LRS [27], LRS2 [21], and LRS3 [22] datasets, and tested on LRS3. MV-LRS and LRS2 contain material from British television broadcasts, while LRS3 was created from videos of TED talks. The speakers appearing in LRS3 are to the best of our knowledge not seen in either of the other two datasets....
[4] Experiments. 4.1. Evaluation protocol: To evaluate the performance of our model we use the Signal to Distortion Ratio (SDR) [28], a common metric expressing the ratio between the energy of the target signal and of the errors contained in the enhanced output. Furthermore, to assess the intelligibility of the output, we use the Google Cloud ASR system – we ...
[5] Conclusion: In this paper, we proposed a deep audio-visual speech enhancement network that is able to separate a speaker’s voice by conditioning on both the speaker’s lip movements and/or a representation of their voice. The network is robust to partial occlusions, and the voice representation can be self-enrolled from the unoccluded part of the in...
[6] T. Afouras, J. S. Chung, and A. Zisserman, “The conversation: Deep audio-visual speech enhancement,” in Proc. Interspeech,
[7] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, “Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation,” SIGGRAPH, 2018.
[8] A. Owens and A. A. Efros, “Audio-visual scene analysis with self-supervised multisensory features,” in Proc. ECCV, 2018, pp. 631–
[9] Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J. Hershey, R. A. Saurous, R. J. Weiss, Y. Jia, and I. L. Moreno, “Voicefilter: Targeted voice separation by speaker-conditioned spectrogram masking,” in Proc. Interspeech, 2018.
[10] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in Proc. ICASSP. IEEE, 2016, pp. 31–35.
[11] D. Yu, M. Kolbæk, Z. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in Proc. ICASSP, 2017.
[12] A. M. Reddy and B. Raj, “Soft mask methods for single-channel speaker separation,” IEEE Transactions on Audio, Speech, and Language Processing, 2007.
[13] Z. Jin and D. Wang, “A supervised learning approach to monaural segregation of reverberant speech,” IEEE Transactions on Audio, Speech, and Language Processing, 2009.
[14] M. H. Radfar and R. M. Dansereau, “Single-channel speech separation using soft mask filtering,” IEEE Transactions on Audio, Speech, and Language Processing, 2007.
[15] S. Makino, T.-W. Lee, and H. Sawada, Blind speech separation. Springer, 2007.
[16] D. Wang and J. Chen, “Supervised speech separation based on deep learning: an overview,” IEEE Transactions on Audio, Speech and Language Processing, 2017.
[17] F. Khan and B. Milner, “Speaker separation using visually-derived binary masks,” in AVSP, 2013.
[18] W. Wang, D. Cosker, Y. Hicks, S. Saneit, and J. Chambers, “Video assisted speech source separation,” in Proc. ICASSP, 2005.
[19] L. Girin, J.-L. Schwartz, and G. Feng, “Audio-visual enhancement of speech in noise,” The Journal of the Acoustical Society of America, 2001.
[20] S. Deligne, G. Potamianos, and C. Neti, “Audio-visual speech enhancement with avcdcn (audio-visual codebook dependent cepstral normalization),” in Sensor Array and Multichannel Signal Processing Workshop Proceedings, 2002.
[21] J. R. Hershey and M. Casey, “Audio-visual sound separation via hidden markov models,” in NIPS, 2002.
[22] B. Rivet, W. Wang, S. M. Naqvi, and J. A. Chambers, “Audio-visual speech source separation: An overview of key methodologies,” IEEE Signal Processing Magazine, 2014.
[23] A. Gabbay, A. Ephrat, T. Halperin, and S. Peleg, “Seeing through noise: Visually driven speaker separation and enhancement,”
[24] A. Gabbay, A. Shamir, and S. Peleg, “Visual Speech Enhancement using Noise-Invariant Training,” arXiv preprint arXiv:1711.08789, 2017.
[25] J.-C. Hou, S.-S. Wang, Y.-H. Lai, Y. Tsao, H.-W. Chang, and H.-M. Wang, “Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks,” IEEE Transactions on Emerging Topics in Computational Intelligence, 2018.
[26] T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Deep audio-visual speech recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[27] T. Afouras, J. S. Chung, and A. Zisserman, “LRS3-TED: a large-scale dataset for visual speech recognition,” in arXiv preprint arXiv:1809.00496, 2018.
[28] T. Stafylakis and G. Tzimiropoulos, “Combining residual networks with LSTMs for lipreading,” in Proc. Interspeech, 2017.
[29] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR, 2016.
[30] W. Xie, A. Nagrani, J. S. Chung, and A. Zisserman, “Utterance-level aggregation for speaker recognition in the wild,” in Proc. ICASSP, 2019.
[31] J. S. Chung, A. Nagrani, and A. Zisserman, “VoxCeleb2: Deep speaker recognition,” in Proc. Interspeech, 2018.
[32] J. S. Chung and A. Zisserman, “Lip reading in profile,” in Proc. BMVC, 2017.
[33] C. Févotte, R. Gribonval, and E. Vincent, “BSS EVAL toolbox user guide,” IRISA Technical Report 1706. http://www.irisa.fr/metiss/bss_eval/, 2005.
