pith. sign in

arxiv: 1907.04975 · v1 · pith:ZDE3DCZ6new · submitted 2019-07-11 · 💻 cs.CV · cs.SD· eess.AS

My lips are concealed: Audio-visual speech enhancement through obstructions

Pith reviewed 2026-05-24 23:36 UTC · model grok-4.3

classification 💻 cs.CV cs.SDeess.AS
keywords audio-visual speech enhancementspeech separationocclusion handlinglip movementsvoice representationself-enrollmentspeaker-independent
0
0 comments X

The pith

An audio-visual network separates a speaker's voice from mixtures even when lips are occluded by conditioning on lip movements or a learned voice representation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a method to isolate one person's speech from a noisy mixture that may include other voices or background sounds. The key is a neural network that can use either the visible movements of the speaker's lips or a stored representation of their voice, and the voice profile can be built automatically if enough clear video is available. Training involves mixing different audios together and covering the mouth area artificially so the system learns to rely on both senses and handles cases where lips disappear. Because it does not need to know the speaker beforehand, it works on new people. If successful, this would make speech enhancement more reliable in everyday situations where faces are partly blocked.

Core claim

The central claim is that a deep audio-visual speech enhancement network can separate a speaker's voice by conditioning on lip movements and/or a voice representation obtained by enrollment or self-enrollment from unobstructed visual input. Training uses audio blending and artificial occlusions to ensure generalization to real occlusions and to avoid visual dominance. The method is speaker-independent and shows improvements on real examples of unheard and unseen speakers, particularly when visuals are occluded.

What carries the argument

The deep audio-visual speech enhancement network that conditions on the speaker's lip movements and a voice representation from enrollment or self-enrollment.

If this is right

  • The approach works for speakers not encountered in training.
  • It handles cases where visual cues are temporarily absent.
  • It outperforms previous models especially under occlusion.
  • Self-enrollment enables voice representation learning without prior enrollment data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The self-enrollment mechanism could allow the system to adapt to new speakers in ongoing conversations without separate training sessions.
  • This method might be useful for enhancing audio in scenarios like masked speakers or video with temporary obstructions.
  • Extending the artificial occlusion training to other visual blocks could broaden applicability.

Load-bearing premise

Training by blending audios and introducing artificial occlusions around the mouth will enable generalization to real occlusions without the visual modality dominating the model on unseen data.

What would settle it

Measuring whether the network achieves better speech separation on real occluded videos of unseen speakers than audio-only or prior audio-visual baselines would test the claim.

Figures

Figures reproduced from arXiv: 1907.04975 by Andrew Zisserman, Joon Son Chung, Triantafyllos Afouras.

Figure 1
Figure 1. Figure 1: An audio-visual speech enhancement model may fail when the lip region is occluded by e.g. a microphone. In such cases the input audio is often entirely filtered out and the result is silent output over the occluded frames. The aim of our method is to be robust to this kind of occlusions. remains an unsolved problem. With our approach, even par￾tially occluded video can provide information on the voice char… view at source ↗
Figure 2
Figure 2. Figure 2: The architecture of the audio-visual speech enhancement net￾work: There are 2 audio streams. The one processes the incoming noisy audio, while the other takes as input an enrollment audio sample and creates a speaker embedding that captures the speaker’s voice charac￾teristics. A visual stream extracts frame-wise representations from the input video. The visual, speaker and audio embeddings are combined an… view at source ↗
Figure 3
Figure 3. Figure 3: Example frames of occluded videos used during training end evaluation. where Mˆ , Φˆ and M∗ , Φ ∗ are the predicted and ground truth magnitude and phase spectrograms respectively, and T and F their time and frequency resolutions. 3. Experimental Setup Datasets. The network is trained on the MV-LRS [27], LRS2 [21], and LRS3 [22] datasets, and tested on LRS3. MV￾LRS and LRS2 contain material from British tel… view at source ↗
Figure 4
Figure 4. Figure 4: Enhancement performance when occluding varying amounts of the visual input for the 2 Speakers and 3 Speakers scenarios. Model notations are explained in the caption of [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
read the original abstract

Our objective is an audio-visual model for separating a single speaker from a mixture of sounds such as other speakers and background noise. Moreover, we wish to hear the speaker even when the visual cues are temporarily absent due to occlusion. To this end we introduce a deep audio-visual speech enhancement network that is able to separate a speaker's voice by conditioning on both the speaker's lip movements and/or a representation of their voice. The voice representation can be obtained by either (i) enrollment, or (ii) by self-enrollment -- learning the representation on-the-fly given sufficient unobstructed visual input. The model is trained by blending audios, and by introducing artificial occlusions around the mouth region that prevent the visual modality from dominating. The method is speaker-independent, and we demonstrate it on real examples of speakers unheard (and unseen) during training. The method also improves over previous models in particular for cases of occlusion in the visual modality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces a deep audio-visual speech enhancement network for separating a single speaker's voice from mixtures of other speakers and noise. The network conditions on lip movements and/or a voice representation obtained either by enrollment or by self-enrollment (learning the representation on-the-fly from unobstructed visual input). Training uses audio blending together with artificial occlusions around the mouth region; the method is claimed to be speaker-independent and to improve over prior models particularly when visual input is occluded, with demonstrations on real examples of unseen speakers.

Significance. If the generalization from synthetic to real occlusions holds and quantitative gains are confirmed, the work would advance practical speaker-independent audio-visual speech separation by providing a mechanism to handle temporary visual obstructions without visual dominance, extending enrollment-based conditioning to on-the-fly self-enrollment.

major comments (2)
  1. [Abstract] Abstract and evaluation: the central claim that artificial mouth occlusions during training enable generalization to real obstructions (and prevent visual dominance) is load-bearing, yet no quantitative comparison is reported between performance on held-out real occlusions versus the synthetic masks; the real-example demonstrations are described only qualitatively.
  2. [Method / Training] Training strategy: the description of how blending audios plus artificial occlusions produces fallback to audio when visual cues are absent lacks a controlled ablation or metric (e.g., SI-SDR or PESQ on occluded vs. unoccluded test conditions) that would verify the claimed robustness.
minor comments (1)
  1. [Abstract] The abstract states improvement 'over previous models' without naming the baselines or reporting the quantitative metrics used for that comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly to include the requested quantitative evaluations.

read point-by-point responses
  1. Referee: [Abstract] Abstract and evaluation: the central claim that artificial mouth occlusions during training enable generalization to real obstructions (and prevent visual dominance) is load-bearing, yet no quantitative comparison is reported between performance on held-out real occlusions versus the synthetic masks; the real-example demonstrations are described only qualitatively.

    Authors: We agree that a quantitative comparison would strengthen the central claim regarding generalization from synthetic to real occlusions. The manuscript currently supports this via qualitative demonstrations on real examples of unseen speakers. In revision we will add a quantitative evaluation (using SI-SDR and PESQ) on held-out real occlusion examples versus synthetic masks to directly verify the generalization and the avoidance of visual dominance. revision: yes

  2. Referee: [Method / Training] Training strategy: the description of how blending audios plus artificial occlusions produces fallback to audio when visual cues are absent lacks a controlled ablation or metric (e.g., SI-SDR or PESQ on occluded vs. unoccluded test conditions) that would verify the claimed robustness.

    Authors: We will add a controlled ablation study to the revised manuscript. This will report SI-SDR and PESQ metrics on occluded versus unoccluded test conditions, confirming that the training strategy (audio blending combined with artificial occlusions) enables the model to fall back to audio when visual input is absent. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained empirical training

full rationale

The paper describes an audio-visual speech enhancement network trained via audio blending and artificial mouth occlusions to handle temporary visual absence. No load-bearing step reduces by construction to its own inputs: there are no self-definitional equations, no fitted parameters renamed as predictions, and no self-citation chains invoked as uniqueness theorems. The central claims rest on the external training procedure and qualitative real-example demonstrations rather than any internal reduction or renaming of known results. This matches the default case of a non-circular empirical ML paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available, so ledger is incomplete. Training relies on audio blending and artificial occlusions as key strategies; these may involve unstated modeling choices or hyperparameters typical in deep learning but not detailed here.

pith-pipeline@v0.9.0 · 5701 in / 1240 out tokens · 52927 ms · 2026-05-24T23:36:45.686610+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 3 internal anchors

  1. [1]

    My lips are concealed: Audio-visual speech enhancement through obstructions

    Introduction While there has been great progress in the field of automatic speech recognition (ASR) in recent years, some key challenges remain, particularly the understanding of speech in very noisy environments or in cases where multiple people speak simul- taneously. In this direction, isolating voices in multi-speaker scenarios, increasing the signal-t...

  2. [2]

    The network receives three inputs: (i) the noisy audio to be en- hanced; (ii) the corresponding video frames; (iii) a reference audio containing speech from the target speaker

    Method This section describes the architecture of the audio-visual speech enhancement network, which is given in Figure 2. The network receives three inputs: (i) the noisy audio to be en- hanced; (ii) the corresponding video frames; (iii) a reference audio containing speech from the target speaker. We summa- rize the principal modules below. Details of th...

  3. [3]

    The network is trained on the MV-LRS [27], LRS2 [21], and LRS3 [22] datasets, and tested on LRS3

    Experimental Setup Datasets. The network is trained on the MV-LRS [27], LRS2 [21], and LRS3 [22] datasets, and tested on LRS3. MV- LRS and LRS2 contain material from British television broad- casts, while LRS3 was created from videos of TED talks. The speakers appearing in LRS3 are to the best of our knowledge not seen in either of the other two datasets....

  4. [4]

    Experiments 4.1. Evaluation protocol To evaluate the performance of our model we use the Signal to Distortion Ration (SDR) [28], a common metric expressing the ratio between the energy of the target signal and of the errors contained in the enhanced output. Furthermore to assess the in- telligibility of the output, we use the Google Cloud ASR system – we ...

  5. [5]

    Conclusion In this paper, we proposed a deep audio-visual speech enhance- ment network that is able to separate a speaker’s voice by con- ditioning on both the speaker’s lip movements and/or a repre- sentation of their voice. The network is robust to partial oc- clusions, and the voice representation can be self-enrolled from the unoccluded part of the in...

  6. [6]

    The conversation: Deep audio-visual speech enhancement,

    T. Afouras, J. S. Chung, and A. Zisserman, “The conversation: Deep audio-visual speech enhancement,” in Proc. Interspeech,

  7. [7]

    Looking to listen at the cock- tail party: A speaker-independent audio-visual model for speech separation,

    A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, “Looking to listen at the cock- tail party: A speaker-independent audio-visual model for speech separation,” SIGGRAPH, 2018. 1, 2, 3

  8. [8]

    Audio-visual scene analysis with self- supervised multisensory features,

    A. Owens and A. A. Efros, “Audio-visual scene analysis with self- supervised multisensory features,” in Proc. ECCV, 2018, pp. 631–

  9. [9]

    V oice- filter: Targeted voice separation by speaker-conditioned spectro- gram masking,

    Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J. Her- shey, R. A. Saurous, R. J. Weiss, Y . Jia, and I. L. Moreno, “V oice- filter: Targeted voice separation by speaker-conditioned spectro- gram masking,” in Proc. Interspeech, 2018. 1, 2, 3, 4

  10. [10]

    Deep clus- tering: Discriminative embeddings for segmentation and separa- tion,

    J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clus- tering: Discriminative embeddings for segmentation and separa- tion,” in Proc. ICASSP. IEEE, 2016, pp. 31–35. 1

  11. [11]

    Permutation invari- ant training of deep models for speaker-independent multi-talker speech separation,

    D. Yu, M. Kolbæk, Z. Tan, and J. Jensen, “Permutation invari- ant training of deep models for speaker-independent multi-talker speech separation,” in Proc. ICASSP, 2017. 1, 3, 4

  12. [12]

    Soft mask methods for single-channel speaker separation,

    A. M. Reddy and B. Raj, “Soft mask methods for single-channel speaker separation,” IEEE Transactions on Audio, Speech, and Language Processing, 2007. 1

  13. [13]

    A supervised learning approach to monaural segregation of reverberant speech,

    Z. Jin and D. Wang, “A supervised learning approach to monaural segregation of reverberant speech,” IEEE Transactions on Audio, Speech, and Language Processing, 2009. 1

  14. [14]

    Single-channel speech sep- aration using soft mask filtering,

    M. H. Radfar and R. M. Dansereau, “Single-channel speech sep- aration using soft mask filtering,” IEEE Transactions on Audio, Speech, and Language Processing, 2007. 1

  15. [15]

    Makino, T.-W

    S. Makino, T.-W. Lee, and H. Sawada, Blind speech separation. Springer, 2007. 1

  16. [16]

    Supervised speech separation based on deep learning: an overview,

    D. Wang and J. Chen, “Supervised speech separation based on deep learning: an overview,” IEEE Transactions on Audio, Speech and Language Processing, 2017. 1

  17. [17]

    Speaker separation using visually-derived binary masks,

    F. Khan and B. Milner, “Speaker separation using visually-derived binary masks,” in AVSP, 2013. 1

  18. [18]

    Video assisted speech source separation,

    W. Wang, D. Cosker, Y . Hicks, S. Saneit, and J. Chambers, “Video assisted speech source separation,” in Proc. ICASSP, 2005. 1

  19. [19]

    Audio-visual enhance- ment of speech in noise,

    L. Girin, J.-L. Schwartz, and G. Feng, “Audio-visual enhance- ment of speech in noise,” The Journal of the Acoustical Society of America, 2001. 1

  20. [20]

    Audio-visual speech en- hancement with avcdcn (audio-visual codebook dependent cep- stral normalization),

    S. Deligne, G. Potamianos, and C. Neti, “Audio-visual speech en- hancement with avcdcn (audio-visual codebook dependent cep- stral normalization),” in Sensor Array and Multichannel Signal Processing Workshop Proceedings, 2002. 1

  21. [21]

    Audio-visual sound separation via hidden markov models,

    J. R. Hershey and M. Casey, “Audio-visual sound separation via hidden markov models,” in NIPS, 2002. 1

  22. [22]

    Audio- visual speech source separation: An overview of key methodolo- gies,

    B. Rivet, W. Wang, S. M. Naqvi, and J. A. Chambers, “Audio- visual speech source separation: An overview of key methodolo- gies,” IEEE Signal Processing Magazine, 2014. 1

  23. [23]

    Seeing through noise: Visually driven speaker separation and enhancement,

    A. Gabbay, A. Ephrat, T. Halperin, and S. Peleg, “Seeing through noise: Visually driven speaker separation and enhancement,”

  24. [24]

    Visual Speech Enhancement

    A. Gabbay, A. Shamir, and S. Peleg, “Visual Speech En- hancement using Noise-Invariant Training,” arXiv preprint arXiv:1711.08789, 2017. 1

  25. [25]

    Audio-Visual Speech Enhancement Using Multi- modal Deep Convolutional Neural Networks,

    J.-C. Hou, S.-S. Wang, Y .-H. Lai, Y . Tsao, H.-W. Chang, and H.-M. Wang, “Audio-Visual Speech Enhancement Using Multi- modal Deep Convolutional Neural Networks,”IEEE Transactions on Emerging Topics in Computational Intelligence, 2018. 1

  26. [26]

    Deep audio-visual speech recognition,

    T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisser- man, “Deep audio-visual speech recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019. 2, 3

  27. [27]

    LRS3-TED: a large-scale dataset for visual speech recognition

    T. Afouras, J. S. Chung, and A. Zisserman, “LRS3-TED: a large- scale dataset for visual speech recognition,” in arXiv preprint arXiv:1809.00496, 2018. 2, 3

  28. [28]

    Combining residual net- works with LSTMs for lipreading,

    T. Stafylakis and G. Tzimiropoulos, “Combining residual net- works with LSTMs for lipreading,” in Proc. Interspeech, 2017. 2, 3

  29. [29]

    Deep residual learning for image recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR, 2016. 2

  30. [30]

    Utterance- level aggregation for speaker recognition in the wild,

    W. Xie, A. Nagrani, J. S. Chung, and A. Zisserman, “Utterance- level aggregation for speaker recognition in the wild,” in Proc. ICASSP, 2019. 2

  31. [31]

    V oxCeleb2: Deep speaker recognition,

    J. S. Chung, A. Nagrani, and A. Zisserman, “V oxCeleb2: Deep speaker recognition,” in Proc. Interspeech, 2018. 2

  32. [32]

    Lip reading in profile,

    J. S. Chung and A. Zisserman, “Lip reading in profile,” in Proc. BMVC., 2017. 3

  33. [33]

    BSS EV AL toolbox user guide,

    C. F ´evotte, R. Gribonval, and E. Vincent, “BSS EV AL toolbox user guide,” IRISA Technical Report 1706. http://www.irisa.fr/metiss/bss eval/., 2005. 3