Deep Audio-Visual Speech Recognition

arxiv: 1809.02108 · v2 · pith:QDTQHBVA · submitted 2018-09-06 · cs.CV

Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, Andrew Zisserman

classification: cs.CV

keywords: reading, audio, models, recognition, sentences, speech, audio-visual, dataset

The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that have focussed on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem: unconstrained natural language sentences, and in-the-wild videos. Our key contributions are: (1) we compare two models for lip reading, one using a CTC loss, and the other using a sequence-to-sequence loss. Both models are built on top of the transformer self-attention architecture; (2) we investigate to what extent lip reading is complementary to audio speech recognition, especially when the audio signal is noisy; (3) we introduce and publicly release a new dataset for audio-visual speech recognition, LRS2-BBC, consisting of thousands of natural sentences from British television. The models that we train surpass the performance of all previous work on a lip reading benchmark dataset by a significant margin.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HighSync: High-Quality Lip Synchronization via Latent Diffusion Models
cs.CV · 2026-05 · unverdicted · novelty 5.0

HighSync is a diffusion-based lip synchronization system that operates natively at 512x512 resolution by eliminating data leakage to enforce genuine audio dependence and reports state-of-the-art results on quality and...