LipNet: End-to-End Sentence-level Lipreading

Brendan Shillingford; Nando de Freitas; Shimon Whiteson; Yannis M. Assael

Not yet reviewed by Pith; the record is open.

This paper has not been read by Pith yet. Machine review is queued; the pith claim, tier, and objections will appear here once it completes.

arxiv 1611.01599 v2 pith:FYM3WX3N submitted 2016-11-05 cs.LG cs.CL cs.CV

Yannis M. Assael, Brendan Shillingford, Shimon Whiteson, Nando de Freitas

classification cs.LG cs.CL cs.CV

keywords end-to-end, lipreading, lipnet, sentence-level, features, model, sequence, accuracy

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

abstract

Lipreading is the task of decoding text from the movement of a speaker's mouth. Traditional approaches separated the problem into two stages: designing or learning visual features, and prediction. More recent deep lipreading approaches are end-to-end trainable (Wand et al., 2016; Chung & Zisserman, 2016a). However, existing work on models trained end-to-end performs only word classification, rather than sentence-level sequence prediction. Studies have shown that human lipreading performance increases for longer words (Easton & Basala, 1982), indicating the importance of features capturing temporal context in an ambiguous communication channel. Motivated by this observation, we present LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end. To the best of our knowledge, LipNet is the first end-to-end sentence-level lipreading model that simultaneously learns spatiotemporal visual features and a sequence model. On the GRID corpus, LipNet achieves 95.2% accuracy on the sentence-level, overlapped speaker split task, outperforming experienced human lipreaders and the previous 86.4% word-level state-of-the-art accuracy (Gergen et al., 2016).
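
The abstract names the connectionist temporal classification (CTC) loss as the piece that lets a per-frame model emit variable-length text. As a rough illustration only, the sketch below shows the standard CTC collapsing rule applied to a best-path (greedy) frame labelling: merge consecutive repeats, then drop blanks. The alphabet, the blank index, and the function name are assumptions made for this example and are not taken from the paper, whose decoding procedure is not described on this page.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical choice)


def ctc_greedy_decode(frame_labels, alphabet, blank=BLANK):
    """Apply the CTC collapsing rule to one label per video frame.

    Consecutive duplicate labels are merged first, then blank labels
    are removed, so a repeated character survives only if a blank
    separates its two occurrences.
    """
    collapsed = [label for label, _ in groupby(frame_labels)]
    return "".join(alphabet[label] for label in collapsed if label != blank)


# Hypothetical toy alphabet; index 0 is the blank.
alphabet = ["-", "a", "b", "i", "n", " "]

# Nine frames of per-frame argmax labels (made-up values).
frames = [0, 2, 2, 0, 3, 3, 4, 0, 0]
print(ctc_greedy_decode(frames, alphabet))  # bin
```

Because the collapsing step maps many frame-level labellings to the same string, the output length is decoupled from the number of input frames, which is what makes sentence-level prediction from variable-length video possible without frame-level alignments.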

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A First Exploration of Neuromorphic OT-CFM for Multi-Speaker VSR
cs.MM 2026-06 unverdicted novelty 6.0

LipsFlow converts RGB videos to event streams, segments multi-speaker scenes with ByteTrack and TalkNet, and uses OT-CFM plus dual semantic supervision to report 22.3% WER at 240 ms latency on VSR benchmarks.
Head-Pose-Aware Visual Speech Recognition with FiLM Modulation
cs.CV 2026-05 unverdicted novelty 5.0

HP-VSR-ResFiLM adds a single residual FiLM modulation block conditioned on head pose to a CNN visual encoder, yielding WER of 25.0% on LRS2 and 33.2% on LRS3 under standard training conditions.
A First Exploration of Neuromorphic OT-CFM for Multi-Speaker VSR
cs.MM 2026-06 unverdicted novelty 4.0

LipsFlow turns RGB video into event streams, segments multi-speaker scenes with tracking, and uses OT-CFM plus dual semantic supervision to reach 22.3% WER at 240 ms latency on VSR benchmarks.
LipReading with 3D-2D-CNN BLSTM-HMM and word-CTC models
cs.CV 2019-06 unverdicted novelty 4.0

3D-2D-CNN-BLSTM with word-CTC reaches 1.3% WER on GRID seen-speaker lipreading (55% relative gain over LCANet) and 8.6% on unseen speakers (24.5% gain over LipNet).
Naver at ActivityNet Challenge 2019 -- Task B Active Speaker Detection (AVA)
cs.SD 2019-06 unverdicted novelty 2.0

A 3D CNN front-end plus temporal convolution and LSTM ensemble yields significant gains over baseline for active speaker detection on the AVA dataset.
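
Several of the entries above report word error rate (WER). As a minimal sketch, assuming the usual definition (word-level edit distance divided by the number of reference words) and a non-empty reference, WER can be computed as follows. The function name and the example sentences are illustrative and do not come from any of the listed papers.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length.

    Assumes the reference contains at least one word.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative command-style sentences (made up for this example).
wer = word_error_rate("bin blue at f two now", "bin blue at f too now")
print(f"{wer:.3f}")  # 0.167
```

Accuracy figures such as the 95.2% in the abstract are commonly read as 1 minus WER, though whether a given paper uses exactly that convention should be checked against the paper itself.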