LipReading with 3D-2D-CNN BLSTM-HMM and word-CTC models
Pith reviewed 2026-05-25 16:32 UTC · model grok-4.3
The pith
A 3D-2D-CNN-BLSTM network trained with word CTC reaches 1.3% WER on GRID seen-speaker lipreading.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Training the 3D-2D-CNN-BLSTM network directly with word-level CTC loss produces 1.3 percent WER on the GRID seen-speaker test set, a 55 percent relative improvement over LCANet, and 8.6 percent WER on the unseen-speaker set, a 24.5 percent relative improvement over LipNet; the character-CTC-plus-HMM route further shows that the extracted bottleneck features outperform traditional DCT features inside a hybrid recognition pipeline.
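The two relative-improvement figures can be sanity-checked with a little arithmetic. The Python sketch below backs out the baseline WER that each figure implies, assuming "relative improvement" means (baseline - new) / baseline. The function names are illustrative, and the baselines are derived from the rounded percentages quoted here, not copied from the LCANet or LipNet papers.

```python
def implied_baseline_wer(new_wer, relative_improvement):
    """Solve new = baseline * (1 - r) for the baseline WER."""
    return new_wer / (1.0 - relative_improvement)


def relative_improvement(baseline_wer, new_wer):
    """Fraction of the baseline error removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer


# Figures quoted in the core claim (WER in percent).
seen_baseline = implied_baseline_wer(1.3, 0.55)      # implied LCANet WER
unseen_baseline = implied_baseline_wer(8.6, 0.245)   # implied LipNet WER

print(round(seen_baseline, 2), round(unseen_baseline, 2))
# Round trip: recover the quoted improvements from the implied baselines.
print(round(relative_improvement(seen_baseline, 1.3), 3),
      round(relative_improvement(unseen_baseline, 8.6), 3))
```

Under that assumption the implied baselines come out near 2.9% and 11.4% WER. Because the quoted percentages are rounded, these values are approximate.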
What carries the argument
3D-2D-CNN-BLSTM network with bottleneck layer, trained either via character CTC followed by BLSTM-HMM or directly via word CTC.
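To make the split between the 3D and 2D stages concrete, the Python sketch below traces feature-map shapes through two 3D and two 2D convolutions, the layer counts given in the paper's model section as excerpted in the reference graph. The input size, kernel sizes, strides, and padding are hypothetical placeholders, not the paper's settings, and channel dimensions are omitted.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def conv3d_shape(shape, kernel, stride, padding):
    """3D convolution: the kernel slides over time, height and width."""
    return tuple(conv_out(s, k, st, p)
                 for s, k, st, p in zip(shape, kernel, stride, padding))


def conv2d_shape(shape, kernel, stride, padding):
    """2D convolution applied per frame: only height and width can change."""
    t, h, w = shape
    return (t,
            conv_out(h, kernel[0], stride[0], padding[0]),
            conv_out(w, kernel[1], stride[1], padding[1]))


# Hypothetical input: 75 frames of 50x100 mouth crops, as (T, H, W).
shape = (75, 50, 100)
for _ in range(2):  # two 3D CNN layers (assumed kernel/stride/padding)
    shape = conv3d_shape(shape, kernel=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2))
    print("after 3D conv:", shape)
for _ in range(2):  # two 2D CNN layers (assumed kernel/stride/padding)
    shape = conv2d_shape(shape, kernel=(3, 3), stride=(1, 1), padding=(1, 1))
    print("after 2D conv:", shape)
```

With these assumed settings the time axis keeps all 75 steps, so the BLSTM layers would receive one feature vector per video frame.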
If this is right
- Bottleneck features extracted after character CTC training outperform DCT features inside a conventional BLSTM-HMM system.
- Direct word-level CTC training on the network yields lower error than the two-stage character-CTC-plus-HMM route.
- The same architecture and training recipe produce usable results on an independent 81-speaker dataset.
- Feature duplication inside the BLSTM-HMM stage measurably changes final word error rate.
Where Pith is reading between the lines
- If word-level supervision is the main driver of the gains, the same pattern may appear in other visual sequence tasks such as sign-language recognition.
- The gap between seen- and unseen-speaker error rates indicates that speaker-independent lipreading remains harder; further work could test whether the bottleneck features transfer across entirely new recording conditions.
- The two approaches could be combined, for example by using word-CTC features to initialize the HMM stage, though the paper does not explore this.
Load-bearing premise
The GRID corpus seen- and unseen-speaker partitions are representative of real-world lipreading conditions and contain no data leakage or speaker overlap.
What would settle it
A new lipreading corpus recorded under different lighting, camera angles, or vocabulary where the word-CTC version fails to match or beat the prior LCANet and LipNet error rates.
Original abstract
In recent years, deep learning based machine lipreading has gained prominence. To this end, several architectures such as LipNet, LCANet and others have been proposed which perform extremely well compared to traditional lipreading DNN-HMM hybrid systems trained on DCT features. In this work, we propose a simpler architecture of 3D-2D-CNN-BLSTM network with a bottleneck layer. We also present analysis of two different approaches for lipreading on this architecture. In the first approach, 3D-2D-CNN-BLSTM network is trained with CTC loss on characters (ch-CTC). Then BLSTM-HMM model is trained on bottleneck lip features (extracted from 3D-2D-CNN-BLSTM ch-CTC network) in a traditional ASR training pipeline. In the second approach, same 3D-2D-CNN-BLSTM network is trained with CTC loss on word labels (w-CTC). The first approach shows that bottleneck features perform better compared to DCT features. Using the second approach on Grid corpus' seen speaker test set, we report $1.3\%$ WER - a $55\%$ improvement relative to LCANet. On unseen speaker test set we report $8.6\%$ WER which is $24.5\%$ improvement relative to LipNet. We also verify the method on a second dataset of $81$ speakers which we collected. Finally, we also discuss the effect of feature duplication on BLSTM-HMM model performance.
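The difference between ch-CTC and w-CTC is the label unit that the network emits per frame. The Python sketch below shows the standard CTC collapse rule (merge consecutive repeats, then drop blanks) applied to a character-level and a word-level frame sequence. The frame sequences, the blank symbol name, and the use of a space character in the character alphabet are assumptions made for illustration; the abstract does not specify the label sets.

```python
from itertools import groupby

BLANK = "<b>"  # hypothetical name for the CTC blank symbol


def ctc_collapse(frame_labels):
    """Map a per-frame label path to an output sequence:
    merge consecutive repeats, then remove blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != BLANK]


# Character-level path (ch-CTC style): one character or blank per frame.
char_frames = ["s", "s", BLANK, "e", "e", "t", BLANK, " ",
               "r", "e", BLANK, "d", "d"]
# Word-level path (w-CTC style): one whole word or blank per frame.
word_frames = [BLANK, "set", "set", BLANK, BLANK, "red", "red", "red", BLANK]

print("".join(ctc_collapse(char_frames)))
print(" ".join(ctc_collapse(word_frames)))
```

Both paths collapse to the same text, "set red". The CTC loss sums the probability of every frame-level path that collapses to the target, so moving from characters to words changes the output alphabet, not the collapse rule.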
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a 3D-2D-CNN-BLSTM architecture for lipreading and evaluates two approaches: (1) character-level CTC training followed by BLSTM-HMM on extracted bottleneck features, and (2) word-level CTC training. On the GRID corpus it reports 1.3% WER on the seen-speaker test set (55% relative improvement over LCANet) and 8.6% WER on the unseen-speaker test set (24.5% relative improvement over LipNet); it also states that the method was verified on a self-collected 81-speaker dataset and discusses the effect of feature duplication on the BLSTM-HMM stage.
Significance. If the reported WER numbers prove comparable under identical evaluation conditions, the work would show that a relatively compact 3D-2D-CNN-BLSTM model trained with word-CTC can substantially outperform prior published lipreading systems on a public benchmark, while the hybrid CTC-plus-HMM pipeline demonstrates the utility of learned bottleneck features over hand-crafted DCT features.
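Since every headline number in this report is a WER, a minimal Python sketch of the metric may help: word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. This is the textbook definition, not the authors' scoring script, and the function name is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted letter word in a six-word GRID-style sentence.
print(word_error_rate("set red with m six please", "set red with n six please"))
```

For a whole test set, WER is normally computed as total edits over total reference words, not as the mean of per-sentence values.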
major comments (2)
- [Abstract and §4] Abstract and §4 (results): the central claim consists of the 1.3% and 8.6% WER figures together with the stated relative gains. These numbers are only interpretable if the authors' 'seen speaker test set' and 'unseen speaker test set' are exactly the same speaker partitions, frame-rate, cropping, and word-segmentation protocol used by the LCANet and LipNet papers. The manuscript supplies neither the speaker IDs, a statement of disjointness from training speakers, nor an explicit confirmation that the evaluation protocol matches the baselines.
- [Abstract and §3] Abstract and §3 (experimental setup): no training hyper-parameters, optimizer settings, data-augmentation details, or baseline re-implementation protocol are provided. Because the numerical support for the claimed improvements rests entirely on the reported WER values, the absence of these details prevents verification that the gains are not due to differences in preprocessing or optimization.
minor comments (2)
- [Abstract] The second dataset is described only as '81 speakers which we collected'; no WER numbers, speaker counts for train/test, or comparison to GRID are supplied, so the verification claim cannot be assessed.
- [§2] Notation for the two CTC variants (ch-CTC vs. w-CTC) and the bottleneck layer is introduced without an accompanying diagram or equation that would clarify the precise location of the bottleneck relative to the BLSTM.
Simulated Author's Rebuttal
We thank the referee for the comments, which highlight important aspects of reproducibility. We address each major comment below.
-
Referee: [Abstract and §4] Abstract and §4 (results): the central claim consists of the 1.3% and 8.6% WER figures together with the stated relative gains. These numbers are only interpretable if the authors' 'seen speaker test set' and 'unseen speaker test set' are exactly the same speaker partitions, frame-rate, cropping, and word-segmentation protocol used by the LCANet and LipNet papers. The manuscript supplies neither the speaker IDs, a statement of disjointness from training speakers, nor an explicit confirmation that the evaluation protocol matches the baselines.
Authors: We agree that the manuscript must explicitly document the evaluation protocol to support the claimed gains. In the revision we will add the speaker IDs used for the seen-speaker and unseen-speaker test sets, a statement confirming they are disjoint from the training speakers, and an explicit declaration that frame-rate, cropping, and word-segmentation follow the protocols of the LCANet and LipNet papers. This will make the 1.3% and 8.6% WER figures directly comparable under identical conditions. revision: yes
-
Referee: [Abstract and §3] Abstract and §3 (experimental setup): no training hyper-parameters, optimizer settings, data-augmentation details, or baseline re-implementation protocol are provided. Because the numerical support for the claimed improvements rests entirely on the reported WER values, the absence of these details prevents verification that the gains are not due to differences in preprocessing or optimization.
Authors: We acknowledge that the current manuscript omits these implementation details. The revised version will contain a dedicated experimental-setup subsection listing the optimizer, learning-rate schedule, batch size, number of epochs, data-augmentation transforms, and the exact protocol followed when comparing against the published LCANet and LipNet numbers. These additions will allow independent verification that the reported improvements are not artifacts of differing preprocessing or optimization. revision: yes
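The speaker-partition documentation discussed in the first response lends itself to an automated check. The Python sketch below assumes the usual reading of the two test conditions: unseen-speaker test speakers must not appear in training, while seen-speaker test speakers are expected to appear there (with different utterances). The function name and all speaker IDs are hypothetical; the real IDs are exactly what the referee asks the authors to publish.

```python
def check_speaker_split(train_speakers, seen_test_speakers, unseen_test_speakers):
    """Report speakers that violate the assumed seen/unseen protocol."""
    train = set(train_speakers)
    seen = set(seen_test_speakers)
    unseen = set(unseen_test_speakers)
    return {
        # Any entry here means the "unseen" set leaks training speakers.
        "unseen_speakers_also_in_training": sorted(unseen & train),
        # Any entry here means a "seen" test speaker was never trained on.
        "seen_speakers_missing_from_training": sorted(seen - train),
    }


# Hypothetical speaker IDs, chosen so that one violation shows up.
report = check_speaker_split(
    train_speakers=["s1", "s2", "s3", "s4"],
    seen_test_speakers=["s1", "s3"],
    unseen_test_speakers=["s5", "s2"],
)
print(report)
```

A clean split returns two empty lists. A fuller check would also compare utterance IDs so that no seen-speaker test sentence is reused in training.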
Circularity Check
No circularity detected in empirical performance reporting
The paper is an empirical ML study reporting WER numbers on the public GRID corpus (seen/unseen speaker splits) and a self-collected 81-speaker set, with direct numerical comparisons to independently published baselines (LipNet, LCANet). No derivation chain, equations, or fitted parameters exist that reduce any claimed result to an input by construction. The reported improvements are external benchmark comparisons, not internal predictions or self-referential quantities. Self-citation is absent from the load-bearing claims.
Reference graph
Works this paper leans on
-
[1]
Introduction The process of using only visual information of lip movements to convert speech to text is called machine lipreading. The visual information such as lip movements, facial expression, tongue and teeth movements helps us in understanding other person’s speech when audio is interrupted or corrupted. Thus visual information acts as complementary ...
-
[2]
Related Work Two decades ago, lipreading was seen as a word classification problem, where each input video is classified as one of a limited set of words. The authors in [14] do word classification using different variations of 3D CNN architectures. Word classification using CNNs followed by RNNs or HMMs is presented in a number of different papers [4, 15, 16]. L...
-
[3]
Models 3.1. 3D-2D-CNN-BLSTM Network In this section we describe the architecture of the proposed 3D-2D-CNN-BLSTM network. This network contains two 3D CNN layers and two 2D CNN layers, followed by two BLSTM layers, as shown in Figure 1. In 3D convolution, the kernel moves along the time, height and width dimensions of the input. Whereas in 2D convolution the kernel only mov...
-
[4]
includes only segmented words. From the second epoch on, all full sentences are included. Curriculum learning helped convergence: the 3D-2D-CNN-BLSTM w-CTC model (described in subsection 3.3) took 45 epochs to converge with curriculum learning and 89 epochs without it. Similarly for other exper-...
-
[5]
Experiments 4.1. Datasets 4.1.1. Grid The Grid audio-visual dataset is a widely used dataset for audio-visual or visual speech recognition tasks [22]. Each sentence of Grid has a fixed structure of six words: command + color + preposition + letter + digit + adverb. For example, “set red with m six please”. The dataset has 51 unique words....
-
[6]
Results Results are reported on the seen test set and on the unseen test set for the Grid dataset. For the Indian English (In-En) dataset we report results on the unseen dataset. The results for our proposed approaches, compared with the baseline DCT BLSTM-HMM and with LipNet, LCANet, and WAS (WLAS), are presented in Table 3. It is shown that 3D-2D-CNN-BLSTM w-CTC approach has ac...
-
[7]
CONCLUSION We proposed a new 3D-2D-CNN-BLSTM architecture, which is comparable to LCANet on Grid when ch-CTC loss is used. The proposed 3D-2D-CNN-BLSTM w-CTC has given state-of-the-art results with relative improvements of 55% and 24.5% on the Grid seen and unseen test sets, with 1.3% WER and 8.6% WER respectively. We also demonstrated that 3D-2D-CNN BLSTM-HMM...
-
[8]
A. Thanda and S. M. Venkatesan, “Multi-task learning of deep neural networks for audio visual automatic speech recognition,” arXiv preprint arXiv:1701.02477, 2017
-
[9]
T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Deep audio-visual speech recognition,” arXiv preprint arXiv:1809.02108, 2018
-
[10]
R. Aralikatti, D. Margam, T. Sharma, T. Abhinav, and S. M. Venkatesan, “Global SNR estimation of speech signals using entropy and uncertainty estimates from dropout networks,” arXiv preprint arXiv:1804.04353, 2018
-
[11]
Z. Zhou, G. Zhao, X. Hong, and M. Pietikäinen, “A review of recent advances in visual speech decoding,” Image and vision computing, vol. 32, no. 9, pp. 590–605, 2014
-
[12]
S. Hilder, R. W. Harvey, and B.-J. Theobald, “Comparison of human and machine-based lip-reading.” in AVSP, 2009, pp. 86–89
-
[13]
J. S. Chung, A. W. Senior, O. Vinyals, and A. Zisserman, “Lip reading sentences in the wild.” in CVPR, 2017, pp. 3444–3453
-
[14]
Y. M. Assael, B. Shillingford, S. Whiteson, and N. de Freitas, “Lipnet: End-to-end sentence-level lipreading,” arXiv preprint arXiv:1611.01599, 2016
-
[15]
K. Xu, D. Li, N. Cassimatis, and X. Wang, “Lcanet: End-to-end lipreading with cascaded attention-ctc,” in Automatic Face & Gesture Recognition (FG 2018), 2018 13th IEEE International Conference on. IEEE, 2018, pp. 548–555
-
[16]
A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning. ACM, 2006, pp. 369–376
-
[17]
A. Thanda and S. M. Venkatesan, “Audio visual speech recognition using deep recurrent neural networks,” in IAPR Workshop on Multimodal Pattern Recognition of Social Signals in Human-Computer Interaction. Springer, 2016, pp. 98–109
-
[18]
K. Audhkhasi, B. Ramabhadran, G. Saon, M. Picheny, and D. Nahamoo, “Direct acoustics-to-word models for english conversational speech recognition,” arXiv preprint arXiv:1703.07754, 2017
-
[20]
J. Huang and B. Kingsbury, “Audio-visual deep learning for noise robust speech recognition,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 7596–7599
-
[21]
J. S. Chung and A. Zisserman, “Lip reading in the wild,” in Asian Conference on Computer Vision. Springer, 2016, pp. 87–103
-
[22]
S. Gergen, S. Zeiler, A. H. Abdelaziz, R. M. Nickel, and D. Kolossa, “Dynamic stream weighting for turbo-decoding-based audiovisual asr.” in INTERSPEECH, 2016, pp. 2135–2139
-
[23]
S. Petridis, Z. Li, and M. Pantic, “End-to-end visual speech recognition with lstms,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 2592–2596
-
[24]
Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in Proceedings of the 26th annual international conference on machine learning. ACM, 2009, pp. 41–48
-
[25]
C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, D. Vergyri, J. Sison, and A. Mashari, “Audio visual speech recognition,” IDIAP, Tech. Rep., 2000
-
[26]
B. Shillingford, Y. Assael, M. W. Hoffman, T. Paine, C. Hughes, U. Prabhu, H. Liao, H. Sak, K. Rao, L. Bennett et al., “Large-scale visual speech recognition,” arXiv preprint arXiv:1807.05162, 2018
-
[27]
H. Soltau, H. Liao, and H. Sak, “Neural speech recognizer: Acoustic-to-word lstm model for large vocabulary speech recognition,” arXiv preprint arXiv:1610.09975, 2016
-
[28]
A. G. Chiţu and L. J. Rothkrantz, “The influence of video sampling rate on lipreading performance,” in 12-th International Conference on Speech and Computer (SPECOM’2007), 2007, pp. 678–684
-
[29]
M. Cooke, J. Barker, S. Cunningham, and X. Shao, “An audio-visual corpus for speech perception and automatic speech recognition,” The Journal of the Acoustical Society of America, vol. 120, no. 5, pp. 2421–2424, 2006
-
[30]
J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” arXiv preprint, 2017
-
[31]
M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: a system for large-scale machine learning.” in OSDI, vol. 16, 2016, pp. 265–283
-
[32]
D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014
-
[33]
D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The kaldi speech recognition toolkit,” in IEEE 2011 workshop on automatic speech recognition and understanding, no. EPFL-CONF-192584. IEEE Signal Processing Society, 2011