LipReading with 3D-2D-CNN BLSTM-HMM and word-CTC models

Abhinav Thanda; Dilip Kumar Margam; Pujitha A K; Rohith Aralikatti; Shankar M Venkatesan; Sharad Roy; Tanay Sharma

arxiv: 1906.12170 · v1 · pith:PTZ2OM4Jnew · submitted 2019-06-25 · cs.CV · cs.LG · cs.SD · eess.AS · eess.IV

Pith reviewed 2026-05-25 16:32 UTC · model grok-4.3

classification: cs.CV · cs.LG · cs.SD · eess.AS · eess.IV

keywords: lipreading · CTC loss · 3D-2D-CNN · BLSTM · word error rate · GRID corpus · bottleneck features · visual speech recognition

The pith

A 3D-2D-CNN-BLSTM network trained with word CTC reaches 1.3% WER on GRID seen-speaker lipreading.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a 3D-2D-CNN-BLSTM network with a bottleneck layer and evaluates two training strategies for lipreading. In the first, character-level CTC pre-trains the network so that its bottleneck features can feed a separate BLSTM-HMM pipeline; in the second, the same network is trained end-to-end with word-level CTC. On the GRID corpus the word-CTC version records 1.3 percent WER for seen speakers and 8.6 percent WER for unseen speakers, improving on earlier LCANet and LipNet numbers, and the method is checked on a second 81-speaker collection.
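All headline numbers here are word error rates. As a reminder of what that metric measures, the sketch below computes WER for a single sentence as the word-level edit distance divided by the reference length. This is a minimal illustration, not the authors' scoring code: the function name `word_error_rate` is an assumption, and the hypothesis sentence is invented. The reference sentence is the GRID example quoted in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six (invented hypothesis, for illustration only).
wer = word_error_rate("set red with m six please", "set red with n six please")
print(f"{wer:.3f}")  # prints 0.167
```

A corpus-level WER is normally the total number of edits divided by the total number of reference words, not the mean of per-sentence values.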

Core claim

Training the 3D-2D-CNN-BLSTM network directly with word-level CTC loss produces 1.3 percent WER on the GRID seen-speaker test set, a 55 percent relative improvement over LCANet, and 8.6 percent WER on the unseen-speaker set, a 24.5 percent relative improvement over LipNet; the character-CTC-plus-HMM route further shows that the extracted bottleneck features outperform traditional DCT features inside a hybrid recognition pipeline.
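The relative-improvement figures can be sanity-checked with simple arithmetic. The sketch below back-calculates the baseline WERs implied by the stated numbers. The implied values are derived here, not quoted from this page, and the function names are illustrative assumptions.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer undercuts baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


def implied_baseline(new_wer: float, rel_improvement: float) -> float:
    """Invert relative_improvement to recover the baseline WER."""
    return new_wer / (1.0 - rel_improvement)


# Seen speakers: 1.3% WER stated as a 55% relative improvement over LCANet.
seen_baseline = implied_baseline(1.3, 0.55)
# Unseen speakers: 8.6% WER stated as a 24.5% relative improvement over LipNet.
unseen_baseline = implied_baseline(8.6, 0.245)

print(f"{seen_baseline:.2f} {unseen_baseline:.2f}")  # prints 2.89 11.39
```

The stated gains therefore correspond to baseline WERs of roughly 2.9% (seen) and 11.4% (unseen).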

What carries the argument

3D-2D-CNN-BLSTM network with bottleneck layer, trained either via character CTC followed by BLSTM-HMM or directly via word CTC.
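The two training routes differ mainly in the label sequence that CTC aligns to the video frames. The sketch below contrasts character targets with word targets for a short phrase and applies the standard CTC collapse rule (merge repeats, then drop blanks). The blank token name, the frame-level paths, and the use of a space symbol in the character vocabulary are assumptions for illustration, not details taken from the paper.

```python
BLANK = "<b>"  # hypothetical name for the CTC blank symbol


def ctc_collapse(path):
    """Merge consecutive repeated symbols, then remove blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return out


phrase = "set red"
char_targets = list(phrase)    # ch-CTC style: one label per character
word_targets = phrase.split()  # w-CTC style: one label per word

# Invented frame-level outputs that collapse to the targets above.
char_path = ["s", "s", BLANK, "e", "t", BLANK, " ", "r", "r", "e", BLANK, "d"]
word_path = [BLANK, "set", "set", BLANK, BLANK, "red", "red"]

print(ctc_collapse(char_path) == char_targets)  # prints True
print(ctc_collapse(word_path) == word_targets)  # prints True
```

With word labels the target sequence is much shorter than with characters, so each output unit spans more frames. The price is an output layer that needs one unit per vocabulary word.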

If this is right

Bottleneck features extracted after character CTC training outperform DCT features inside a conventional BLSTM-HMM system.
Direct word-level CTC training on the network yields lower error than the two-stage character-CTC-plus-HMM route.
The same architecture and training recipe produces usable results on an independent 81-speaker dataset.
Feature duplication inside the BLSTM-HMM stage measurably changes final word error rate.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If word-level supervision is the main driver of the gains, the same pattern may appear in other visual sequence tasks such as sign-language recognition.
The gap between seen- and unseen-speaker error rates indicates that speaker-independent lipreading remains harder; further work could test whether the bottleneck features transfer across entirely new recording conditions.
The two approaches could be combined, for example by using word-CTC features to initialize the HMM stage, though the paper does not explore this.

Load-bearing premise

The GRID corpus seen- and unseen-speaker partitions are representative of real-world lipreading conditions and contain no data leakage or speaker overlap.

What would settle it

A new lipreading corpus recorded under different lighting, camera angles, or vocabulary where the word-CTC version fails to match or beat the prior LCANet and LipNet error rates.

Figures

Figures reproduced from arXiv: 1906.12170 by Abhinav Thanda, Dilip Kumar Margam, Pujitha A K, Rohith Aralikatti, Shankar M Venkatesan, Sharad Roy, Tanay Sharma.

**Figure 1.** Architecture of the 3D-2D-CNN-BLSTM network.

Surrounding text: "… these features for context-independent phones (mono-phone GMM-HMM model). Bootstrapped by the alignments generated from the mono-phone model, a tri-phone GMM-HMM (a model with context-dependent phones) is trained. Then a GMM-HMM with LDA-transformed features is trained. Finally, HMM tied-state labels obtained from the GMM-HMM trained with LDA transforms are used for training BLSTM…"

**Figure 2.** Training pipeline of the BLSTM-HMM model with bottleneck features obtained from the 3D-2D-CNN-BLSTM network as input.

Surrounding text (§3.2.1, Feature Duplication): "We observed that duplicating each input feature exactly 4 times gives dramatic improvements for hybrid models. The…"
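The Figure 2 text reports that duplicating each input feature exactly 4 times helps the hybrid models. The sketch below assumes this means repeating every frame-level feature vector along the time axis, which the page does not spell out. `duplicate_frames` and the toy vectors are hypothetical.

```python
def duplicate_frames(features, factor=4):
    """Repeat each frame-level feature vector `factor` times along time.

    `features` is a list of per-frame vectors (lists of floats). The
    time-axis interpretation is an assumption, not a confirmed detail.
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")
    return [list(frame) for frame in features for _ in range(factor)]


# Two toy bottleneck vectors become eight frames after duplication.
toy = [[0.1, 0.2], [0.3, 0.4]]
stretched = duplicate_frames(toy)
print(len(stretched))  # prints 8
```

Under this reading, a video sampled at 25 frames per second would yield 100 feature vectors per second after duplication.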

Original abstract

In recent years, deep learning based machine lipreading has gained prominence. To this end, several architectures such as LipNet, LCANet and others have been proposed which perform extremely well compared to traditional lipreading DNN-HMM hybrid systems trained on DCT features. In this work, we propose a simpler architecture of 3D-2D-CNN-BLSTM network with a bottleneck layer. We also present analysis of two different approaches for lipreading on this architecture. In the first approach, 3D-2D-CNN-BLSTM network is trained with CTC loss on characters (ch-CTC). Then BLSTM-HMM model is trained on bottleneck lip features (extracted from 3D-2D-CNN-BLSTM ch-CTC network) in a traditional ASR training pipeline. In the second approach, same 3D-2D-CNN-BLSTM network is trained with CTC loss on word labels (w-CTC). The first approach shows that bottleneck features perform better compared to DCT features. Using the second approach on Grid corpus' seen speaker test set, we report $1.3\%$ WER - a $55\%$ improvement relative to LCANet. On unseen speaker test set we report $8.6\%$ WER which is $24.5\%$ improvement relative to LipNet. We also verify the method on a second dataset of $81$ speakers which we collected. Finally, we also discuss the effect of feature duplication on BLSTM-HMM model performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a 3D-2D-CNN-BLSTM architecture for lipreading and evaluates two approaches: (1) character-level CTC training followed by BLSTM-HMM on extracted bottleneck features, and (2) word-level CTC training. On the GRID corpus it reports 1.3% WER on the seen-speaker test set (55% relative improvement over LCANet) and 8.6% WER on the unseen-speaker test set (24.5% relative improvement over LipNet); it also states that the method was verified on a self-collected 81-speaker dataset and discusses the effect of feature duplication on the BLSTM-HMM stage.

Significance. If the reported WER numbers prove comparable under identical evaluation conditions, the work would show that a relatively compact 3D-2D-CNN-BLSTM model trained with word-CTC can substantially outperform prior published lipreading systems on a public benchmark, while the hybrid CTC-plus-HMM pipeline demonstrates the utility of learned bottleneck features over hand-crafted DCT features.

major comments (2)

[Abstract and §4 (results)] The central claim consists of the 1.3% and 8.6% WER figures together with the stated relative gains. These numbers are only interpretable if the authors' 'seen speaker test set' and 'unseen speaker test set' are exactly the same speaker partitions, frame-rate, cropping, and word-segmentation protocol used by the LCANet and LipNet papers. The manuscript supplies neither the speaker IDs, a statement of disjointness from training speakers, nor an explicit confirmation that the evaluation protocol matches the baselines.
[Abstract and §3 (experimental setup)] No training hyper-parameters, optimizer settings, data-augmentation details, or baseline re-implementation protocol are provided. Because the numerical support for the claimed improvements rests entirely on the reported WER values, the absence of these details prevents verification that the gains are not due to differences in preprocessing or optimization.
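The disjointness statement requested in the first comment is cheap to produce mechanically. The sketch below audits a train/test split given as (speaker, utterance) pairs. The identifiers are hypothetical placeholders and `audit_split` is an illustrative name, not part of any GRID tooling.

```python
def audit_split(train, test):
    """Report overlap between two collections of (speaker_id, utterance_id) pairs."""
    train, test = set(train), set(test)
    train_speakers = {speaker for speaker, _ in train}
    test_speakers = {speaker for speaker, _ in test}
    return {
        # Any entry here is direct leakage between train and test.
        "shared_utterances": sorted(train & test),
        # Expected to be non-empty for a seen-speaker test set, empty for unseen.
        "shared_speakers": sorted(train_speakers & test_speakers),
        "unseen_speakers": sorted(test_speakers - train_speakers),
    }


# Hypothetical IDs, for illustration only.
train_set = [("s1", "u1"), ("s2", "u2")]
test_set = [("s1", "u3"), ("s3", "u4")]
report = audit_split(train_set, test_set)
print(report["shared_speakers"], report["unseen_speakers"])  # prints ['s1'] ['s3']
```

A seen-speaker evaluation should show shared speakers but no shared utterances. An unseen-speaker evaluation should show neither.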

minor comments (2)

[Abstract] The second dataset is described only as '81 speakers which we collected'; no WER numbers, speaker counts for train/test, or comparison to GRID are supplied, so the verification claim cannot be assessed.
[§2] Notation for the two CTC variants (ch-CTC vs. w-CTC) and the bottleneck layer is introduced without an accompanying diagram or equation that would clarify the precise location of the bottleneck relative to the BLSTM.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the comments, which highlight important aspects of reproducibility. We address each major comment below.

Referee: [Abstract and §4 (results)] The central claim consists of the 1.3% and 8.6% WER figures together with the stated relative gains. These numbers are only interpretable if the authors' 'seen speaker test set' and 'unseen speaker test set' are exactly the same speaker partitions, frame-rate, cropping, and word-segmentation protocol used by the LCANet and LipNet papers. The manuscript supplies neither the speaker IDs, a statement of disjointness from training speakers, nor an explicit confirmation that the evaluation protocol matches the baselines.

Authors: We agree that the manuscript must explicitly document the evaluation protocol to support the claimed gains. In the revision we will add the speaker IDs used for the seen-speaker and unseen-speaker test sets, a statement confirming they are disjoint from the training speakers, and an explicit declaration that frame-rate, cropping, and word-segmentation follow the protocols of the LCANet and LipNet papers. This will make the 1.3% and 8.6% WER figures directly comparable under identical conditions. Revision: yes.

Referee: [Abstract and §3 (experimental setup)] No training hyper-parameters, optimizer settings, data-augmentation details, or baseline re-implementation protocol are provided. Because the numerical support for the claimed improvements rests entirely on the reported WER values, the absence of these details prevents verification that the gains are not due to differences in preprocessing or optimization.

Authors: We acknowledge that the current manuscript omits these implementation details. The revised version will contain a dedicated experimental-setup subsection listing the optimizer, learning-rate schedule, batch size, number of epochs, data-augmentation transforms, and the exact protocol followed when comparing against the published LCANet and LipNet numbers. These additions will allow independent verification that the reported improvements are not artifacts of differing preprocessing or optimization. Revision: yes.

Circularity Check

0 steps flagged

No circularity detected in empirical performance reporting

The paper is an empirical ML study reporting WER numbers on the public GRID corpus (seen/unseen speaker splits) and a self-collected 81-speaker set, with direct numerical comparisons to independently published baselines (LipNet, LCANet). No derivation chain, equations, or fitted parameters exist that reduce any claimed result to an input by construction. The reported improvements are external benchmark comparisons, not internal predictions or self-referential quantities. Self-citation is absent from the load-bearing claims.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, mathematical axioms, or newly postulated entities; typical deep-learning hyperparameters (learning rate, layer sizes, CTC blank weight) are implicitly present but undocumented.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 9 internal anchors

[1] Introduction. The process of using only visual information of lip movements to convert speech to text is called machine lipreading. The visual information such as lip movements, facial expression, tongue and teeth movements helps us in understanding other person’s speech when audio is interrupted or corrupted. Thus visual information acts as complimentary ...

[2] Related Work. Two decades ago, lipreading is seen as a word classification problem, where each input video is classified to one of the limited words. The authors in [14] do word classification using different variations of 3D CNN architectures. Word classification using CNNs followed by RNNs or HMMs is presented in number of different papers [4, 15, 16]. L...

[3] Models. 3.1. 3D-2D-CNN-BLSTM Network. In this section we describe the architecture of proposed 3D-2D-CNN-BLSTM network. This network contains two 3D CNN layers, two 2D CNN layers followed by two BLSTM layers as shown in the Figure 1. In 3D convolution kernel moves along time, height and width dimensions of input. Whereas in 2D convolution kernel only mov...

[4] … includes only segmented words. And from second epoch all full sentences are included. Curriculum learning has helped us in faster convergence, for 3D-2D-CNN-BLSTM w-CTC (described in subsection 3.3) model with curriculum learning it took 45 epochs to converge whereas without curriculum learning it took 89 epochs to converge. Similarly for other exper-...

[5] Experiments. 4.1. Datasets. 4.1.1. Grid. The Grid audio-visual dataset is the widely used data for audio-visual or visual speech recognition tasks [22]. Each sentence of Grid has a fixed structure with six words structured as: command + color + preposition + letter + digit + adverb. For example “set red with m six please”. The dataset has 51 unique words....

[6] Results. Results are reported on seen test set and on unseen test set for Grid dataset. For Indian English (In-En) dataset we report results on unseen dataset. The results for our proposed approaches compared with baseline DCT BLSTM-HMM and with LIPNET, LCANet, WAS (WLAS) are presented in Table 3. It is shown that 3D-2D-CNN-BLSTM w-CTC approach has ac...

[7] Conclusion. We proposed new 3D-2D-CNN-BLSTM architecture, which is comparable to LCANet on Grid when ch-CTC loss is used. The proposed 3D-2D-CNN-BLSTM w-CTC has given state-of-the-art results with relative improvement of 55% and 24.5% on Grid seen and unseen test sets with 1.3% WER and 8.6% WER respectively. We also demonstrated that 3D-2D-CNN BLSTM-HMM...

[8] A. Thanda and S. M. Venkatesan, “Multi-task learning of deep neural networks for audio visual automatic speech recognition,” arXiv preprint arXiv:1701.02477, 2017.

[9] T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Deep audio-visual speech recognition,” arXiv preprint arXiv:1809.02108, 2018.

[10] R. Aralikatti, D. Margam, T. Sharma, T. Abhinav, and S. M. Venkatesan, “Global SNR estimation of speech signals using entropy and uncertainty estimates from dropout networks,” arXiv preprint arXiv:1804.04353, 2018.

[11] Z. Zhou, G. Zhao, X. Hong, and M. Pietikäinen, “A review of recent advances in visual speech decoding,” Image and Vision Computing, vol. 32, no. 9, pp. 590–605, 2014.

[12] S. Hilder, R. W. Harvey, and B.-J. Theobald, “Comparison of human and machine-based lip-reading,” in AVSP, 2009, pp. 86–89.

[13] J. S. Chung, A. W. Senior, O. Vinyals, and A. Zisserman, “Lip reading sentences in the wild,” in CVPR, 2017, pp. 3444–3453.

[14] Y. M. Assael, B. Shillingford, S. Whiteson, and N. de Freitas, “LipNet: End-to-end sentence-level lipreading,” arXiv preprint arXiv:1611.01599, 2016.

[15] K. Xu, D. Li, N. Cassimatis, and X. Wang, “LCANet: End-to-end lipreading with cascaded attention-CTC,” in Automatic Face & Gesture Recognition (FG 2018), 2018 13th IEEE International Conference on. IEEE, 2018, pp. 548–555.

[16] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006, pp. 369–376.

[17] A. Thanda and S. M. Venkatesan, “Audio visual speech recognition using deep recurrent neural networks,” in IAPR Workshop on Multimodal Pattern Recognition of Social Signals in Human-Computer Interaction. Springer, 2016, pp. 98–109.

[18] K. Audhkhasi, B. Ramabhadran, G. Saon, M. Picheny, and D. Nahamoo, “Direct acoustics-to-word models for English conversational speech recognition,” arXiv preprint arXiv:1703.07754, 2017.

[20] J. Huang and B. Kingsbury, “Audio-visual deep learning for noise robust speech recognition,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 7596–7599.

[21] J. S. Chung and A. Zisserman, “Lip reading in the wild,” in Asian Conference on Computer Vision. Springer, 2016, pp. 87–103.

[22] S. Gergen, S. Zeiler, A. H. Abdelaziz, R. M. Nickel, and D. Kolossa, “Dynamic stream weighting for turbo-decoding-based audiovisual ASR,” in INTERSPEECH, 2016, pp. 2135–2139.

[23] S. Petridis, Z. Li, and M. Pantic, “End-to-end visual speech recognition with LSTMs,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 2592–2596.

[24] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009, pp. 41–48.

[25] C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, D. Vergyri, J. Sison, and A. Mashari, “Audio visual speech recognition,” IDIAP, Tech. Rep., 2000.

[26] B. Shillingford, Y. Assael, M. W. Hoffman, T. Paine, C. Hughes, U. Prabhu, H. Liao, H. Sak, K. Rao, L. Bennett et al., “Large-scale visual speech recognition,” arXiv preprint arXiv:1807.05162, 2018.

[27] H. Soltau, H. Liao, and H. Sak, “Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech recognition,” arXiv preprint arXiv:1610.09975, 2016.

[28] A. G. Chiţu and L. J. Rothkrantz, “The influence of video sampling rate on lipreading performance,” in 12-th International Conference on Speech and Computer (SPECOM’2007), 2007, pp. 678–684.

[29] M. Cooke, J. Barker, S. Cunningham, and X. Shao, “An audio-visual corpus for speech perception and automatic speech recognition,” The Journal of the Acoustical Society of America, vol. 120, no. 5, pp. 2421–2424, 2006.

[30] J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” arXiv preprint, 2017.

[31] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “TensorFlow: A system for large-scale machine learning,” in OSDI, vol. 16, 2016, pp. 265–283.

[32] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

[33] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, no. EPFL-CONF-192584. IEEE Signal Processing Society, 2011.
