Head-Pose-Aware Visual Speech Recognition with FiLM Modulation

Haibo Zhang; Matthew Kit Khinn Teng; Takeshi Saitoh

arxiv: 2606.00751 · v1 · pith:5YAD7VBZnew · submitted 2026-05-30 · 💻 cs.CV

Pith reviewed 2026-06-28 19:15 UTC · model grok-4.3

classification 💻 cs.CV

keywords: visual speech recognition · head pose · FiLM modulation · VSR robustness · LRS2 · LRS3 · phoneme recognition · non-frontal views

The pith

Explicit head-pose modulation through a residual FiLM block refines visual features to improve VSR robustness on non-frontal views.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a two-stage pipeline. Stage 1 uses head-pose information to condition visual feature extraction through a residual FiLM block placed after the 2D CNN frontend; Stage 2 applies a pretrained NLLB language model to convert the predicted phonemes to text. The design targets the geometric distortions and occlusions that arise when speakers are viewed from the side rather than frontally. Experiments on the LRS2 and LRS3 datasets report word error rates of 25.0% and 33.2%, respectively, under comparable training conditions and without extra training data. Ablations indicate that a single residual FiLM block lowers overall word error rate and that deeper modulation at Layers 3 and 4 yields larger gains on samples whose yaw exceeds 30 degrees, without degrading performance at smaller pose variations. A reader would care because the method supplies a lightweight, explicit way to make lip-reading systems function outside controlled studio angles.

Core claim

The authors claim that a pose-conditioned residual Feature-wise Linear Modulation block inserted after the 2D CNN frontend can adaptively refine visual representations using head-pose information, thereby reducing the effect of pose-induced variations on phoneme recognition and allowing competitive word error rates on LRS2 and LRS3 without additional training data.

What carries the argument

A pose-conditioned residual Feature-wise Linear Modulation (FiLM) block, placed after the 2D CNN frontend, that uses head-pose values to scale and shift visual features inside the Stage 1 encoder, upstream of the Stage 2 language model.
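To make the mechanism concrete, here is a minimal sketch of one way a residual FiLM block could work. It assumes the common form y = x + (gamma(p) * x + beta(p)), where p is a (yaw, pitch, roll) vector and gamma, beta come from a single linear projection. The function names, the linear projection, the zero initialisation, and the toy shapes are all illustrative assumptions; the abstract does not specify the block at this level of detail.

```python
import math

def linear(x, weight, bias):
    """Dense layer; weight holds one row of input weights per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def pose_to_film_params(pose, w_gamma, b_gamma, w_beta, b_beta):
    """Project a (yaw, pitch, roll) vector to one scale and one shift per channel."""
    return linear(pose, w_gamma, b_gamma), linear(pose, w_beta, b_beta)

def residual_film(features, gamma, beta):
    """Apply y = x + (gamma * x + beta) channel by channel.

    features is a list of channels, each a flat list of activations.
    """
    return [
        [x + (g * x + b) for x in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]

# Hypothetical inputs: a 35-degree yaw and a tiny 2-channel feature map.
pose = [math.radians(35.0), 0.0, 0.0]
features = [[1.0, 2.0], [3.0, 4.0]]

# Zero-initialised projections (an assumption) give gamma = beta = 0,
# so the block starts out as an identity mapping.
zero_w = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
zero_b = [0.0, 0.0]
gamma, beta = pose_to_film_params(pose, zero_w, zero_b, zero_w, zero_b)
identity_out = residual_film(features, gamma, beta)

# Non-zero parameters scale and shift each channel on top of the skip path.
modulated_out = residual_film(features, [1.0, 0.0], [0.5, -1.0])
```

The skip path is what makes the design residual: when the modulation term is zero the features pass through unchanged, so the block can only alter the representation to the extent the pose projection learns to.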

If this is right

A single residual FiLM block consistently lowers overall word error rate on both LRS2 and LRS3.
Modulation applied at layers 3 and 4 produces larger gains specifically for samples whose yaw angle exceeds 30 degrees.
Performance on small-pose samples remains unchanged or improves, so the method does not trade off frontal accuracy.
The added computation is limited to one lightweight conditioning block, preserving efficiency.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The residual design could be grafted onto other existing VSR backbones without retraining the entire network from scratch.
Even with noisy head-pose estimates, the modulation might still help, provided the residual path is trained to ignore small errors.
The same conditioning idea could be tested on related tasks such as visual emotion recognition where head angle also distorts features.

Load-bearing premise

The head-pose signal fed to the FiLM block must be accurate enough that the resulting modulation does not create new artifacts the downstream language model cannot correct.

What would settle it

A controlled test in which the same visual encoder is run with and without the residual FiLM block on a held-out set of large-yaw samples; if word error rate rises or stays flat for yaw greater than 30 degrees, the central claim is falsified.
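The comparison described above can be sketched as a word error rate computed separately for large-yaw and small-yaw samples. This is a minimal illustration, not the authors' evaluation script: the sample tuples, the `wer_by_yaw` helper, and the use of absolute yaw are assumptions; only the 30-degree threshold comes from the paper.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def wer(pairs):
    """Corpus-level WER: total word edits divided by total reference words."""
    edits = sum(edit_distance(ref.split(), hyp.split()) for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    return edits / words

def wer_by_yaw(samples, threshold_deg=30.0):
    """Split (yaw_deg, reference, hypothesis) samples at the yaw threshold."""
    large = [(r, h) for yaw, r, h in samples if abs(yaw) > threshold_deg]
    small = [(r, h) for yaw, r, h in samples if abs(yaw) <= threshold_deg]
    return {
        "large_yaw": wer(large) if large else None,
        "small_yaw": wer(small) if small else None,
    }

# Hypothetical decoder outputs, for illustration only.
samples = [
    (45.0, "the cat sat down", "the cat sat"),
    (10.0, "hello world", "hello world"),
]
buckets = wer_by_yaw(samples)
```

Running the same function on the outputs of the encoder with and without the residual FiLM block, over the same held-out samples, gives the two large-yaw numbers the test compares.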

Original abstract

Visual Speech Recognition (VSR) aims to recognize speech from visual cues such as lip movements, but its performance is fundamentally limited by viseme ambiguity and pose-induced variations that introduce geometric distortions and occlusions. Existing approaches mainly rely on linguistic context or implicit invariance, leaving visual representations insufficiently robust under non-frontal views. In this work, we propose a pose-aware phoneme-level framework, termed HP-VSR-ResFiLM, that explicitly incorporates head-pose information into visual feature extraction. The proposed framework adopts a two-stage pipeline consisting of a pose-conditioned visual encoder in Stage 1 and a pretrained NLLB language model in Stage 2 for phoneme-to-text reconstruction. Specifically, Stage 1 incorporates a pose-conditioned residual Feature-wise Linear Modulation (FiLM) block after the 2D CNN frontend to adaptively refine visual representations using head-pose information. Experiments on LRS2 and LRS3 demonstrate that HP-VSR-ResFiLM achieves competitive performance under comparable training conditions, attaining word error rates (WER) of 25.0% and 33.2%, respectively, without relying on additional training data. Ablation studies further show that a single residual FiLM block consistently improves overall WER, while deeper modulation at Layers 3 and 4 provides larger gains for samples with yaw angles greater than 30° without degrading performance for smaller pose variations. These findings demonstrate that explicit pose-aware feature modulation offers an effective and computationally efficient solution for improving VSR robustness in unconstrained settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Residual FiLM conditioned on head pose gives WER gains on large-yaw LRS2/LRS3 samples, but the gains rest on untested pose-estimator accuracy.

The main takeaway is that this paper adds a single residual FiLM block after the 2D CNN frontend in a phoneme-level VSR pipeline, feeds it head-pose angles, and reports 25.0% WER on LRS2 and 33.2% on LRS3 using a pretrained NLLB language model for phoneme-to-text reconstruction. The ablation indicates that the block helps overall and that deeper modulation at Layers 3 and 4 helps more for yaw angles above 30 degrees, without hurting near-frontal cases.

What is actually new is the targeted use of residual FiLM for explicit pose conditioning in this setting. The individual pieces (FiLM, CNN frontend, NLLB) are known, but the paper shows a concrete way to inject pose information that produces measurable gains on standard benchmarks under the training conditions the authors used.

The evaluation has clear limits. No error bars, no significance tests, and no head-to-head numbers against recent pose-robust baselines appear in the abstract. More critically, the approach assumes the supplied head-pose vector is accurate enough that the modulation improves features rather than adding new geometric artifacts the language model cannot fix. The reported results do not include pose-estimator error distributions on LRS2/LRS3, any noise-injection tests on the pose input, or checks on whether the residual connection prevents degradation when pose is noisy. Those gaps make the robustness claim harder to assess from the given numbers alone.

The paper is for people already working on visual speech recognition who want a lightweight conditioning trick for non-frontal views. A reader interested in simple feature modulation might find the layer-specific ablation useful. The work is coherent on its own terms and uses public data, so it deserves a serious referee even though additional controls on pose noise would be needed before the robustness story is solid.

I would send it for peer review.

Referee Report

2 major / 1 minor

Summary. The paper introduces HP-VSR-ResFiLM, a pose-aware phoneme-level VSR framework using a two-stage pipeline: a pose-conditioned visual encoder with residual FiLM modulation after the 2D CNN frontend in Stage 1, and a pretrained NLLB language model in Stage 2. It reports WERs of 25.0% on LRS2 and 33.2% on LRS3 under comparable conditions without extra data, and ablations indicate consistent WER improvements from the residual FiLM, with larger gains for yaw >30° when modulating at layers 3 and 4.

Significance. Should the results prove robust under statistical testing and baseline comparisons, this work would establish explicit head-pose modulation via residual FiLM as an efficient means to enhance VSR performance in unconstrained environments. The layer-specific ablation offers concrete guidance on effective integration of pose information into visual feature extractors.

major comments (2)

[Abstract] The central claim of effective robustness improvement rests on reported WER values and ablation gains for yaw >30° that include no error bars, no statistical significance tests, and no comparisons to recent pose-robust VSR baselines.
[Ablation paragraph and Stage 1 description] The load-bearing assumption that the head-pose signal is accurate enough and residual FiLM produces net-positive modulation (rather than uncorrectable artifacts) is not supported by any pose-estimator error distribution or noise-injection ablation.
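One way the missing significance test could be run is a paired sign-flip permutation test over per-utterance word-error counts from two systems scored on the same utterances. The sketch below is an illustration under that assumption, not a procedure taken from the paper; `paired_permutation_pvalue` and its defaults are hypothetical. Because both systems share the same references, a difference in total edits corresponds directly to a difference in corpus-level WER.

```python
import random

def paired_permutation_pvalue(errors_a, errors_b, n_resamples=2000, seed=0):
    """Two-sided sign-flip permutation test on paired per-utterance error counts.

    errors_a[i] and errors_b[i] are the word-edit counts of systems A and B
    on the same utterance i.
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)
```

Restricting `errors_a` and `errors_b` to utterances with yaw above 30 degrees would test the subset the paper's headline ablation claim depends on.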

minor comments (1)

[Abstract] The 'computationally efficient' claim would be strengthened by reporting the parameter/FLOP overhead of the single residual FiLM block.
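As a rough illustration of how such an overhead figure could be derived, the sketch below counts the parameters of a hypothetical FiLM generator (a two-layer MLP mapping the pose vector to per-channel scale and shift) and the multiply-adds of the modulation itself. The MLP shape, the hidden width, and the feature-map size are assumptions chosen for the example; the abstract gives none of them.

```python
def film_param_count(pose_dim, hidden_dim, channels):
    """Weights plus biases of a pose_dim -> hidden_dim -> 2*channels MLP."""
    first = pose_dim * hidden_dim + hidden_dim
    second = hidden_dim * 2 * channels + 2 * channels
    return first + second

def film_modulation_macs(channels, height, width):
    """One multiply-add per activation for gamma * x + beta (skip add not counted)."""
    return channels * height * width

# Hypothetical configuration: 3 pose angles, 64 hidden units, 64 channels.
params = film_param_count(pose_dim=3, hidden_dim=64, channels=64)
macs_per_frame = film_modulation_macs(channels=64, height=22, width=22)
```

Reporting these two numbers next to the totals for the full visual encoder would let a reader judge the efficiency claim directly.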

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point-by-point below, indicating where revisions will be made to improve statistical rigor and validation of assumptions.

Referee: [Abstract] The central claim of effective robustness improvement rests on reported WER values and ablation gains for yaw >30° that include no error bars, no statistical significance tests, and no comparisons to recent pose-robust VSR baselines.

Authors: We agree that error bars, statistical tests, and baseline comparisons would strengthen the claims. In revision we will add standard deviations from repeated runs (where compute permits) and paired significance tests on the yaw >30° subset. We will also expand the experiments section with a comparison table against recent pose-robust VSR methods under comparable data conditions, or explicitly note when direct replication is not feasible.

revision: partial

Referee: [Ablation paragraph and Stage 1 description] The load-bearing assumption that the head-pose signal is accurate enough and residual FiLM produces net-positive modulation (rather than uncorrectable artifacts) is not supported by any pose-estimator error distribution or noise-injection ablation.

Authors: We accept this criticism. The revised manuscript will report the pose-estimator error distribution on LRS2/LRS3 and add a controlled noise-injection ablation (Gaussian noise on yaw/pitch/roll inputs) to verify that FiLM modulation remains net-positive. These results will be placed in the ablation studies section.

revision: yes
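The Gaussian noise-injection ablation mentioned in the second response could be organised along the following lines. This is a sketch under stated assumptions: `perturb_pose` and `noise_sweep` are hypothetical helpers, angles are taken to be in degrees, and `evaluate` stands in for whatever caller-supplied routine runs the model on perturbed poses and returns a score such as WER.

```python
import random

def perturb_pose(pose_deg, sigma_deg, rng):
    """Add zero-mean Gaussian noise (in degrees) to each of yaw, pitch, roll."""
    return tuple(angle + rng.gauss(0.0, sigma_deg) for angle in pose_deg)

def noise_sweep(poses, sigmas_deg, evaluate, seed=0):
    """Score a model at several pose-noise levels.

    evaluate is a caller-supplied function: list of poses -> score.
    The generator is re-seeded per level so every level sees the same draws,
    scaled by its own sigma.
    """
    results = {}
    for sigma in sigmas_deg:
        rng = random.Random(seed)
        noisy = [perturb_pose(pose, sigma, rng) for pose in poses]
        results[sigma] = evaluate(noisy)
    return results
```

Plotting the score against sigma for the model with and without the residual FiLM block would show at what noise level, if any, the modulation stops being net-positive.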

Circularity Check

0 steps flagged

No circularity; empirical architecture evaluated on public benchmarks

The paper describes a two-stage VSR pipeline that inserts a residual FiLM block conditioned on head-pose after a 2D CNN frontend, then reports WER numbers on LRS2/LRS3. No derivation, theorem, or equation chain is present that reduces the claimed gains to a fitted parameter, self-citation, or input by construction. Ablations compare modulation depth and yaw ranges but remain standard empirical comparisons. The central result is therefore an externally falsifiable performance number on fixed public datasets rather than a self-referential identity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on standard computer-vision assumptions and a pretrained language model; no new entities are postulated and no free parameters are fitted inside the reported claim.

axioms (2)

domain assumption: A 2D CNN frontend can extract useful lip features from video frames. Invoked in the Stage 1 description as the base visual encoder before FiLM modulation.
domain assumption: Head-pose angles can be supplied as an external conditioning signal. Assumed when the residual FiLM block receives pose information.

[1] [1]

The journal of the acoustical society of america26(2), 212–215 (1954)

Pollack, I.: Visual contribution to speech intelligibility in noise. The journal of the acoustical society of america26(2), 212–215 (1954)

1954

[2] [2]

Warm, comforting recollection

Ma, P., Haliassos, A., Fernandez-Lopez, A., Chen, H., Petridis, S., Pantic, M.: Auto-AVSR: Audio-visual speech recognition with automatic labels. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5 (2023). https://doi.org/10.1109/ICASSP49357.2023.10096889

work page doi:10.1109/icassp49357.2023.10096889 2023

[3] [3]

Nature Machine Intelligence4(11), 930–939 (2022) https://doi.org/ 10.1038/s42256-022-00550-z

Ma, P., Petridis, S., Pantic, M.: Visual speech recognition for multiple languages in the wild. Nature Machine Intelligence4(11), 930–939 (2022) https://doi.org/ 10.1038/s42256-022-00550-z

work page doi:10.1038/s42256-022-00550-z 2022

[4] [4]

In: INTERSPEECH, pp

Stafylakis, T., Tzimiropoulos, G.: Combining residual networks with LSTMs for lipreading. In: INTERSPEECH, pp. 3652–3656 (2017). https://doi.org/10.21437/ Interspeech.2017-85

2017

[5] [5]

In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp

Petridis, S., Li, Z., Pantic, M.: End-to-end visual speech recognition with lstms. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2592–2596 (2017). IEEE

2017

[6] [6]

In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp

Wand, M., Koutn´ ık, J., Schmidhuber, J.: Lipreading with long short-term mem- ory. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6115–6119 (2016). IEEE

2016

[7] [7]

LipNet: End-to-End Sentence-level Lipreading

Assael, Y.M., Shillingford, B., Whiteson, S., Freitas, N.: LipNet: End-to-end sentence-level lipreading. arXiv preprint arXiv:1611.01599 (2016) https://doi. org/10.48550/arXiv.1611.01599 23

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1611.01599 2016

[8] [8]

The Journal of the Acoustical Society of America120(5), 2421–2424 (2006)

Cooke, M., Barker, J., Cunningham, S., Shao, X.: An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America120(5), 2421–2424 (2006)

2006

[9] [9]

In: 13th Asian Conference on Computer Vision (ACCV), pp

Chung, J.S., Zisserman, A.: Lip reading in the wild. In: 13th Asian Conference on Computer Vision (ACCV), pp. 87–103 (2016)

2016

[10] [10]

In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp

Chung, J.S., Senior, A., Vinyals, O., Zisserman, A.: Lip reading sentences in the wild. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3444–3453 (2017). https://doi.org/10.1109/CVPR.2017.367

work page doi:10.1109/cvpr.2017.367 2017

[11] [11]

LRS3-TED: a large-scale dataset for visual speech recognition

Afouras, T., Chung, J.S., Zisserman, A.: LRS3-TED: a large-scale dataset for visual speech recognition. arXiv preprint arXiv:1809.00496 (2018) https://doi. org/10.48550/arXiv.1809.00496

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1809.00496 2018

[12] [12]

In: INTERSPEECH (2022)

Serdyuk, D., Braga, O., Siohan, O.: Transformer-based video front-ends for audio- visual speech recognition for single and muti-person video. In: INTERSPEECH (2022). https://doi.org/10.21437/Interspeech.2022-10920

work page doi:10.21437/interspeech.2022-10920 2022

[13] [13]

In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp

Hu, Y., Li, R., Chen, C., Qin, C., Zhu, Q.-S., Chng, E.S.: Hearing lips in noise: Universal viseme-phoneme mapping and transfer for robust audio-visual speech recognition. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15213–15232 (2023)

2023

[14] [14]

arXiv preprint arXiv:2305.09212 (2023)

Hu, Y., Li, R., Chen, C., Zou, H., Zhu, Q., Chng, E.S.: Cross-modal global inter- action and local alignment for audio-visual speech recognition. arXiv preprint arXiv:2305.09212 (2023)

work page arXiv 2023

[15] [15]

17853–17862.DOI: 10

Liu, X., Lakomkin, E., Vougioukas, K., Ma, P., Chen, H., Xie, R., Doulaty, M., Moritz, N., Kolar, J., Petridis, S., Pantic, M., Fuegen, C.: SynthVSR: Scaling up visual speech recognition with synthetic supervision. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18806–18815 (2023). https://doi.org/10.1109/CVPR52729.2023.01803

work page doi:10.1109/cvpr52729.2023.01803 2023

[16] [16]

Liu, Z., Li, X., Chen, C., Guo, L., Li, L., Wang, D.: AlignVSR: Audio-visual cross-modal alignment for visual speech recognition. In: Proceedings of the 2025 11th International Conference on Communication and Information Processing, pp. 161–165 (2025)

[17] [17]

Zhang, X., Cheng, F., Wang, S.: Spatio-temporal fusion based convolutional sequence learning for lip reading. In: IEEE/CVF International Conference on Computer Vision (ICCV), pp. 713–722 (2019). https://doi.org/10.1109/ICCV.2019.00080

[18] [18]

Djilali, Y.A.D., Narayan, S., Boussaid, H., Almazrouei, E., Debbah, M.: Lip2Vec: Efficient and robust visual speech recognition via latent-to-latent visual to audio representation mapping. In: IEEE/CVF International Conference on Computer Vision (ICCV), pp. 13744–13755 (2023). https://doi.org/10.1109/ICCV51070.2023.01268

[19] [20]

Prajwal, K.R., Afouras, T., Zisserman, A.: Sub-word level lip reading with visual attention. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5162–5172 (2022). https://doi.org/10.1109/CVPR52688.2022.00510

[20] [21]

El-Bialy, R., Chen, D., Fenghour, S., Hussein, W., Xiao, P., Karam, O.H., Li, B.: Developing phoneme-based lip-reading sentences system for silent speech recognition. CAAI Transactions on Intelligence Technology 8(1), 129–138 (2023) https://doi.org/10.1049/cit2.12131

[21] [22]

Yeo, J., Han, S., Kim, M., Ro, Y.M.: Where visual speech meets language: VSP-LLM framework for efficient and context-aware visual speech processing. In: Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 11391–11406 (2024)

[22] [23]

Yeo, J.H., Kim, C.W., Kim, H., Rha, H., Han, S., Cheng, W.-H., Ro, Y.M.: Personalized lip reading: Adapting to your unique lip movements with vision and language. In: AAAI Conference on Artificial Intelligence, vol. 39, pp. 9472–9480 (2025). https://doi.org/10.1609/aaai.v39i9.33026

[23] [24]

Thomas, M., Fish, E., Bowden, R.: VALLR: Visual ASR language model for lip reading. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2846–2856 (2025)

[24] [25]

Eickhoff, P., Möller, M., Rosin, T.P., Twiefel, J., Wermter, S.: Bring the noise: Introducing noise robustness to pretrained automatic speech recognition. In: International Conference on Artificial Neural Networks (ICANN), pp. 376–388 (2023). https://doi.org/10.1007/978-3-031-44195-0_31

[25] [26]

Liu, H., Chen, Z., Yang, B.: Lip graph assisted audio-visual speech recognition using bidirectional synchronous fusion. In: INTERSPEECH, pp. 3520–3524 (2020). https://doi.org/10.21437/Interspeech.2020-3146

[26] [27]

Sheng, C., Zhu, X., Xu, H., Pietikäinen, M., Liu, L.: Adaptive semantic-spatio-temporal graph convolutional network for lip reading. IEEE Transactions on Multimedia 24, 3545–3557 (2021) https://doi.org/10.1109/TMM.2021.3102433

[27] [28]

Anina, I., Zhou, Z., Zhao, G., Pietikäinen, M.: OuluVS2: A multi-view audiovisual database for non-rigid mouth motion analysis. In: 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), vol. 1, pp. 1–5 (2015). IEEE

[28] [29]

Ma, P., Petridis, S., Pantic, M.: End-to-end audio-visual speech recognition with conformers. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7613–7617 (2021). https://doi.org/10.1109/ICASSP39728.2021.9414567

[29] [30]

Isobe, S., Tamura, S., Hayamizu, S., Gotoh, Y., Nose, M.: Multi-angle lipreading with angle classification-based feature extraction and its application to audio-visual speech recognition. Future Internet 13(7), 182 (2021)

[30] [31]

Maeda, T., Tamura, S.: Multi-view convolution for lipreading. In: 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 1092–1096 (2021). IEEE

[31] [32]

Cheng, S., Ma, P., Tzimiropoulos, G., Petridis, S., Bulat, A., Shen, J., Pantic, M.: Towards pose-invariant lip-reading. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4357–4361 (2020). IEEE

[32] [33]

Hao, B., Zhou, D., Li, X., Zhang, X., Xie, L., Wu, J., Yin, E.: LipGen: Viseme-guided lip video generation for enhancing visual speech recognition. In: ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5 (2025). IEEE

[33] [34]

Fernandez-Lopez, A., Chen, H., Ma, P., Haliassos, A., Petridis, S., Pantic, M.: SparseVSR: Lightweight and noise robust visual speech recognition. arXiv preprint arXiv:2307.04552 (2023)

[34] [35]

Perez, E., Strub, F., De Vries, H., Dumoulin, V., Courville, A.: FiLM: Visual reasoning with a general conditioning layer. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)

[35] [36]

Dumoulin, V., Shlens, J., Kudlur, M.: A learned representation for artistic style. arXiv preprint arXiv:1610.07629 (2016)

[36] [37]

De Vries, H., Strub, F., Mary, J., Larochelle, H., Pietquin, O., Courville, A.C.: Modulating early visual processing by language. Advances in neural information processing systems 30 (2017)

[37] [38]

Teng, M.K.K., Zhang, H., Saitoh, T.: Phoneme-level visual speech recognition via point-visual fusion and language model reconstruction. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2026). to appear

[38] [39]

Watanabe, S., Hori, T., Kim, S., Hershey, J.R., Hayashi, T.: Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing 11(8), 1240–1253 (2017) https://doi.org/10.1109/JSTSP.2017.2763455

[39] [40]

Hempel, T., Abdelrahman, A.A., Al-Hamadi, A.: 6D rotation representation for unconstrained head pose estimation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 2496–2500 (2022). IEEE

[40] [41]

Zhang, Y., Bartley, T.M., Graterol-Fuenmayor, M., Lavrukhin, V., Bakhturina, E., Ginsburg, B.: A chat about boring problems: Studying GPT-based text normalization. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 10921–10925 (2024). https://doi.org/10.1109/ICASSP48485.2024.10447169

[41] [42]

Ploujnikov, A., Ravanelli, M.: SoundChoice: Grapheme-to-phoneme models with semantic disambiguation. In: INTERSPEECH, pp. 486–490 (2022). https://doi.org/10.21437/Interspeech.2022-11066

[42] [43]

Chung, J.S., Nagrani, A., Zisserman, A.: VoxCeleb2: Deep speaker recognition. arXiv preprint arXiv:1806.05622 (2018)

[43] [44]

Ephrat, A., Mosseri, I., Lang, O., Dekel, T., Wilson, K., Hassidim, A., Freeman, W.T., Rubinstein, M.: Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation. ACM Transactions on Graphics (TOG) 37(4), 1–11 (2018)

[44] [45]

Jelinek, F., Bahl, L., Mercer, R.: Design of a linguistic statistical decoder for the recognition of continuous speech. IEEE Transactions on Information Theory 21(3), 250–256 (1975) https://doi.org/10.1109/TIT.1975.1055384