
arxiv: 2606.07435 · v2 · pith:PXZKU2CDnew · submitted 2026-06-05 · 💻 cs.CV · cs.CL

The Lipreading Gap: Do VSR Models Perceive Visual Speech Like Human Lipreaders?

Pith reviewed 2026-06-27 21:56 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords visual speech recognition · lipreading · VSR models · viseme accuracy · word frequency · visual perception · human comparison

The pith

VSR models surpass humans on lipreading benchmarks but rely on language cues from training data rather than visual perception.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether visual speech recognition models that exceed human accuracy on lipreading tasks actually perceive speech in a human-like way. It compares three VSR systems against human baselines on a word-level dataset using accuracy at word, character, phoneme, and viseme levels, plus correlations with training frequency and visual informativeness. Models and humans succeed and fail on different words, and a text-only n-gram baseline given only initial phonemes matches human performance. Model errors track word frequency in training data more than visual clarity of the words, while models perform best on the visemes humans find hardest. These patterns indicate that current VSR systems exploit linguistic statistics rather than binding visual features into words.

Core claim

Although VSR models achieve higher overall accuracy than humans, they succeed and fail on different words than humans. A text-only n-gram baseline given only a few initial phonemes rivals human lipreading. VSR word-level errors are consistently better explained by training word frequency than by the visual informativeness of words. Viseme accuracies, confusion matrices and human-model correlations show that models gain most on visemes humans find hardest and show much weaker dependence on visual clarity. This demonstrates that VSR systems rely primarily on language cues from training data rather than visual perception, failing to bind visual features into meaningful words.

What carries the argument

Comparison of word, character, phoneme, and viseme-level metrics plus correlations of errors with training word frequency versus visual informativeness, between VSR models, humans, and a text-only n-gram baseline on the MaFI dataset.
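To make the multi-level comparison concrete, here is a minimal sketch of how one prediction can be scored at the word, character, phoneme, and viseme levels. It assumes a simple length-normalized edit distance, which is not necessarily the normalization the paper uses, and `TOY_PHONEME_TO_VISEME` is a hypothetical toy mapping invented for illustration, not the viseme table from the paper.

```python
from typing import Dict, List, Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance with unit cost for insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]


def sequence_accuracy(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """1 - edit distance / max length (an assumed, simple normalization)."""
    if not ref and not hyp:
        return 1.0
    return 1.0 - levenshtein(ref, hyp) / max(len(ref), len(hyp))


# Hypothetical toy mapping: phonemes that look alike share one viseme class.
TOY_PHONEME_TO_VISEME: Dict[str, str] = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "t": "alveolar", "d": "alveolar",
    "æ": "open_vowel",
}


def to_visemes(phonemes: Sequence[str]) -> List[str]:
    """Map each phoneme to its viseme class; unknown phonemes map to 'other'."""
    return [TOY_PHONEME_TO_VISEME.get(p, "other") for p in phonemes]


# Target word "bat" predicted as "pat": wrong word, visually identical.
ref_ph, hyp_ph = ["b", "æ", "t"], ["p", "æ", "t"]
word_acc = float("bat" == "pat")
char_acc = sequence_accuracy(list("bat"), list("pat"))
phoneme_acc = sequence_accuracy(ref_ph, hyp_ph)
viseme_acc = sequence_accuracy(to_visemes(ref_ph), to_visemes(hyp_ph))
print(word_acc, char_acc, phoneme_acc, viseme_acc)
```

In this toy case the prediction is wrong at the word level and partly wrong at the character and phoneme levels, yet perfect at the viseme level, which is why the finer-grained metrics can separate visual confusions from other errors.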

Load-bearing premise

That differences in word-level success patterns, viseme accuracies, and correlations with word frequency versus visual informativeness demonstrate absence of visual perception rather than other factors such as model architecture or dataset biases.

What would settle it

Train a VSR model on data in which word frequency is independent of visual informativeness. If the claim is false, that model's errors would track visual clarity rather than frequency.

Figures

Figures reproduced from arXiv: 2606.07435 by Naomi Harte, Rishabh Jain.

Figure 1
Figure 1: Viseme-level performance analysis: (A) Per-viseme accuracy comparison across models and humans, with visemes sorted by human performance. (B) Model-human accuracy differences where positive values indicate superior model performance. Viseme IDs in the figure correspond to the mapping presented in …
Figure 2
Figure 2: Viseme confusion differences between Auto-AVSR-Large and human baseline. Viseme IDs in the figure correspond to the mapping presented in …
Figure 3
Figure 3: Spearman correlations between MaFI scores and performance metrics for Auto-AVSR-Large and Human Guesses on easy (high-clarity, MaFI > −1) and hard (low-clarity, MaFI ≤ −1) words.
Original abstract

Visual speech recognition (VSR) models now surpass human lipreaders on benchmarks, but do such gains establish human-like visual speech perception? To explore this, we compare three VSR systems with human baselines on the MaFI word-level lipreading dataset using word, character, phoneme, and viseme-level metrics. Although models achieve higher overall accuracy, they succeed and fail on different words than humans. A text-only n-gram baseline given only a few initial phonemes rivals human lipreading. VSR word-level errors are consistently better explained by training word frequency than by the visual informativeness of words. Viseme accuracies, confusion matrices and human-model correlations further show that models gain most on visemes humans find hardest, and show much weaker dependence on visual clarity. Our work demonstrates that VSR systems rely primarily on language cues from training data rather than visual perception, failing to bind visual features into meaningful words.
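The text-only baseline in the abstract predicts a word from only its first few phonemes. The sketch below illustrates the idea with a deliberately simpler stand-in: a frequency-ranked prefix lookup over a toy lexicon, rather than the KenLM 2-gram and 5-gram phoneme models mentioned in the extracted method text further down. The lexicon, counts, pronunciations, and function names are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence, Tuple

LexEntry = Tuple[str, Tuple[str, ...], int]  # (word, phonemes, corpus count)


def build_lexicon(corpus_words: Sequence[str],
                  pronounce: Dict[str, List[str]]) -> List[LexEntry]:
    """Count training words and attach a pronunciation to each known word."""
    counts = Counter(corpus_words)
    return [(w, tuple(pronounce[w]), c) for w, c in counts.items() if w in pronounce]


def predict_from_prefix(lexicon: Sequence[LexEntry],
                        phonemes: Sequence[str], k: int) -> Optional[str]:
    """Return the most frequent word whose pronunciation starts with the first k phonemes."""
    prefix = tuple(phonemes[:k])
    matches = [(w, c) for w, ph, c in lexicon if ph[:len(prefix)] == prefix]
    if not matches:
        return None
    # Highest count wins; ties are broken alphabetically for determinism.
    return min(matches, key=lambda wc: (-wc[1], wc[0]))[0]


# Hypothetical toy data: no video is used anywhere in this baseline.
PRONOUNCE = {
    "absolutely": ["æ", "b", "s", "ə", "l", "uː", "t", "l", "i"],
    "absent": ["æ", "b", "s", "ə", "n", "t"],
    "about": ["ə", "b", "aʊ", "t"],
}
corpus = ["absolutely"] * 3 + ["absent"] + ["about"] * 5
lexicon = build_lexicon(corpus, PRONOUNCE)

print(predict_from_prefix(lexicon, PRONOUNCE["absolutely"], k=3))
print(predict_from_prefix(lexicon, PRONOUNCE["absent"], k=3))
print(predict_from_prefix(lexicon, PRONOUNCE["absent"], k=5))
```

With K=3 the rarer word "absent" is overridden by the more frequent "absolutely" because both share the prefix "æ b s"; it is recovered only once the prefix becomes unambiguous. That is the frequency-driven behaviour the paper attributes to VSR models.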

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper compares three VSR models against human lipreaders on the MaFI word-level dataset using word-, character-, phoneme-, and viseme-level metrics. It reports that models exceed human accuracy overall yet succeed and fail on different words; a text-only n-gram baseline using only initial phonemes matches human performance; model errors correlate more strongly with training-word frequency than with visual informativeness; and models outperform humans most on visemes humans find hardest. The central claim is that VSR systems rely primarily on language cues from training data rather than visual perception and therefore fail to bind visual features into meaningful words.

Significance. If the empirical patterns hold after appropriate controls, the work identifies a substantive mismatch between benchmark gains and human-like visual speech perception. This would imply that current VSR progress largely exploits linguistic priors rather than learning robust visual-to-word mappings, with direct consequences for generalization, robustness to domain shift, and the design of future architectures that must demonstrably use the visual stream.

major comments (3)
  1. [Abstract / Results] Abstract and Results: the inference that models 'rely primarily on language cues ... rather than visual perception' and 'fail to bind visual features' rests on correlations with word frequency and viseme difficulty, yet the manuscript provides no ablation or controlled experiment that isolates the contribution of the visual stream (e.g., visual-only vs. audio-visual training, feature occlusion, or gradient attribution). Without such evidence the observed patterns could equally arise from architecture, optimization, or dataset statistics that happen to align with frequency.
  2. [Abstract] Abstract: the n-gram baseline demonstrates that language modeling alone can rival human lipreading given partial phoneme input, but does not test whether the actual VSR models under-utilize the visual input they receive. A direct comparison (e.g., model performance with vs. without visual features masked) is required to support the claim that visual perception is absent.
  3. [Abstract] Abstract: the reported correlations between model errors and word frequency versus visual informativeness lack statistical tests, confidence intervals, or error bars. The soundness note indicates these details are missing; their absence prevents assessment of whether the frequency correlation is reliably stronger than the visual one and therefore load-bearing for the central claim.
minor comments (2)
  1. [Methods] The manuscript should specify the exact three VSR architectures, training corpora, and hyper-parameters used, as these details are necessary to interpret whether the observed patterns are architecture-specific.
  2. [Results] Clarify how 'visual informativeness' of words is quantified and how it is distinguished from frequency in the correlation analysis.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight opportunities to strengthen the evidential basis for our central claim. We address each major comment below, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Abstract / Results] Abstract and Results: the inference that models 'rely primarily on language cues ... rather than visual perception' and 'fail to bind visual features' rests on correlations with word frequency and viseme difficulty, yet the manuscript provides no ablation or controlled experiment that isolates the contribution of the visual stream (e.g., visual-only vs. audio-visual training, feature occlusion, or gradient attribution). Without such evidence the observed patterns could equally arise from architecture, optimization, or dataset statistics that happen to align with frequency.

    Authors: We agree that direct ablations (such as feature occlusion or gradient-based attribution) would provide stronger causal evidence isolating visual contributions from linguistic priors. The current manuscript relies on correlational patterns across multiple metrics and a language-only baseline. In revision we will add a controlled occlusion experiment on at least one VSR model and a limitations paragraph acknowledging that architecture or optimization effects cannot be fully ruled out without such controls. revision: yes

  2. Referee: [Abstract] Abstract: the n-gram baseline demonstrates that language modeling alone can rival human lipreading given partial phoneme input, but does not test whether the actual VSR models under-utilize the visual input they receive. A direct comparison (e.g., model performance with vs. without visual features masked) is required to support the claim that visual perception is absent.

    Authors: The n-gram baseline serves to show that linguistic information alone can approximate human performance levels, providing context for why model error patterns track frequency more closely than visual properties. We concur that a direct visual-masking comparison on the VSR models themselves would more conclusively demonstrate under-utilization of the visual stream. We will incorporate such a masking ablation in the revised manuscript. revision: yes

  3. Referee: [Abstract] Abstract: the reported correlations between model errors and word frequency versus visual informativeness lack statistical tests, confidence intervals, or error bars. The soundness note indicates these details are missing; their absence prevents assessment of whether the frequency correlation is reliably stronger than the visual one and therefore load-bearing for the central claim.

    Authors: We appreciate this observation. The original submission omitted formal statistical tests and uncertainty estimates for the correlations. In the revision we will compute and report Pearson/Spearman correlations with bootstrap confidence intervals and p-values comparing the strength of frequency versus visual-informativeness predictors. revision: yes
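One way the promised uncertainty estimates could be computed is a paired percentile bootstrap over words, comparing the magnitude of the frequency correlation with that of the visual-informativeness correlation. The sketch below uses only the Python standard library; the function names, the choice to compare absolute Spearman correlations, and the toy data are assumptions for illustration, not the authors' procedure.

```python
import random
from statistics import mean
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return sxy / (sxx * syy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def bootstrap_diff_ci(score: Sequence[float], freq: Sequence[float],
                      visual: Sequence[float], n_boot: int = 2000,
                      alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile CI for |rho(score, freq)| - |rho(score, visual)|, resampling words."""
    rng = random.Random(seed)
    n = len(score)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        s = [score[i] for i in idx]
        d = (abs(spearman(s, [freq[i] for i in idx]))
             - abs(spearman(s, [visual[i] for i in idx])))
        if d == d:  # drop NaN from degenerate resamples
            diffs.append(d)
    diffs.sort()
    lo = diffs[int(alpha / 2 * len(diffs))]
    hi = diffs[max(int((1 - alpha / 2) * len(diffs)) - 1, 0)]
    return lo, hi


# Hypothetical per-word values: accuracy score, log frequency, MaFI-like score.
toy_score = [float(i) for i in range(30)]
toy_freq = [2.0 * i + (i % 4) for i in range(30)]
toy_visual = [float((7 * i) % 30) for i in range(30)]
print(bootstrap_diff_ci(toy_score, toy_freq, toy_visual, n_boot=500))
```

If the resulting interval lies entirely above zero, the frequency correlation is reliably stronger in magnitude than the visual one on that sample. Resampling the same word indices for both predictors keeps the comparison paired.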

Circularity Check

0 steps flagged

No circularity: conclusions rest on independent empirical correlations

Full rationale

The paper's central claim—that VSR models rely primarily on language cues rather than visual perception—is supported by direct comparisons of model vs. human performance on the MaFI dataset, including word-level error patterns, correlations of errors with training word frequency versus visual informativeness, viseme accuracies, and confusion matrices. These quantities are measured against external references (human baselines, dataset statistics, and visual clarity metrics) and do not reduce to any fitted parameter, self-definition, or self-citation chain. No equations or derivations are present that would create a self-referential loop; the analysis is observational and falsifiable by the reported metrics themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical comparisons using the MaFI dataset and standard accuracy metrics; no free parameters, new entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)
  • domain assumption: The MaFI word-level lipreading dataset and chosen metrics (word, character, phoneme, viseme) provide a valid basis for comparing model and human visual speech perception.
    All reported comparisons and conclusions depend on this dataset and metric choice.



Reference graph

Works this paper leans on

84 extracted references · 3 canonical work pages · 3 internal anchors

  1. [1]

    Modern transformer-based models achieve word error rates (WER) below 17% on LRS3 [1, 2], representing a substantial improvement over earlier approaches

    Introduction. Visual Speech Recognition (VSR) has achieved remarkable performance on standard benchmarks. Modern transformer-based models achieve word error rates (WER) below 17% on LRS3 [1, 2], representing a substantial improvement over earlier approaches. These advances in architecture and self-supervised pretraining suggest that machines have maste...

  2. [2]

    in VSR. We compare three state-of-the-art models: Auto-AVSR [39] (supervised), AV-HuBERT [40] (self-supervised), and VSP-LLM [41] (LLM-based), against human baselines using multi-level, multi-metric correlation analysis, viseme-level comparisons [27], and confusion pattern examination across visual clarity levels. Our analysis reveals systematic di...

  3. [3]

    Materials and Methods. 2.1. Dataset Description. We conduct our analysis using the MaFI dataset [37], which provides Mouth and Facial Informativeness (MaFI) scores for 2,276 English words based on human lipreading experiments conducted with 410 participants (263 native British English speakers and 147 native North American English speakers). In these experi...

  4. [4]

    Detected face regions are cropped and resized to 96×96 pixels, at 25 FPS, which serves as the standard input resolution for all VSR models

    using RetinaFace [45]. Detected face regions are cropped and resized to 96×96 pixels, at 25 FPS, which serves as the standard input resolution for all VSR models. All text sequences are normalized by removing punctuation, converting to lowercase, and standardizing whitespace. 2.4. Mapping the Mouth: Phonemes and Visemes. To enable analysis at both the ph...

  5. [5]

    gondola,

    Recognition Baselines: Humans vs Models. We first establish baseline recognition performance across all VSR models on 2,189 words from the MaFI dataset [37]. While word-level accuracy provides an overall performance measure, it does not indicate whether models are truly using visual cues, as high accuracy could result from reliance on linguistic regulari...

  6. [6]

    VSR models are trained on large labeled corpora where they simultaneously learn visual-to-text mappings and language patterns

    Isolating Language Patterns from Visual Understanding. While VSR models outperform humans (as seen in Section 3), it remains unclear whether this stems from visual understanding or learned language patterns. VSR models are trained on large labeled corpora where they simultaneously learn visual-to-text mappings and language patterns. This raises two crit...

  7. [7]

    (11.1M tokens, 49,928 unique words) and trained 2-gram and 5-gram models using KenLM [50], to learn phoneme sequence patterns from the training data, enabling word prediction from partial or complete phoneme sequences. For each test word, we extract the first K phonemes from the ground-truth word (e.g., K=3 from "absolutely" gives "æ b s") and retr...

  8. [8]

    Successful lipreading therefore relies on recognizing groups of phonemes that share common visual characteristics

    Reading the Lips: Viseme Ambiguity. VSR is challenging because many phonemes appear visually similar or identical when spoken, a phenomenon known as viseme ambiguity [27]. Successful lipreading therefore relies on recognizing groups of phonemes that share common visual characteristics. To assess how VSR models handle this ambiguity compared to humans, we...

  9. [9]

    To further examine this, we conducted correlation analyses [20] across multiple dimensions (Table 6) to assess VSR models’ alignment with human perceptual patterns

    Human-Machine Alignment in Visual Speech Perception. Despite their overall strong performance on viseme accuracy, it remains unclear whether VSR models succeed and fail on the same words that humans find easy or difficult. To further examine this, we conducted correlation analyses [20] across multiple dimensions (Table 6) to assess VSR models’ alignmen...

  10. [10]

    We split the dataset into high-clarity (easy) words (MaFI > −1, N = 1056, visually clear) and low-clarity (hard) words (MaFI ≤ −1, N = 1133, visually ambiguous)

    Performance by Visual Information Clarity. To further assess how the visual saliency of words affects recognition performance, we group them according to their MaFI score, reflecting visual clarity. We split the dataset into high-clarity (easy) words (MaFI > −1, N = 1056, visually clear) and low-clarity (hard) words (MaFI ≤ −1, N = 1133, visually ambigu...

  11. [11]

    Our findings suggest this assumption is incomplete

    Discussion and Conclusion. VSR research implicitly equates improvements in transcription accuracy (such as WER) with progress in visual speech understanding. Our findings suggest this assumption is incomplete. Given the first three phonemes as input, text-only n-grams achieve 76.7% accuracy on in-domain LRS3 vocabulary but only 41% on MaFI, revealing ...

  12. [12]

    Acknowledgments. This publication emanates from research supported by Taighde Éireann – Research Ireland, Grant number 22/FFP-A/11059

  13. [13]

    Generative AI Use Disclosure During the preparation of this work, Claude (Anthropic) was used only for minor English grammar corrections and refining the clarity of written content

  14. [14]

    Synthvsr: Scaling up visual speech recognition with synthetic supervision,

    X. Liu, E. Lakomkin, K. Vougioukas, P. Ma, H. Chen, R. Xie, M. Doulaty, N. Moritz, J. Kolar, S. Petridis et al., “Synthvsr: Scaling up visual speech recognition with synthetic supervision,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 18806–18815

  15. [15]

    Lip reading sentences in the wild,

    J. Son Chung, A. Senior, O. Vinyals, and A. Zisserman, “Lip reading sentences in the wild,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6447–6456

  16. [16]

    Speech Recognition Models are Strong Lip-readers,

    K. R. Prajwal, T. Afouras, and A. Zisserman, “Speech Recognition Models are Strong Lip-readers,” in Interspeech 2024, 2024, pp. 2425–2429

  17. [17]

    SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization,

    Y. J. Ahn, J. Park, S. Park, J. Choi, and K.-E. Kim, “SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization,” in Interspeech 2024, 2024, pp. 867–871

  18. [18]

    Beyond accuracy: quantifying trial-by-trial behaviour of cnns and humans by measuring error consistency,

    R. Geirhos, K. Meding, and F. A. Wichmann, “Beyond accuracy: quantifying trial-by-trial behaviour of cnns and humans by measuring error consistency,” Advances in neural information processing systems (NeurIPS), vol. 33, pp. 13890–13902, 2020

  19. [19]

    Uncovering the visual contribution in audio-visual speech recognition,

    Z. Lin and N. Harte, “Uncovering the visual contribution in audio-visual speech recognition,” in 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

  20. [20]

    Deep problems with neural network models of human vision,

    J. S. Bowers, G. Malhotra, M. Dujmović, M. L. Montero, C. Tsvetkov, V. Biscione, G. Puebla, F. Adolfi, J. E. Hummel, R. F. Heaton et al., “Deep problems with neural network models of human vision,” Behavioral and Brain Sciences, vol. 46, p. e385, 2023

  21. [21]

    LRS3-TED: a large-scale dataset for visual speech recognition

    T. Afouras, J. S. Chung, and A. Zisserman, “LRS3-TED: a large-scale dataset for visual speech recognition,” arXiv preprint arXiv:1809.00496, 2018

  22. [22]

    Do vsr models generalize beyond lrs3?

    Y. A. D. Djilali, S. Narayan, E. LeBihan, H. Boussaid, E. Almazrouei, and M. Debbah, “Do vsr models generalize beyond lrs3?” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024, pp. 6635–6644

  23. [23]

    From hype to insight: Rethinking large language model integration in visual speech recognition,

    R. Jain and N. Harte, “From hype to insight: Rethinking large language model integration in visual speech recognition,” in Proceedings of 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2026

  24. [24]

    Lip-reading: Advances and unresolved questions in a key communication skill,

    M. Battista, F. Collesei, E. Orzan, M. Fantoni, and D. Bottari, “Lip-reading: Advances and unresolved questions in a key communication skill,” Audiology Research, vol. 15, no. 4, p. 89, 2025

  25. [25]

    Neural pathways for visual speech perception,

    L. E. Bernstein and E. Liebenthal, “Neural pathways for visual speech perception,” Frontiers in neuroscience, vol. 8, p. 386, 2014

  26. [26]

    Speech perception by humans and machines,

    R. P. Lippmann, “Speech perception by humans and machines,” in Workshop on the Auditory Basis of Speech Perception, 1996, pp. 309–316

  27. [27]

    Watch or listen: Robust audio-visual speech recognition with visual corruption modeling and reliability scoring,

    J. Hong, M. Kim, J. Choi, and Y. M. Ro, “Watch or listen: Robust audio-visual speech recognition with visual corruption modeling and reliability scoring,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 18783–18794

  28. [28]

    The visual speech head start improves perception and reduces superior temporal cortex responses to auditory speech,

    P. J. Karas, J. F. Magnotti, B. A. Metzger, L. L. Zhu, K. B. Smith, D. Yoshor, and M. S. Beauchamp, “The visual speech head start improves perception and reduces superior temporal cortex responses to auditory speech,” elife, vol. 8, p. e48116, 2019

  29. [29]

    Towards estimating the upper bound of visual-speech recognition: The visual lip-reading feasibility database,

    A. Fernandez-Lopez, O. Martinez, and F. M. Sukno, “Towards estimating the upper bound of visual-speech recognition: The visual lip-reading feasibility database,” in 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017). IEEE Press, 2017, pp. 208–215

  30. [30]

    Comparison of human and machine-based lip-reading,

    S. Hilder, R. Harvey, and B.-J. Theobald, “Comparison of human and machine-based lip-reading,” in Proceedings of the International Conference on Auditory-Visual Speech Processing (AVSP), 2009, pp. 86–89

  31. [31]

    Large-vocabulary audio-visual speech recognition by machines and humans

    G. Potamianos, C. Neti, G. Iyengar, and E. Helmuth, “Large-vocabulary audio-visual speech recognition by machines and humans,” in Interspeech 2001, 2001, pp. 1027–1030

  32. [32]

    Which components of the face do humans and machines best speechread?

    C. Benoit, T. Guiard-Marigny, B. Le Goff, and A. Adjoudani, “Which components of the face do humans and machines best speechread?” in Speechreading by humans and machines: Models, systems, and applications. Springer, 1996, pp. 315–328

  33. [33]

    Evaluating automatic speech recognition systems in comparison with human perception results using distinctive feature measures,

    X. Kong, J.-Y. Choi, and S. Shattuck-Hufnagel, “Evaluating automatic speech recognition systems in comparison with human perception results using distinctive feature measures,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 5810–5814

  34. [34]

    Insights into machine lip reading,

    Y. Lan, R. Harvey, and B.-J. Theobald, “Insights into machine lip reading,” in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 4825–4828

  35. [35]

    Automatic lips reading for audio-visual speech processing and recognition,

    J. Chaloupka, “Automatic lips reading for audio-visual speech processing and recognition,” in Interspeech 2004, 2004, pp. 2505–2508

  36. [36]

    Viseme-dependent weight optimization for CHMM-based audio-visual speech recognition,

    A. Karpov, A. Ronzhin, K. Markov, and M. Železný, “Viseme-dependent weight optimization for CHMM-based audio-visual speech recognition,” in Interspeech 2010, 2010, pp. 2678–2681

  37. [37]

    Lipreading approach for isolated digits recognition under whisper and neutral speech,

    F. Tao and C. Busso, “Lipreading approach for isolated digits recognition under whisper and neutral speech,” in Interspeech 2014, 2014, pp. 1154–1158

  38. [38]

    Lipreading using convolutional neural network,

    K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata, “Lipreading using convolutional neural network,” in Interspeech 2014, 2014, pp. 1149–1153

  39. [39]

    Decoding visemes: Improving machine lip-reading,

    H. L. Bear and R. Harvey, “Decoding visemes: Improving machine lip-reading,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 2009–2013

  40. [40]

    Phoneme-to-viseme mappings: the good, the bad, and the ugly,

    H. L. Bear and R. Harvey, “Phoneme-to-viseme mappings: the good, the bad, and the ugly,” Speech Communication, vol. 95, pp. 40–67, 2017

  41. [41]

    Hearing lips in noise: Universal viseme-phoneme mapping and transfer for robust audio-visual speech recognition,

    Y. Hu, R. Li, C. Chen, C. Qin, Q.-S. Zhu, and E. S. Chng, “Hearing lips in noise: Universal viseme-phoneme mapping and transfer for robust audio-visual speech recognition,” in Proceedings of the 61st Annual Meeting of the ACL, Toronto, Canada, Jul. 2023, pp. 15213–15232

  42. [42]

    Improving the gap in visual speech recognition between normal and silent speech based on metric learning,

    S. Kashiwagi, K. Tanaka, Q. Feng, and S. Morishima, “Improving the gap in visual speech recognition between normal and silent speech based on metric learning,” in Interspeech 2023, 2023, pp. 3397–3401

  43. [43]

    Lip enhancement and multi-view simulation for robust visual speech recognition in mavsr 2025,

    F. Su, C. Li, and J. Liu, “Lip enhancement and multi-view simulation for robust visual speech recognition in mavsr 2025,” in 2025 IEEE 19th International Conference on Automatic Face and Gesture Recognition (FG 2025), 2025, pp. 1–5

  44. [44]

    Human alignment of neural network representations,

    L. Muttenthaler, J. Dippel, L. Linhardt, R. A. Vandermeulen, and S. Kornblith, “Human alignment of neural network representations,” in Proceedings of the International Conference on Learning Representations (ICLR), 2023

  45. [45]

    Visalign: Dataset for measuring the alignment between ai and humans in visual perception,

    J. Lee, S. Kim, S. Won, J. Lee, M. Ghassemi, J. Thorne, J. Choi, O.-K. Kwon, and E. Choi, “Visalign: Dataset for measuring the alignment between ai and humans in visual perception,” Advances in Neural Information Processing Systems (NeurIPS), vol. 36, pp. 77119–77148, 2023

  46. [46]

    Learning from human perception to improve automatic speaker verification in style-mismatched conditions,

    A. Afshan and A. Alwan, “Learning from human perception to improve automatic speaker verification in style-mismatched conditions,” in Interspeech 2022, 2022, pp. 2338–2342

  47. [47]

    Revisiting Parity of Human vs. Machine Conversational Speech Transcription,

    C. Mansfield, S. Ng, G.-A. Levow, R. A. Wright, and M. Ostendorf, “Revisiting Parity of Human vs. Machine Conversational Speech Transcription,” in Interspeech 2021, 2021, pp. 1997–2001

  48. [48]

    Advocating character error rate for multilingual ASR evaluation,

    T. D. K, J. James, D. P. Gopinath, and M. A. K, “Advocating character error rate for multilingual ASR evaluation,” in Findings of the ACL: NAACL 2025. Albuquerque, New Mexico: Association for Computational Linguistics, April 2025, pp. 4926–4935

  49. [49]

    Av-superb: A multi-task evaluation benchmark for audio-visual representation models,

    Y. Tseng et al., “Av-superb: A multi-task evaluation benchmark for audio-visual representation models,” in 2024 IEEE international conference on acoustics, speech and signal processing (ICASSP), 2024, pp. 6890–6894

  50. [50]

    Mouth and facial informativeness norms for 2276 english words,

    A. Krason, Y. Zhang, H. Man, and G. Vigliocco, “Mouth and facial informativeness norms for 2276 english words,” Behavior Research Methods, vol. 56, no. 5, pp. 4786–4801, 2024

  51. [51]

    Do self-supervised speech models develop human-like perception biases?

    J. Millet and E. Dunbar, “Do self-supervised speech models develop human-like perception biases?” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 7591–7605

  52. [52]

    Auto-avsr: Audio-visual speech recognition with automatic labels,

    P. Ma, S. Petridis, and M. Pantic, “Auto-avsr: Audio-visual speech recognition with automatic labels,” in 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5

  53. [53]

    Learning audio-visual speech representation by masked multimodal cluster prediction,

    B. Shi, W.-N. Hsu, K. Lakhotia, and A. Mohamed, “Learning audio-visual speech representation by masked multimodal cluster prediction,” in International Conference on Learning Representations (ICLR), 2022

  54. [54]

    Where visual speech meets language: VSP-LLM framework for efficient and context-aware visual speech processing,

    J. H. Yeo, S. Han, M. Kim, and Y. M. Ro, “Where visual speech meets language: VSP-LLM framework for efficient and context-aware visual speech processing,” in Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, Florida, USA, November 2024, pp. 11391–11406

  55. [55]

    Binary codes capable of correcting deletions, insertions, and reversals,

    V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals,” Soviet physics. Doklady, vol. 10, pp. 707–710, 1965

  56. [56]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    H. Touvron et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023

  57. [57]

    Lora: Low-rank adaptation of large language models

    E. J. Hu et al., “Lora: Low-rank adaptation of large language models,” Proceedings of the International Conference on Learning Representations (ICLR), 2022

  58. [58]

    RetinaFace: Single-shot multi-level face localisation in the wild,

    J. Deng, J. Guo, E. Ververas, I. Kotsia, and S. Zafeiriou, “RetinaFace: Single-shot multi-level face localisation in the wild,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020, pp. 5202–5211

  59. [59]

    Get facial position with viseme,

    Microsoft, “Get facial position with viseme,” Microsoft Learn, accessed: 30 October 2025. [Online]. Available: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-speech-synthesis-viseme

  60. [60]

    Phonemizer: Text to phones transcription for multiple languages in python,

    M. Bernard and H. Titeux, “Phonemizer: Text to phones transcription for multiple languages in python,” Journal of Open Source Software, vol. 6, no. 68, p. 3958, 2021

  61. [61]

    Phoneme-to-viseme mapping for visual speech recognition,

    L. Cappelletta and N. Harte, “Phoneme-to-viseme mapping for visual speech recognition,” in International Conference on Pattern Recognition Applications and Methods, vol. 2. SciTePress, 2012, pp. 322–329

  62. [62]

    A normalized levenshtein distance metric,

    L. Yujian and L. Bo, “A normalized levenshtein distance metric,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 1091–1095, 2007

  63. [63]

    Kenlm: Faster and smaller language model queries,

    K. Heafield, “Kenlm: Faster and smaller language model queries,” in Proceedings of the sixth workshop on statistical machine translation, 2011, pp. 187–197

  64. [64]

    The reliability of a two-item scale: Pearson, cronbach, or spearman-brown?

    R. Eisinga et al., “The reliability of a two-item scale: Pearson, cronbach, or spearman-brown?” International journal of public health, vol. 58, no. 4, pp. 637–642, 2013

  65. [65]

    Video-bench: Human-aligned video generation benchmark,

    H. Han et al., “Video-bench: Human-aligned video generation benchmark,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 18 858–18 868

  66. [66]

    Prediction and constraint in audiovisual speech perception,

    J. E. Peelle and M. S. Sommers, “Prediction and constraint in audiovisual speech perception,” Cortex, vol. 68, pp. 169–181, 2015

  67. [67]

    Large language models are strong audio-visual speech recognition learners,

    U. Cappellazzo et al., “Large language models are strong audio-visual speech recognition learners,” in 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

  68. [68]

    Zero-avsr: Zero-shot audio-visual speech recognition with llms by learning language-agnostic speech representations,

    J. H. Yeo, M. Kim, C. W. Kim, S. Petridis, and Y. M. Ro, “Zero-avsr: Zero-shot audio-visual speech recognition with llms by learning language-agnostic speech representations,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025, pp. 6693–6703

  69. [69]

    Cocktail-Party Audio-Visual Speech Recognition,

    T.-B. Nguyen, N.-Q. Pham, and A. Waibel, “Cocktail-Party Audio-Visual Speech Recognition,” in Interspeech 2025, 2025, pp. 1828–1832

  70. [70]

    Hearing lips and seeing voices,

    H. McGurk and J. MacDonald, “Hearing lips and seeing voices,” Nature, vol. 264, no. 5588, pp. 746–748, 1976

  71. [71]

    An image is worth 16x16 words: Transformers for image recognition at scale,

    A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations (ICLR), 2021

  72. [72]

    Deep saliency models: The quest for the loss function,

    A. Bruckert, H. R. Tavakoli, Z. Liu, M. Christie, and O. Le Meur, “Deep saliency models: The quest for the loss function,” Neurocomputing, vol. 453, pp. 693–704, 2021

  73. [73]

    Predict-and-update network: Audio-visual speech recognition inspired by human speech perception,

    J. Wang, X. Qian, and H. Li, “Predict-and-update network: Audio-visual speech recognition inspired by human speech perception,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024

  74. [74]

    Human-inspired computing for robust and efficient audio-visual speech recognition,

    Q. Liu, J. Wang, Y. Wang, X. Yang, G. Pan, and H. Li, “Human-inspired computing for robust and efficient audio-visual speech recognition,” IEEE Transactions on Computers, 2025

  75. [75]

    Can we read speech beyond the lips? rethinking roi selection for deep visual speech recognition,

    Y. Zhang, S. Yang, J. Xiao, S. Shan, and X. Chen, “Can we read speech beyond the lips? rethinking roi selection for deep visual speech recognition,” in 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020). IEEE, 2020, pp. 356–363

  76. [76]

    Vivit: A video vision transformer,

    A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid, “Vivit: A video vision transformer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 6836–6846

  77. [77]

    Vallr: Visual asr language model for lip reading,

    M. Thomas, E. Fish, and R. Bowden, “Vallr: Visual asr language model for lip reading,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025

  78. [78]

    Swinlip: An efficient visual speech encoder for lip reading using swin transformer,

    Y.-H. Park, R.-H. Park, and H.-M. Park, “Swinlip: An efficient visual speech encoder for lip reading using swin transformer,” Neurocomputing, p. 130289, 2025

  79. [79]

    Efficient training for multilingual visual speech recognition: Pre-training with discretized visual speech representation,

    M. Kim, J. Yeo, S. J. Park, H. Rha, and Y. M. Ro, “Efficient training for multilingual visual speech recognition: Pre-training with discretized visual speech representation,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 1311–1320

  80. [80]

    Phoneme-level visual speech recognition via point-visual fusion and language model reconstruction,

    M. Kit Khinn Teng, H. Zhang, and T. Saitoh, “Phoneme-level visual speech recognition via point-visual fusion and language model reconstruction,” arXiv e-prints, pp. arXiv–2507, 2025

Showing first 80 references.