Assessing True Generalisability of Audio-Visual Speech Recognisers

Maja Pantic; Naomi Harte; Stavros Petridis; Zhaofeng Lin

arxiv: 2606.07259 · v1 · pith:BQ2R6I6Bnew · submitted 2026-06-05 · 📡 eess.AS · cs.SD

Assessing True Generalisability of Audio-Visual Speech Recognisers

Zhaofeng Lin , Stavros Petridis , Maja Pantic , Naomi Harte This is my paper

Pith reviewed 2026-06-27 21:02 UTC · model grok-4.3

classification 📡 eess.AS cs.SD

keywords audio-visual speech recognitiongeneralisabilityoverfittingLRS3 benchmarkMultiVSR datasetlexical biastest set construction

0 comments

The pith

Audio-visual speech recognition models that reach near-perfect LRS3 scores suffer sharp accuracy drops on a new test set that exactly matches LRS3 distributions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds an unseen evaluation subset from the MultiVSR dataset that is constructed to match the acoustic, visual, and demographic distributions of the LRS3 test set. Five state-of-the-art AVSR architectures are evaluated on this set and all exhibit large performance degradation. The results indicate that high scores on the standard benchmark reflect overfitting to its specific properties rather than robust generalisability. Additional analysis across seven attributes identifies lexical bias as a key driver, shows distinct error patterns, and reveals that audio-visual fusion sometimes underperforms audio-only recognition. The authors release the matched subset to enable stricter future benchmarking.

Core claim

By constructing a controlled, unseen evaluation set subsampled from MultiVSR that strictly matches the acoustic, visual, and demographic distributions of the LRS3 test set, the authors show that five state-of-the-art AVSR architectures undergo a universal performance collapse. This establishes that current systems fail to generalise even under strictly aligned conditions. Fine-grained attribute analysis across seven factors isolates the drivers of degradation, while further examination uncovers a profound lexical bias, distinct error patterns, and cases where audio-visual performance lags behind audio-only settings.

What carries the argument

The distribution-matched unseen evaluation set subsampled from MultiVSR, which isolates generalisability failure from any distribution shift.

If this is right

Standard LRS3-style benchmarks are insufficient to certify generalisability of AVSR models.
Lexical bias must be mitigated during training to reduce the observed degradation.
Current audio-visual fusion can degrade accuracy relative to audio alone under matched conditions.
The released matched test set supplies a stricter benchmark for future AVSR development.
Performance collapse occurs even when acoustic, visual, and demographic factors are controlled.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Models may be relying on dataset-specific artifacts instead of learning robust multimodal speech patterns.
Similar distribution-matched subsets could be built from other large corpora to test for hidden overfitting in related tasks.
The audio-visual performance lag points to potential weaknesses in how visual features are integrated when conditions are tightly controlled.

Load-bearing premise

The subsampled evaluation set from MultiVSR strictly matches the acoustic, visual, and demographic distributions of the LRS3 test set.

What would settle it

Showing that the five architectures maintain near-LRS3 performance levels on the matched set, or that any measured drop stems from residual mismatches in the subsampling process, would falsify the universal collapse claim.

Figures

Figures reproduced from arXiv: 2606.07259 by Maja Pantic, Naomi Harte, Stavros Petridis, Zhaofeng Lin.

**Figure 1.** Figure 1: Distribution of all 7 factors on the LRS3 Test set and MV2LRS3 set. pass fully-supervised end-to-end models, self-supervised models fine-tuned on LRS3, and the latest architectures integrating speech foundation models and Large Language Models: • AV-HuBERT [6]: A self-supervised learning framework that pre-trains on unlabelled audio-visual data to learn robust representations and then fine-tunes on the L… view at source ↗

**Figure 2.** Figure 2: shows the relationship between WER on the MV2LRS3 and the LRS3 Test set. We observe that the performance degradation can be approximated by a linear function: W ERMV 2LRS3 = 10.4 × W ERLRS3 + 8.1 (1) The steep slope of 10.4 demonstrates a high degree of sensitivity; small performance variations on the LRS3 benchmark are amplified around tenfold on the MV2LRS3 set. We can compare this to established gener… view at source ↗

**Figure 3.** Figure 3: This expansion allows us to observe whether the per [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: visualises the distribution of the Vshare and Vdiff sets against the Zipfian frequency curve [30] of the LRS3 training vocabulary. Note that 883 out of 930 words in Vshare appear in LRS3 training vocabulary, and 1361 out of 1547 words in Vdiff appear in LRS3 training vocabulary. As illustrated, Vshare is dominated by highly frequent words, whereas Vdiff is heavily skewed towards the long tail. This indicat… view at source ↗

**Figure 3.** Figure 3: Distribution of all 7 factors on the LRS3 Test set, 10x set, Difficult set and Easy set. Numbers in the bracket denotes number of samples in the set. 6. Impact of the Vocabulary Vocabulary divergence is a critical factor influencing AVSR performance, and it is important to isolate its specific impact. Evaluating this is complex because the SoTA models rely on vastly different training corpora. While some… view at source ↗

read the original abstract

Current Audio-Visual Speech Recognition (AVSR) models achieve near-perfect performance on the standard LRS3 benchmark, raising concerns of adaptive overfitting. To systematically assess true generalisability, we construct a highly controlled, unseen evaluation set subsampled from the massive MultiVSR dataset. Unlike standard out-of-distribution benchmarks, our subset strictly matches the acoustic, visual, and demographic distributions of the LRS3 test set. Evaluating five state-of-the-art architectures reveals a universal performance collapse, proving that current systems fail to generalise even under strictly aligned conditions. Through a fine-grained attribute analysis across seven factors, we isolate the specific drivers of this degradation. Furthermore, we uncover a profound lexical bias, expose distinct error patterns, and surprisingly reveal that audio-visual performance even lags behind audio-only settings. We release our matched test set for future benchmarking.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The matched MultiVSR subset is the key new piece, but the abstract gives no matching procedure or numbers, so the collapse claim can't be checked yet.

read the letter

The paper's main move is pulling a subset from MultiVSR that they claim matches LRS3 on acoustics, visuals, and demographics, then showing five SOTA AVSR models drop hard on it. They also report AV underperforming audio-only and some lexical bias. Releasing the set is the concrete positive step.

What stands out as new is the construction of that strictly aligned unseen set and the fine-grained breakdown across seven attributes. If the matching holds and the numbers are real, it directly challenges how much LRS3 results can be trusted for generalization claims.

The soft spot is right where the stress-test flagged: the abstract asserts strict matching but describes none of the procedure, the feature space, tolerance, or verification stats like KL or moments. Without that, any performance drop could come from residual mismatches rather than overfitting. No quantitative results, error bars, or significance tests appear in the abstract either, which leaves the "universal collapse" and "proving" language unsupported so far.

This is for AVSR benchmark and generalization researchers who care about LRS3 limitations. A reader who wants to test their own models on a new controlled set would get value from the release, even if the current claims need more evidence.

It deserves peer review once the full methods and results are in, because the question is worth asking and the set could be useful if the matching is reproducible.

Referee Report

1 major / 1 minor

Summary. The paper claims that current AVSR models overfit to the LRS3 benchmark. To test true generalisability, the authors subsample a test set from MultiVSR that strictly matches LRS3 in acoustic, visual, and demographic distributions. Evaluation of five state-of-the-art architectures on this set shows a universal performance collapse, which they take as proof of generalisation failure even under matched conditions. Additional contributions include a seven-factor attribute analysis of degradation drivers, identification of lexical bias and distinct error patterns, the observation that AV performance can lag audio-only, and release of the matched test set.

Significance. If the distribution matching is shown to be rigorous and the performance drops are supported by quantitative results and statistical tests, the work would be significant for highlighting potential overfitting in AVSR and supplying a new controlled benchmark. The dataset release would further increase its value for the community.

major comments (1)

[§3 (Dataset Construction)] §3 (Dataset Construction): The central claim that current systems 'fail to generalise even under strictly aligned conditions' rests on the MultiVSR subsample having identical acoustic, visual, and demographic distributions to the LRS3 test set. The manuscript asserts that the subset 'strictly matches' these distributions but supplies no description of the matching procedure, the embeddings or feature space used, tolerance thresholds, or any verification (e.g., KL divergence, moment matching, or statistical tests). Without this, residual un-matched shifts remain a plausible alternative explanation for the observed collapse.

minor comments (1)

[Abstract] Abstract: The abstract states that the work 'uncover[s] a profound lexical bias' and 'expose[s] distinct error patterns' but provides no quantitative measures or examples, reducing the ability to assess these claims at a glance.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their detailed review and for highlighting the need for greater transparency in our dataset construction. We address the single major comment below and will revise the manuscript to incorporate the requested details.

read point-by-point responses

Referee: §3 (Dataset Construction): The central claim that current systems 'fail to generalise even under strictly aligned conditions' rests on the MultiVSR subsample having identical acoustic, visual, and demographic distributions to the LRS3 test set. The manuscript asserts that the subset 'strictly matches' these distributions but supplies no description of the matching procedure, the embeddings or feature space used, tolerance thresholds, or any verification (e.g., KL divergence, moment matching, or statistical tests). Without this, residual un-matched shifts remain a plausible alternative explanation for the observed collapse.

Authors: We agree that the current manuscript lacks sufficient detail on the distribution-matching procedure, which is necessary to substantiate the claim of strict alignment. In the revised version we will expand §3 with: (i) the exact feature extractors (Wav2Vec 2.0 for audio, ArcFace for visual, and a demographic classifier for age/gender/ethnicity), (ii) the matching algorithm (iterative nearest-neighbour sampling in the joint embedding space with explicit tolerance thresholds on each modality), and (iii) quantitative verification (KL divergence, Wasserstein distance, and two-sample Kolmogorov-Smirnov tests on all marginals, plus a demographic balance table). These additions will allow readers to evaluate residual shift directly. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical benchmark study

full rationale

This paper is a purely empirical evaluation: it subsamples MultiVSR to create a test set claimed to match LRS3 distributions, then reports performance drops on five AVSR models. No derivation chain, equations, fitted parameters renamed as predictions, or self-citations that bear the central claim exist. The distribution-matching step is an assumption whose verification details are absent, but that is a methodological gap, not a reduction of any result to its own inputs by construction. The study is self-contained against external benchmarks and receives the default non-circular finding.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the validity of the distribution-matching procedure and the assumption that performance differences reflect lack of generalisability rather than dataset artifacts.

free parameters (1)

Distribution matching criteria
The specific rules or thresholds used to subsample MultiVSR to align with LRS3 distributions are chosen by the authors.

axioms (1)

domain assumption The MultiVSR dataset contains sufficient samples to allow strict matching of distributions across acoustic, visual, and demographic factors.
Invoked when constructing the subset from MultiVSR.

pith-pipeline@v0.9.1-grok · 5678 in / 1237 out tokens · 28202 ms · 2026-06-27T21:02:54.946878+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

50 extracted references · 2 canonical work pages · 2 internal anchors

[1]

Introduction For decades, Audio-Visual Speech Recognition (A VSR) re- search has sought to enhance conventional audio-only speech recognition by exploiting visual cues from the speaker’s lip movements [1, 2]. In recent years, the A VSR field has expe- rienced a rapid architectural evolution, advancing from super- vised end-to-end networks [3, 4, 5] and se...
[2]

Constructing a Matched Test Set Prior efforts in the broader computer vision field [11] and VSR
[3]

Assessing True Generalisability of Audio-Visual Speech Recognisers

provide valuable inspiration for assessing the true gener- alisation of machine learning models. However, these founda- tional studies mainly analyse distribution shifts and the result- ing performance degradation from a broad, statistical machine learning perspective. In the highly complex domain of audio- visual speech, a purely statistical approach is ...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[4]

Evaluated Models We select five state-of-the-art models representing the rapid ar- chitectural evolution of A VSR over recent years. These encom- 2https://github.com/yakhyo/uniface 0 2 4 6 0.0 0.2 0.4Density Duration (s) 20 40 60 80 0.00 0.01 0.02 0.03Density Speaker age 0 50 100 0.00 0.01 0.02 0.03Density Audio SNR (dB) 0.0 2.5 5.0 7.5 0.0 0.2 0.4Density...
[5]

Table 1 details their performance and relative rankings on both LRS3 benchmark and our MV2LRS3 set

MV2LRS3 Set Evaluation Having constructed a subset strictly matched to the LRS3 distri- bution across all seven factors, we evaluate the WER of the five chosen models. Table 1 details their performance and relative rankings on both LRS3 benchmark and our MV2LRS3 set. 4.1. Overall Performance Our evaluation reveals a severe degradation in WER across all So...
[6]

We observe that the fundamental degradation in performance persists across all models. While the absolute magnitude of this drop is slightly smaller on the 10x set, the performance ranking of the models on the 10x set is still identical to the ranking ob- served on the MV2LRS3 set. This consistent ranking across a substantially larger test set confirms th...
[7]

Leave-one-out Attribute Analysis During the construction of our controlled MV2LRS3 set, we identified 7 distinct attributes with the potential to influence A VSR performance

Isolating the Impact of Each Factor 5.1. Leave-one-out Attribute Analysis During the construction of our controlled MV2LRS3 set, we identified 7 distinct attributes with the potential to influence A VSR performance. Guided by established research concern- ing demographic and acoustic bias in the ASR field [17, 16], we hypothesise that specific factors may...
[8]

Eval- uating this is complex because the SoTA models rely on vastly different training corpora

Impact of the Vocabulary V ocabulary divergence is a critical factor influencing A VSR per- formance, and it is important to isolate its specific impact. Eval- uating this is complex because the SoTA models rely on vastly different training corpora. While some are fine-tuned exclu- sively on LRS3, others incorporate pseudo-labels from datasets like A VSpe...
[9]

Therefore, we do not treatV diff as out-of-vocabulary

or massive external pre-training datasets. Therefore, we do not treatV diff as out-of-vocabulary. However, the words inV diff are rarer than words inV share within the LRS3 training vocabu- lary, potentially making them harder for the recognition. Also, it serves as a critical measure of out-of-benchmark vocabulary: words that fall outside the highly expe...
[10]

We evaluated the models on the MV2LRS3 set using Audio-Visual (A V), Audio-Only (AO), and Video- Only (VO) inputs

Dive into the Modalities To understand how these architectures process distinct streams of information, we tested their performance across different modality settings. We evaluated the models on the MV2LRS3 set using Audio-Visual (A V), Audio-Only (AO), and Video- Only (VO) inputs. Table 6 reveals an unexpected trend: for several top-tier foundation model...
[11]

As shown in Table 7, the error profiles on the LRS3 Test set are extremely low and well-balanced across all three categories

Dive into the Errors We here dive deep into the specific error: Substitutions (Sub), Deletions (Del), and Insertions (Ins) errors. As shown in Table 7, the error profiles on the LRS3 Test set are extremely low and well-balanced across all three categories. However, the tran- sition to the MV2LRS3 set reveals completely different error behaviours among the...
[12]

Does A VSR generalise beyond LRS3? The primary focus of this paper is the generalisation capabil- ity of current A VSR systems beyond the LRS3 benchmark

Discussion 9.1. Does A VSR generalise beyond LRS3? The primary focus of this paper is the generalisation capabil- ity of current A VSR systems beyond the LRS3 benchmark. Our empirical results reveal that performance on the controlled MV2LRS3 set universally collapses across all models, exposing a massive gap compared to the LRS3 Test set. Consequently, cu...
[13]

To achieve this, we constructed the MV2LRS3 set: a controlled LRS3-like test set that strictly aligns with the demographic, acoustic and visual distributions of the LRS3 Test set

Conclusion This paper systematically assesses the true generalisability of state-of-the-art A VSR systems beyond the standard LRS3 benchmark. To achieve this, we constructed the MV2LRS3 set: a controlled LRS3-like test set that strictly aligns with the demographic, acoustic and visual distributions of the LRS3 Test set. By evaluating five leading A VSR sy...
[14]

Ethical Disclaimer Demographic metadata for this dataset was generated via auto- mated extraction tools and represents algorithmic inference, not self-reported identity. We acknowledge the limitations of these off-the-shelf models, including potential algorithmic bias, un- even accuracy across demographics, and the reduction of com- plex traits into rigid...
[15]

Generative AI Use Disclosure Generative AI tools were used to improve the paper’s grammar and clarity, as well as to assist in writing the code for the plots
[16]

18/CRT/6224

Acknowledgements This work was conducted with the financial support of the Research Ireland Centre for Research Training in Digitally- Enhanced Reality (d-real) under Grant No. 18/CRT/6224
[17]

Re- cent advances in the automatic recognition of audiovisual speech,

G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. Senior, “Re- cent advances in the automatic recognition of audiovisual speech,” Proceedings of the IEEE, vol. 91, no. 9, pp. 1306–1326, 2003

2003
[18]

Deep audio-visual speech recognition,

T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisser- man, “Deep audio-visual speech recognition,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 12, pp. 8717–8727, 2022

2022
[19]

Attention-based audio-visual fusion for robust automatic speech recognition,

G. Sterpu, C. Saam, and N. Harte, “Attention-based audio-visual fusion for robust automatic speech recognition,” inProceedings of the 20th ACM International conference on Multimodal Interac- tion, 2018, pp. 111–115

2018
[20]

End-to-end audiovisual speech recognition,

S. Petridis, T. Stafylakis, P. Ma, F. Cai, G. Tzimiropoulos, and M. Pantic, “End-to-end audiovisual speech recognition,” in2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 6548–6552

2018
[21]

Auto-avsr: Audio-visual speech recognition with automatic labels,

P. Ma, A. Haliassos, A. Fernandez-Lopez, H. Chen, S. Petridis, and M. Pantic, “Auto-avsr: Audio-visual speech recognition with automatic labels,” inICASSP 2023-2023 IEEE International Con- ference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

2023
[22]

Learning audio-visual speech representation by masked multimodal cluster prediction,

B. Shi, W.-N. Hsu, K. Lakhotia, and A. Mohamed, “Learning audio-visual speech representation by masked multimodal cluster prediction,” inInternational Conference on Learning Representa- tions, 2022

2022
[23]

Unified speech recognition: A single model for au- ditory, visual, and audiovisual inputs,

A. Haliassos, R. Mira, H. Chen, Z. Landgraf, S. Petridis, and M. Pantic, “Unified speech recognition: A single model for au- ditory, visual, and audiovisual inputs,”Advances in Neural Infor- mation Processing Systems, vol. 37, pp. 139 673–139 699, 2024

2024
[24]

Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation,

A. Rouditchenko, Y . Gong, S. Thomas, L. Karlinsky, H. Kuehne, R. Feris, and J. Glass, “Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation,” inInterspeech 2024, 2024, pp. 2420–2424

2024
[25]

Large language models are strong audio-visual speech recognition learners,

U. Cappellazzo, M. Kim, H. Chen, P. Ma, S. Petridis, D. Falavi- gna, A. Brutti, and M. Pantic, “Large language models are strong audio-visual speech recognition learners,” inICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5

2025
[26]

LRS3-TED: a large-scale dataset for visual speech recognition

T. Afouras, J. S. Chung, and A. Zisserman, “Lrs3-ted: a large-scale dataset for visual speech recognition,”arXiv preprint arXiv:1809.00496, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[27]

Do ImageNet classifiers generalize to ImageNet?

B. Recht, R. Roelofs, L. Schmidt, and V . Shankar, “Do ImageNet classifiers generalize to ImageNet?” inProceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 97. PMLR, 09–15 Jun 2019, pp. 5389–5400. [Online]. Available: https://proceedings.mlr.press/v97/recht19a.html

2019
[28]

Do vsr models generalize beyond lrs3?

Y . A. D. Djilali, S. Narayan, E. LeBihan, H. Boussaid, E. Al- mazrouei, and M. Debbah, “Do vsr models generalize beyond lrs3?” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 6635–6644

2024
[29]

Scaling multilingual visual speech recognition,

K. R. Prajwal, S. Hegde, and A. Zisserman, “Scaling multilingual visual speech recognition,” inICASSP 2025 - 2025 IEEE Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5

2025
[30]

Looking to listen at the cock- tail party: a speaker-independent audio-visual model for speech separation,

A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, “Looking to listen at the cock- tail party: a speaker-independent audio-visual model for speech separation,”ACM Transactions on Graphics (TOG), vol. 37, no. 4, pp. 1–11, 2018

2018
[31]

WhisperX: Time- Accurate Speech Transcription of Long-Form Audio,

M. Bain, J. Huh, T. Han, and A. Zisserman, “WhisperX: Time- Accurate Speech Transcription of Long-Form Audio,” inInter- speech 2023, 2023, pp. 4489–4493

2023
[32]

To- wards inclusive automatic speech recognition,

S. Feng, B. M. Halpern, O. Kudina, and O. Scharenborg, “To- wards inclusive automatic speech recognition,”Computer Speech & Language, vol. 84, p. 101567, 2024

2024
[33]

Towards measuring fairness in speech recognition: Casual conversations dataset transcriptions,

C. Liu, M. Picheny, L. Sarı, P. Chitkara, A. Xiao, X. Zhang, M. Chou, A. Alvarado, C. Hazirbas, and Y . Saraf, “Towards measuring fairness in speech recognition: Casual conversations dataset transcriptions,” inICASSP 2022-2022 IEEE Interna- tional Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6162–6166

2022
[34]

An overview of noise-robust automatic speech recognition,

J. Li, L. Deng, Y . Gong, and R. Haeb-Umbach, “An overview of noise-robust automatic speech recognition,”IEEE/ACM Transac- tions on Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 745–777, 2014

2014
[35]

Performance evaluation of slam-asr: The good, the bad, the ugly, and the way forward,

S. Kumar, I. Thorbecke, S. Burdisso, E. Villatoro-Tello, M. KE, K. Hacio ˘glu, P. Rangappa, P. Motlicek, A. Ganapathiraju, and A. Stolcke, “Performance evaluation of slam-asr: The good, the bad, the ugly, and the way forward,” in2025 IEEE International Conference on Acoustics, Speech, and Signal Processing Work- shops (ICASSPW). IEEE, 2025, pp. 1–5

2025
[36]

Gimeno G ´omez,Contributions to Automatic Lipreading for Spanish

D. Gimeno G ´omez,Contributions to Automatic Lipreading for Spanish. Universitat Polit `ecnica de Val`encia, 2025

2025
[37]

Accuracy comparison across face recognition algorithms: Where are we on measuring race bias?

J. G. Cavazos, P. J. Phillips, C. D. Castillo, and A. J. O’Toole, “Accuracy comparison across face recognition algorithms: Where are we on measuring race bias?”IEEE transactions on biometrics, behavior, and identity science, vol. 3, no. 1, pp. 101–111, 2020

2020
[38]

Classification algorithm for skin color (casco): A new tool to measure skin color in social science research,

R. A. Rej ´on Pi ˜na and C. Ma, “Classification algorithm for skin color (casco): A new tool to measure skin color in social science research,”Social Science Quarterly, vol. 104, no. 2, pp. 168–179, 2023

2023
[39]

Monk skin tone scale,

E. Monk, “Monk skin tone scale,” 2019. [Online]. Available: https://skintone.google

2019
[40]

6d rota- tion representation for unconstrained head pose estimation,

T. Hempel, A. A. Abdelrahman, and A. Al-Hamadi, “6d rota- tion representation for unconstrained head pose estimation,” in 2022 IEEE International Conference on Image Processing (ICIP), 2022, pp. 2496–2500

2022
[41]

Lrw-1000: A naturally-distributed large- scale benchmark for lip reading in the wild,

S. Yang, Y . Zhang, D. Feng, M. Yang, C. Wang, J. Xiao, K. Long, S. Shan, and X. Chen, “Lrw-1000: A naturally-distributed large- scale benchmark for lip reading in the wild,” in2019 14th IEEE International Conference on Automatic Face & Gesture Recogni- tion (FG 2019), 2019, pp. 1–8

2019
[42]

Robust signal-to-noise ratio estima- tion based on waveform amplitude distribution analysis,

C. Kim and R. M. Stern, “Robust signal-to-noise ratio estima- tion based on waveform amplitude distribution analysis,” inInter- speech 2008, 2008, pp. 2598–2601

2008
[43]

K-nearest neighbour classifiers- a tutorial,

P. Cunningham and S. J. Delany, “K-nearest neighbour classifiers- a tutorial,”ACM computing surveys (CSUR), vol. 54, no. 6, pp. 1–25, 2021

2021
[44]

V oxCeleb2: Deep Speaker Recognition,

J. S. Chung, A. Nagrani, and A. Zisserman, “V oxCeleb2: Deep Speaker Recognition,” inInterspeech 2018, 2018, pp. 1086–1090

2018
[45]

Robust speech recognition via large-scale weak supervision,

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” inInternational conference on machine learning. PMLR, 2023, pp. 28 492–28 518

2023
[46]

G. K. Zipf,The psycho-biology of language: An introduction to dynamic philology. Routledge, 2013

2013
[47]

Which words are hard to recognize? prosodic, lexical, and disfluency factors that increase speech recognition error rates,

S. Goldwater, D. Jurafsky, and C. D. Manning, “Which words are hard to recognize? prosodic, lexical, and disfluency factors that increase speech recognition error rates,”Speech Communication, vol. 52, no. 3, pp. 181–200, 2010

2010
[48]

What’s so complex about conversational speech? a comparison of hmm- based and transformer-based asr architectures,

J. Linke, B. C. Geiger, G. Kubin, and B. Schuppler, “What’s so complex about conversational speech? a comparison of hmm- based and transformer-based asr architectures,”Computer Speech & Language, vol. 90, p. 101738, 2025

2025
[49]

Uncovering the visual contribution in audio- visual speech recognition,

Z. Lin and N. Harte, “Uncovering the visual contribution in audio- visual speech recognition,” inICASSP 2025 - 2025 IEEE Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5

2025
[50]

Cocktail-Party Audio-Visual Speech Recognition,

T.-B. Nguyen, N.-Q. Pham, and A. Waibel, “Cocktail-Party Audio-Visual Speech Recognition,” inInterspeech 2025, 2025, pp. 1828–1832

2025

[1] [1]

Introduction For decades, Audio-Visual Speech Recognition (A VSR) re- search has sought to enhance conventional audio-only speech recognition by exploiting visual cues from the speaker’s lip movements [1, 2]. In recent years, the A VSR field has expe- rienced a rapid architectural evolution, advancing from super- vised end-to-end networks [3, 4, 5] and se...

[2] [2]

Constructing a Matched Test Set Prior efforts in the broader computer vision field [11] and VSR

[3] [3]

Assessing True Generalisability of Audio-Visual Speech Recognisers

provide valuable inspiration for assessing the true gener- alisation of machine learning models. However, these founda- tional studies mainly analyse distribution shifts and the result- ing performance degradation from a broad, statistical machine learning perspective. In the highly complex domain of audio- visual speech, a purely statistical approach is ...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[4] [4]

Evaluated Models We select five state-of-the-art models representing the rapid ar- chitectural evolution of A VSR over recent years. These encom- 2https://github.com/yakhyo/uniface 0 2 4 6 0.0 0.2 0.4Density Duration (s) 20 40 60 80 0.00 0.01 0.02 0.03Density Speaker age 0 50 100 0.00 0.01 0.02 0.03Density Audio SNR (dB) 0.0 2.5 5.0 7.5 0.0 0.2 0.4Density...

[5] [5]

Table 1 details their performance and relative rankings on both LRS3 benchmark and our MV2LRS3 set

MV2LRS3 Set Evaluation Having constructed a subset strictly matched to the LRS3 distri- bution across all seven factors, we evaluate the WER of the five chosen models. Table 1 details their performance and relative rankings on both LRS3 benchmark and our MV2LRS3 set. 4.1. Overall Performance Our evaluation reveals a severe degradation in WER across all So...

[6] [6]

We observe that the fundamental degradation in performance persists across all models. While the absolute magnitude of this drop is slightly smaller on the 10x set, the performance ranking of the models on the 10x set is still identical to the ranking ob- served on the MV2LRS3 set. This consistent ranking across a substantially larger test set confirms th...

[7] [7]

Leave-one-out Attribute Analysis During the construction of our controlled MV2LRS3 set, we identified 7 distinct attributes with the potential to influence A VSR performance

Isolating the Impact of Each Factor 5.1. Leave-one-out Attribute Analysis During the construction of our controlled MV2LRS3 set, we identified 7 distinct attributes with the potential to influence A VSR performance. Guided by established research concern- ing demographic and acoustic bias in the ASR field [17, 16], we hypothesise that specific factors may...

[8] [8]

Eval- uating this is complex because the SoTA models rely on vastly different training corpora

Impact of the Vocabulary V ocabulary divergence is a critical factor influencing A VSR per- formance, and it is important to isolate its specific impact. Eval- uating this is complex because the SoTA models rely on vastly different training corpora. While some are fine-tuned exclu- sively on LRS3, others incorporate pseudo-labels from datasets like A VSpe...

[9] [9]

Therefore, we do not treatV diff as out-of-vocabulary

or massive external pre-training datasets. Therefore, we do not treatV diff as out-of-vocabulary. However, the words inV diff are rarer than words inV share within the LRS3 training vocabu- lary, potentially making them harder for the recognition. Also, it serves as a critical measure of out-of-benchmark vocabulary: words that fall outside the highly expe...

[10] [10]

We evaluated the models on the MV2LRS3 set using Audio-Visual (A V), Audio-Only (AO), and Video- Only (VO) inputs

Dive into the Modalities To understand how these architectures process distinct streams of information, we tested their performance across different modality settings. We evaluated the models on the MV2LRS3 set using Audio-Visual (A V), Audio-Only (AO), and Video- Only (VO) inputs. Table 6 reveals an unexpected trend: for several top-tier foundation model...

[11] [11]

As shown in Table 7, the error profiles on the LRS3 Test set are extremely low and well-balanced across all three categories

Dive into the Errors We here dive deep into the specific error: Substitutions (Sub), Deletions (Del), and Insertions (Ins) errors. As shown in Table 7, the error profiles on the LRS3 Test set are extremely low and well-balanced across all three categories. However, the tran- sition to the MV2LRS3 set reveals completely different error behaviours among the...

[12] [12]

Does A VSR generalise beyond LRS3? The primary focus of this paper is the generalisation capabil- ity of current A VSR systems beyond the LRS3 benchmark

Discussion 9.1. Does A VSR generalise beyond LRS3? The primary focus of this paper is the generalisation capabil- ity of current A VSR systems beyond the LRS3 benchmark. Our empirical results reveal that performance on the controlled MV2LRS3 set universally collapses across all models, exposing a massive gap compared to the LRS3 Test set. Consequently, cu...

[13] [13]

To achieve this, we constructed the MV2LRS3 set: a controlled LRS3-like test set that strictly aligns with the demographic, acoustic and visual distributions of the LRS3 Test set

Conclusion This paper systematically assesses the true generalisability of state-of-the-art A VSR systems beyond the standard LRS3 benchmark. To achieve this, we constructed the MV2LRS3 set: a controlled LRS3-like test set that strictly aligns with the demographic, acoustic and visual distributions of the LRS3 Test set. By evaluating five leading A VSR sy...

[14] [14]

Ethical Disclaimer Demographic metadata for this dataset was generated via auto- mated extraction tools and represents algorithmic inference, not self-reported identity. We acknowledge the limitations of these off-the-shelf models, including potential algorithmic bias, un- even accuracy across demographics, and the reduction of com- plex traits into rigid...

[15] [15]

Generative AI Use Disclosure Generative AI tools were used to improve the paper’s grammar and clarity, as well as to assist in writing the code for the plots

[16] [16]

18/CRT/6224

Acknowledgements This work was conducted with the financial support of the Research Ireland Centre for Research Training in Digitally- Enhanced Reality (d-real) under Grant No. 18/CRT/6224

[17] [17]

Re- cent advances in the automatic recognition of audiovisual speech,

G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. Senior, “Re- cent advances in the automatic recognition of audiovisual speech,” Proceedings of the IEEE, vol. 91, no. 9, pp. 1306–1326, 2003

2003

[18] [18]

Deep audio-visual speech recognition,

T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisser- man, “Deep audio-visual speech recognition,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 12, pp. 8717–8727, 2022

2022

[19] [19]

Attention-based audio-visual fusion for robust automatic speech recognition,

G. Sterpu, C. Saam, and N. Harte, “Attention-based audio-visual fusion for robust automatic speech recognition,” inProceedings of the 20th ACM International conference on Multimodal Interac- tion, 2018, pp. 111–115

2018

[20] [20]

End-to-end audiovisual speech recognition,

S. Petridis, T. Stafylakis, P. Ma, F. Cai, G. Tzimiropoulos, and M. Pantic, “End-to-end audiovisual speech recognition,” in2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 6548–6552

2018

[21] [21]

Auto-avsr: Audio-visual speech recognition with automatic labels,

P. Ma, A. Haliassos, A. Fernandez-Lopez, H. Chen, S. Petridis, and M. Pantic, “Auto-avsr: Audio-visual speech recognition with automatic labels,” inICASSP 2023-2023 IEEE International Con- ference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

2023

[22] [22]

Learning audio-visual speech representation by masked multimodal cluster prediction,

B. Shi, W.-N. Hsu, K. Lakhotia, and A. Mohamed, “Learning audio-visual speech representation by masked multimodal cluster prediction,” inInternational Conference on Learning Representa- tions, 2022

2022

[23] [23]

Unified speech recognition: A single model for au- ditory, visual, and audiovisual inputs,

A. Haliassos, R. Mira, H. Chen, Z. Landgraf, S. Petridis, and M. Pantic, “Unified speech recognition: A single model for au- ditory, visual, and audiovisual inputs,”Advances in Neural Infor- mation Processing Systems, vol. 37, pp. 139 673–139 699, 2024

2024

[24] [24]

Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation,

A. Rouditchenko, Y . Gong, S. Thomas, L. Karlinsky, H. Kuehne, R. Feris, and J. Glass, “Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation,” inInterspeech 2024, 2024, pp. 2420–2424

2024

[25] [25]

Large language models are strong audio-visual speech recognition learners,

U. Cappellazzo, M. Kim, H. Chen, P. Ma, S. Petridis, D. Falavi- gna, A. Brutti, and M. Pantic, “Large language models are strong audio-visual speech recognition learners,” inICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5

2025

[26] [26]

LRS3-TED: a large-scale dataset for visual speech recognition

T. Afouras, J. S. Chung, and A. Zisserman, “Lrs3-ted: a large-scale dataset for visual speech recognition,”arXiv preprint arXiv:1809.00496, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[27] [27]

Do ImageNet classifiers generalize to ImageNet?

B. Recht, R. Roelofs, L. Schmidt, and V . Shankar, “Do ImageNet classifiers generalize to ImageNet?” inProceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 97. PMLR, 09–15 Jun 2019, pp. 5389–5400. [Online]. Available: https://proceedings.mlr.press/v97/recht19a.html

2019

[28] [28]

Do vsr models generalize beyond lrs3?

Y . A. D. Djilali, S. Narayan, E. LeBihan, H. Boussaid, E. Al- mazrouei, and M. Debbah, “Do vsr models generalize beyond lrs3?” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 6635–6644

2024

[29] [29]

Scaling multilingual visual speech recognition,

K. R. Prajwal, S. Hegde, and A. Zisserman, “Scaling multilingual visual speech recognition,” inICASSP 2025 - 2025 IEEE Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5

2025

[30] [30]

Looking to listen at the cock- tail party: a speaker-independent audio-visual model for speech separation,

A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, “Looking to listen at the cock- tail party: a speaker-independent audio-visual model for speech separation,”ACM Transactions on Graphics (TOG), vol. 37, no. 4, pp. 1–11, 2018

2018

[31] [31]

WhisperX: Time- Accurate Speech Transcription of Long-Form Audio,

M. Bain, J. Huh, T. Han, and A. Zisserman, “WhisperX: Time- Accurate Speech Transcription of Long-Form Audio,” inInter- speech 2023, 2023, pp. 4489–4493

2023

[32] [32]

To- wards inclusive automatic speech recognition,

S. Feng, B. M. Halpern, O. Kudina, and O. Scharenborg, “To- wards inclusive automatic speech recognition,”Computer Speech & Language, vol. 84, p. 101567, 2024

2024

[33] [33]

Towards measuring fairness in speech recognition: Casual conversations dataset transcriptions,

C. Liu, M. Picheny, L. Sarı, P. Chitkara, A. Xiao, X. Zhang, M. Chou, A. Alvarado, C. Hazirbas, and Y . Saraf, “Towards measuring fairness in speech recognition: Casual conversations dataset transcriptions,” inICASSP 2022-2022 IEEE Interna- tional Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6162–6166

2022

[34] [34]

An overview of noise-robust automatic speech recognition,

J. Li, L. Deng, Y . Gong, and R. Haeb-Umbach, “An overview of noise-robust automatic speech recognition,”IEEE/ACM Transac- tions on Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 745–777, 2014

2014

[35] [35]

Performance evaluation of slam-asr: The good, the bad, the ugly, and the way forward,

S. Kumar, I. Thorbecke, S. Burdisso, E. Villatoro-Tello, M. KE, K. Hacio ˘glu, P. Rangappa, P. Motlicek, A. Ganapathiraju, and A. Stolcke, “Performance evaluation of slam-asr: The good, the bad, the ugly, and the way forward,” in2025 IEEE International Conference on Acoustics, Speech, and Signal Processing Work- shops (ICASSPW). IEEE, 2025, pp. 1–5

2025

[36] [36]

Gimeno G ´omez,Contributions to Automatic Lipreading for Spanish

D. Gimeno G ´omez,Contributions to Automatic Lipreading for Spanish. Universitat Polit `ecnica de Val`encia, 2025

2025

[37] [37]

Accuracy comparison across face recognition algorithms: Where are we on measuring race bias?

J. G. Cavazos, P. J. Phillips, C. D. Castillo, and A. J. O’Toole, “Accuracy comparison across face recognition algorithms: Where are we on measuring race bias?”IEEE transactions on biometrics, behavior, and identity science, vol. 3, no. 1, pp. 101–111, 2020

2020

[38] [38]

Classification algorithm for skin color (casco): A new tool to measure skin color in social science research,

R. A. Rej ´on Pi ˜na and C. Ma, “Classification algorithm for skin color (casco): A new tool to measure skin color in social science research,”Social Science Quarterly, vol. 104, no. 2, pp. 168–179, 2023

2023

[39] [39]

Monk skin tone scale,

E. Monk, “Monk skin tone scale,” 2019. [Online]. Available: https://skintone.google

2019

[40] [40]

6d rota- tion representation for unconstrained head pose estimation,

T. Hempel, A. A. Abdelrahman, and A. Al-Hamadi, “6d rota- tion representation for unconstrained head pose estimation,” in 2022 IEEE International Conference on Image Processing (ICIP), 2022, pp. 2496–2500

2022

[41] [41]

Lrw-1000: A naturally-distributed large- scale benchmark for lip reading in the wild,

S. Yang, Y . Zhang, D. Feng, M. Yang, C. Wang, J. Xiao, K. Long, S. Shan, and X. Chen, “Lrw-1000: A naturally-distributed large- scale benchmark for lip reading in the wild,” in2019 14th IEEE International Conference on Automatic Face & Gesture Recogni- tion (FG 2019), 2019, pp. 1–8

2019

[42] [42]

Robust signal-to-noise ratio estima- tion based on waveform amplitude distribution analysis,

C. Kim and R. M. Stern, “Robust signal-to-noise ratio estima- tion based on waveform amplitude distribution analysis,” inInter- speech 2008, 2008, pp. 2598–2601

2008

[43] [43]

K-nearest neighbour classifiers- a tutorial,

P. Cunningham and S. J. Delany, “K-nearest neighbour classifiers- a tutorial,”ACM computing surveys (CSUR), vol. 54, no. 6, pp. 1–25, 2021

2021

[44] [44]

V oxCeleb2: Deep Speaker Recognition,

J. S. Chung, A. Nagrani, and A. Zisserman, “V oxCeleb2: Deep Speaker Recognition,” inInterspeech 2018, 2018, pp. 1086–1090

2018

[45] [45]

Robust speech recognition via large-scale weak supervision,

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” inInternational conference on machine learning. PMLR, 2023, pp. 28 492–28 518

2023

[46] [46]

G. K. Zipf,The psycho-biology of language: An introduction to dynamic philology. Routledge, 2013

2013

[47] [47]

Which words are hard to recognize? prosodic, lexical, and disfluency factors that increase speech recognition error rates,

S. Goldwater, D. Jurafsky, and C. D. Manning, “Which words are hard to recognize? prosodic, lexical, and disfluency factors that increase speech recognition error rates,”Speech Communication, vol. 52, no. 3, pp. 181–200, 2010

2010

[48] [48]

What’s so complex about conversational speech? a comparison of hmm- based and transformer-based asr architectures,

J. Linke, B. C. Geiger, G. Kubin, and B. Schuppler, “What’s so complex about conversational speech? a comparison of hmm- based and transformer-based asr architectures,”Computer Speech & Language, vol. 90, p. 101738, 2025

2025

[49] [49]

Uncovering the visual contribution in audio- visual speech recognition,

Z. Lin and N. Harte, “Uncovering the visual contribution in audio- visual speech recognition,” inICASSP 2025 - 2025 IEEE Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5

2025

[50] [50]

Cocktail-Party Audio-Visual Speech Recognition,

T.-B. Nguyen, N.-Q. Pham, and A. Waibel, “Cocktail-Party Audio-Visual Speech Recognition,” inInterspeech 2025, 2025, pp. 1828–1832

2025