Interpreting the Role of Visemes in Audio-Visual Speech Recognition

Aristeidis Papadopoulos; Naomi Harte

arxiv: 2509.16023 · v1 · submitted 2025-09-19 · 📡 eess.AS

Pith reviewed 2026-05-18 15:46 UTC · model grok-4.3

classification 📡 eess.AS

keywords audio-visual speech recognition · visemes · interpretability · t-SNE · probing · AV-HuBERT · feature clustering · multimodal representations

The pith

Visual cues drive natural clustering of speech features in AVSR models, with audio refining representations for ambiguous visemes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how visemes are encoded inside AV-HuBERT, a leading audio-visual speech recognition model. t-SNE visualizations of the model's learned features show that the features cluster primarily according to visual information, and that adding audio further organizes and sharpens those clusters. Linear probing experiments indicate that audio contributes most to disambiguating visemes that look similar on the lips or appear infrequently in training data. These observations may help explain the performance gain that audio-visual models show over audio-only ones.

Core claim

t-SNE visualizations of AV-HuBERT features reveal natural clustering driven by visual cues, which is further refined by the presence of audio. Probing shows that audio contributes to refining feature representations particularly for visemes that are visually ambiguous or under-represented.

What carries the argument

t-SNE embeddings combined with linear probing classifiers applied to AV-HuBERT multimodal features to separate visual-driven clustering from audio-driven refinement of viseme representations.
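As a rough illustration of the probing half of this machinery, the sketch below trains a linear probe (multinomial logistic regression) on fixed feature vectors and reports held-out accuracy. It is not the authors' implementation: the Gaussian blobs stand in for frozen model features, and the function names, class centres, spreads, and hyperparameters are all assumptions made for this example.

```python
import math
import random


def make_features(centres, spread, n_per_class, rng):
    """Draw toy 'frame features': one Gaussian blob per hypothetical viseme class."""
    X, y = [], []
    for label, centre in enumerate(centres):
        for _ in range(n_per_class):
            X.append([rng.gauss(c, spread) for c in centre])
            y.append(label)
    return X, y


def scores(W, b, x):
    """Linear class scores W x + b; this is the entire probe."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def train_linear_probe(X, y, n_classes, epochs=100, lr=0.1):
    """Fit multinomial logistic regression by batch gradient descent on fixed features."""
    dim, n = len(X[0]), len(X)
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        grad_W = [[0.0] * dim for _ in range(n_classes)]
        grad_b = [0.0] * n_classes
        for x, label in zip(X, y):
            s = scores(W, b, x)
            top = max(s)
            exps = [math.exp(v - top) for v in s]  # numerically stable softmax
            total = sum(exps)
            for k in range(n_classes):
                err = exps[k] / total - (1.0 if k == label else 0.0)
                grad_b[k] += err
                for j in range(dim):
                    grad_W[k][j] += err * x[j]
        for k in range(n_classes):
            b[k] -= lr * grad_b[k] / n
            for j in range(dim):
                W[k][j] -= lr * grad_W[k][j] / n
    return W, b


def probe_accuracy(W, b, X, y):
    """Fraction of items whose highest-scoring class matches the label."""
    correct = 0
    for x, label in zip(X, y):
        s = scores(W, b, x)
        correct += s.index(max(s)) == label
    return correct / len(y)


def run_condition(spread, seed=0):
    """Train on one toy sample and evaluate on a fresh sample drawn the same way."""
    rng = random.Random(seed)
    centres = [(3.0, 0.0), (-1.5, 2.6), (-1.5, -2.6)]  # three made-up classes
    X_train, y_train = make_features(centres, spread, 100, rng)
    X_test, y_test = make_features(centres, spread, 100, rng)
    W, b = train_linear_probe(X_train, y_train, n_classes=3)
    return probe_accuracy(W, b, X_test, y_test)


if __name__ == "__main__":
    # Overlapping vs. tight blobs: stand-ins for less and more separable features.
    print("overlapping:", run_condition(spread=3.0))
    print("tight:", run_condition(spread=0.5))
```

Because the probe has no hidden layers, a higher accuracy can only come from the features being more linearly separable, which is why probe accuracy is usually read as a property of the representation rather than of the classifier.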

If this is right

The visual modality supplies the dominant initial structure for grouping speech units inside the model.
Audio input acts mainly as a disambiguator for visemes that share similar lip shapes or are rare in the data.
AVSR performance gains arise from this staged division of labor rather than uniform fusion of the two streams.
Targeted improvements to visual feature quality could strengthen the primary clusters before audio is added.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same visual-first pattern may appear in other multimodal speech models and could be tested by applying identical visualizations across architectures.
Training regimes that emphasize visual examples of ambiguous visemes might reduce reliance on audio and improve robustness in noisy conditions.
The findings suggest possible parallels with human speech perception, where lip reading provides coarse categories that sound resolves.
Lip-reading systems could adopt similar staged clustering to handle cases where visual input alone is insufficient.

Load-bearing premise

t-SNE visualizations and linear probes accurately expose the separate causal roles of visual and audio inputs without major distortion from the reduction method or the probe architecture.

What would settle it

If AV-HuBERT features extracted from audio-only or visual-only inputs produce t-SNE clusters and probe accuracies that match the full audio-visual case for ambiguous visemes, the claim that audio provides specific refinement would not hold.
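One way to make this test concrete is to compare per-viseme F1 between a probe fed full audio-visual features and a probe fed ablated features. The sketch below assumes the probe predictions already exist as label lists. The labels 'F' and 'ER' are taken from the Figure 5 caption; the label 'P', the function names, and every prediction in the usage example are invented toy values, not results from the paper.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """F1 per label, computed as 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted but wrong
            fn[t] += 1  # t was the truth but missed
    f1 = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1[label] = 2 * tp[label] / denom if denom else 0.0
    return f1


def f1_gap(y_true, pred_full, pred_ablated):
    """Per-class F1 of the full-input probe minus that of the ablated-input probe."""
    full = per_class_f1(y_true, pred_full)
    ablated = per_class_f1(y_true, pred_ablated)
    return {
        label: full.get(label, 0.0) - ablated.get(label, 0.0)
        for label in set(y_true)
    }


if __name__ == "__main__":
    # Toy labels and predictions, invented for illustration only.
    y_true = ["F", "F", "ER", "ER", "P", "P"]
    pred_av = ["F", "F", "ER", "ER", "P", "P"]
    pred_video = ["F", "F", "P", "ER", "P", "P"]
    for label, gap in sorted(f1_gap(y_true, pred_av, pred_video).items()):
        print(label, round(gap, 3))
```

Under the paper's claim, the gaps should be largest for the visually ambiguous or rare classes; gaps near zero across the board would count against it.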

Figures

Figures reproduced from arXiv: 2509.16023 by Aristeidis Papadopoulos, Naomi Harte.

**Figure 2.** t-SNE visualization of video only features from Layer 11, with visemes indicated by colour and phonemes distinguished by marker shape.

**Figure 3.** t-SNE visualization of clean AV features from Layer 11, with visemes indicated by color and phonemes distinguished by marker shape.

**Figure 4.** Viseme Classification Accuracy on LRS3 Test set for each input.

**Figure 5.** F1 Scores from probing for visemes 'F' and 'ER'.

Original abstract

Audio-Visual Speech Recognition (AVSR) models have surpassed their audio-only counterparts in terms of performance. However, the interpretability of AVSR systems, particularly the role of the visual modality, remains under-explored. In this paper, we apply several interpretability techniques to examine how visemes are encoded in AV-HuBERT, a state-of-the-art AVSR model. First, we use t-distributed Stochastic Neighbour Embedding (t-SNE) to visualize learned features, revealing natural clustering driven by visual cues, which is further refined by the presence of audio. Then, we employ probing to show how audio contributes to refining feature representations, particularly for visemes that are visually ambiguous or under-represented. Our findings shed light on the interplay between modalities in AVSR and could point to new strategies for leveraging visual information to improve AVSR performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript applies interpretability methods to the AV-HuBERT audio-visual speech recognition model to examine viseme encoding. It reports that t-SNE visualizations of learned features exhibit natural clustering driven primarily by visual cues, with audio input providing further refinement, and that linear probing indicates audio contributes to disambiguating representations for visually ambiguous or under-represented visemes.

Significance. If the methodological concerns are addressed, the work offers useful observations on modality contributions in a leading AVSR model and grounds the analysis in the linguistically relevant unit of visemes. This could inform targeted improvements in visual leverage for AVSR. The application of standard tools (t-SNE and probing) to an existing high-performing model is a positive aspect, though the observational nature and lack of quantitative validation limit the strength of the causal interpretations.

major comments (2)

[§4] §4 (t-SNE visualizations): The central claim that visualizations reveal 'natural clustering driven by visual cues, which is further refined by the presence of audio' lacks supporting quantitative metrics such as adjusted Rand index or normalized mutual information against viseme labels, and no robustness checks (multiple perplexity values, UMAP comparison, or audio-ablated embeddings) are reported. This is load-bearing for the interpretation because t-SNE is known to produce spurious clusters sensitive to hyperparameters and initialization.
[§5] §5 (probing experiments): The conclusion that 'audio contributes to refining feature representations particularly for visemes that are visually ambiguous or under-represented' relies on linear probes without reported controls for probe architecture (e.g., comparison to non-linear probes or random baselines), modality ablations, or statistical significance testing across viseme classes. This weakens the specific attribution of refinement effects to audio.
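For reference, the two agreement scores named in the first major comment can be computed directly from a contingency table. The sketch below is a plain-Python version with illustrative function names; it assumes cluster assignments are already available and says nothing about how the paper's clusters would be obtained. The NMI here is normalised by the arithmetic mean of the two entropies, which is one of several conventions.

```python
import math
from collections import Counter


def _pairs(n):
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2


def adjusted_rand_index(labels_true, labels_pred):
    """Chance-corrected pair-counting agreement between two labelings."""
    n = len(labels_true)
    if n < 2:
        return 1.0
    cells = Counter(zip(labels_true, labels_pred))
    index = sum(_pairs(c) for c in cells.values())
    row_pairs = sum(_pairs(c) for c in Counter(labels_true).values())
    col_pairs = sum(_pairs(c) for c in Counter(labels_pred).values())
    expected = row_pairs * col_pairs / _pairs(n)
    max_index = 0.5 * (row_pairs + col_pairs)
    if max_index == expected:  # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (index - expected) / (max_index - expected)


def _entropy(counts, n):
    """Shannon entropy (nats) of a labeling given its class counts."""
    return -sum((c / n) * math.log(c / n) for c in counts)


def normalized_mutual_info(labels_true, labels_pred):
    """Mutual information divided by the mean of the two label entropies."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    mi = sum(
        (c / n) * math.log(c * n / (rows[t] * cols[p]))
        for (t, p), c in cells.items()
    )
    mean_entropy = 0.5 * (_entropy(rows.values(), n) + _entropy(cols.values(), n))
    return mi / mean_entropy if mean_entropy > 0 else 1.0


if __name__ == "__main__":
    # Toy example: cluster ids are arbitrary, so a pure relabeling scores 1.
    visemes = ["P", "P", "F", "F"]
    clusters = [1, 1, 0, 0]
    print(adjusted_rand_index(visemes, clusters))
    print(normalized_mutual_info(visemes, clusters))
```

Both scores ignore how clusters are named, so they measure whether items with the same viseme label land in the same cluster, not whether the cluster ids match the labels.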

minor comments (2)

[Abstract] The abstract states that 'several interpretability techniques' are applied but only describes t-SNE and probing in detail; clarify whether additional methods were used and their results.
[Figures] Figure captions for t-SNE plots should explicitly state the color mapping (e.g., by viseme class or modality condition) and any preprocessing steps such as feature normalization.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments correctly identify opportunities to strengthen the quantitative support for our interpretability claims. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Referee: [§4] §4 (t-SNE visualizations): The central claim that visualizations reveal 'natural clustering driven by visual cues, which is further refined by the presence of audio' lacks supporting quantitative metrics such as adjusted Rand index or normalized mutual information against viseme labels, and no robustness checks (multiple perplexity values, UMAP comparison, or audio-ablated embeddings) are reported. This is load-bearing for the interpretation because t-SNE is known to produce spurious clusters sensitive to hyperparameters and initialization.

Authors: We agree that t-SNE is a qualitative tool and that quantitative validation and robustness checks would strengthen the clustering claims. In the revised manuscript we will add adjusted Rand index and normalized mutual information scores computed between the t-SNE-derived clusters and the ground-truth viseme labels. We will also report results across a range of perplexity values and include a side-by-side UMAP comparison. To directly demonstrate the refinement effect of audio, we will include t-SNE visualizations of the visual-only branch embeddings (i.e., audio-ablated) alongside the full audio-visual embeddings. These additions address the risk of spurious clusters while preserving the observational nature of the analysis.

revision: yes

Referee: [§5] §5 (probing experiments): The conclusion that 'audio contributes to refining feature representations particularly for visemes that are visually ambiguous or under-represented' relies on linear probes without reported controls for probe architecture (e.g., comparison to non-linear probes or random baselines), modality ablations, or statistical significance testing across viseme classes. This weakens the specific attribution of refinement effects to audio.

Authors: We acknowledge that additional controls would improve the rigor of the probing results. In revision we will report probe accuracies for both linear and non-linear (single-hidden-layer MLP) architectures, together with a random-feature baseline. We will also add explicit modality ablations by including audio-only probing results and will perform statistical significance testing (bootstrap resampling across multiple random seeds) to quantify the improvement for visually ambiguous and under-represented visemes. These controls will be presented in an expanded §5 while retaining the focus on linear probes for interpretability.

revision: yes
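The significance testing promised in the second response could take several forms. One common choice, shown here only as an assumed sketch, is a paired bootstrap over test items: resample the items with replacement and recompute the accuracy difference between two probes each time. The function name, the 0/1 correctness inputs, and the percentile interval are choices made for this example, not details from the paper or the rebuttal.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_boot=1000, alpha=0.05, seed=0):
    """Percentile interval for accuracy(a) - accuracy(b) over resampled test items.

    correct_a and correct_b hold 0/1 correctness flags for the same items, in the same order.
    """
    if not correct_a or len(correct_a) != len(correct_b):
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample items with replacement
        acc_a = sum(correct_a[i] for i in idx) / n
        acc_b = sum(correct_b[i] for i in idx) / n
        diffs.append(acc_a - acc_b)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


if __name__ == "__main__":
    # Toy correctness flags for an audio-visual probe and a video-only probe.
    av = [1, 1, 1, 0, 1, 1, 0, 1]
    video = [0, 1, 0, 0, 1, 0, 0, 1]
    print(paired_bootstrap_ci(av, video))
```

Pairing matters because both probes are scored on the same items; resampling the items jointly keeps that dependence, and an interval that excludes zero suggests the difference is not a sampling artifact.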

Circularity Check

0 steps flagged

No circularity: purely observational analysis of pre-trained model features

The paper applies standard interpretability methods (t-SNE visualization and linear probing) to features extracted from the existing AV-HuBERT model. No derivation chain, first-principles predictions, fitted parameters, or self-referential equations are present. The claims rest on direct application of these techniques to model outputs; no result reduces to the analysis inputs by construction, and none depends on a load-bearing self-citation. The work is self-contained as empirical observation rather than a closed predictive loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis rests on the untested premise that standard interpretability tools applied post-hoc accurately reflect the model's learned modality interactions. Beyond that single domain assumption, no free parameters or invented entities are introduced.

axioms (1)

domain assumption: t-SNE and probing classifiers can be used to infer causal contributions of input modalities to internal representations.
Invoked implicitly when interpreting clustering and probe accuracy as evidence of visual-audio interplay.

[1] [1]

Learning audio- visual speech representation by masked multimodal cluster prediction,

B. Shi, W.-N. Hsu, K. Lakhotia, and A. Mohamed, “Learning audio- visual speech representation by masked multimodal cluster prediction,” inInternational Conference on Learning Representations, 2022

work page 2022

[2] [2]

Auto-A VSR: Audio-visual speech recognition with auto- matic labels,

P. Ma, A. Haliassos, A. Fernandez-Lopez, H. Chen, S. Petridis, and M. Pantic, “Auto-A VSR: Audio-visual speech recognition with auto- matic labels,” inICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, p. 1–5

work page 2023

[3] [3]

Whisper-Flamingo: Integrating visual features into whisper for audio-visual speech recognition and translation,

A. Rouditchenko, Y . Gong, S. Thomas, L. Karlinsky, H. Kuehne, R. Feris, and J. Glass, “Whisper-Flamingo: Integrating visual features into whisper for audio-visual speech recognition and translation,” in Interspeech 2024, 2024, pp. 2420–2424

work page 2024

[4] [4]

XLA VS-R: Cross-lingual audio-visual speech representation learning for noise-robust speech perception,

H. Han, M. Anwar, J. Pino, W.-N. Hsu, M. Carpuat, B. Shi, and C. Wang, “XLA VS-R: Cross-lingual audio-visual speech representation learning for noise-robust speech perception,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2024, pp. 12 896–12 911

work page 2024

[5] [5]

VatLM: Visual-audio-text pre-training with unified masked prediction for speech representation learning,

Q. Zhu, L. Zhou, Z. Zhang, S. Liu, B. Jiao, J. Zhang, L. Dai, D. Jiang, J. Li, and F. Wei, “VatLM: Visual-audio-text pre-training with unified masked prediction for speech representation learning,”IEEE Transactions on Multimedia, vol. 26, p. 1055–1064, 2024

work page 2024

[6] [6]

Jointly learning visual and auditory speech representations from raw data,

A. Haliassos, P. Ma, R. Mira, S. Petridis, and M. Pantic, “Jointly learning visual and auditory speech representations from raw data,” in The Eleventh International Conference on Learning Representations, 2023

work page 2023

[7] [7]

Av-data2vec: Self- supervised learning of audio-visual speech representations with con- textualized target representations,

J. Lian, A. Baevski, W.-N. Hsu, and M. Auli, “Av-data2vec: Self- supervised learning of audio-visual speech representations with con- textualized target representations,” in2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023, pp. 1–8

work page 2023

[8] [8]

Conformer: Convolution- augmented transformer for speech recognition,

A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y . Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y . Wu, and R. Pang, “Conformer: Convolution- augmented transformer for speech recognition,” inInterspeech 2020, 2020, pp. 5036–5040

work page 2020

[9] [9]

wav2vec 2.0: a framework for self-supervised learning of speech representations,

A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: a framework for self-supervised learning of speech representations,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, ser. NIPS ’20. Red Hook, NY , USA: Curran Associates Inc., 2020

work page 2020

[10] [10]

Hubert: Self-supervised speech representation learning by masked prediction of hidden units,

W.-N. Hsu, B. Bolte, Y .-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,”IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 29, p. 3451–3460, 2021

work page 2021

[11] [11]

Robust speech recognition via large-scale weak supervi- sion,

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. Mcleavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervi- sion,” inProceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 202. PMLR, 23–29 Jul 2023, pp. 28 492–28 518

work page 2023

[12] [12]

Understanding self-attention of self-supervised audio transformers,

S. wen Yang, A. T. Liu, and H. yi Lee, “Understanding self-attention of self-supervised audio transformers,” inInterspeech 2020, 2020, pp. 3785–3789

work page 2020

[13] [13]

Probing Acoustic Representations for Phonetic Properties,

D. Ma, N. Ryant, and M. Liberman, “Probing Acoustic Representations for Phonetic Properties,” inICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Toronto, ON, Canada: IEEE, Jun. 2021, pp. 311–315

work page 2021

[14] [14]

Layer-Wise Analysis of a Self- Supervised Speech Representation Model,

A. Pasad, J.-C. Chou, and K. Livescu, “Layer-Wise Analysis of a Self- Supervised Speech Representation Model,” in2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). Cartagena, Colombia: IEEE, Dec. 2021, pp. 914–921

work page 2021

[15] Y. K. Singla, J. Shah, C. Chen, and R. R. Shah, “What do audio transformers hear? Probing their representations for language delivery & structure,” in 2022 IEEE International Conference on Data Mining Workshops (ICDMW), 2022, pp. 910–925.

[16] P. C. English, J. D. Kelleher, and J. Carson-Berndsen, “Domain-Informed Probing of wav2vec 2.0 Embeddings for Phonetic Features,” in Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology. Association for Computational Linguistics, 2022, pp. 83–91.

[17] M. de Seyssel, M. Lavechin, Y. Adi, E. Dupoux, and G. Wisniewski, “Probing phoneme, language and speaker information in unsupervised speech representations,” in Interspeech 2022, 2022, pp. 1402–1406.

[18] K. Shim, J. Choi, and W. Sung, “Understanding the role of self attention for efficient speech recognition,” in International Conference on Learning Representations, 2022.

[19] P. C. English, J. D. Kelleher, and J. Carson-Berndsen, “Discovering phonetic feature event patterns in transformer embeddings,” in INTERSPEECH 2023, 2023, pp. 4733–4737.

[20] A. Pasad, B. Shi, and K. Livescu, “Comparative Layer-Wise Analysis of Self-Supervised Speech Models,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.

[21] K. Choi, A. Pasad, T. Nakamura, S. Fukayama, K. Livescu, and S. Watanabe, “Self-supervised speech representations are more phonetic than semantic,” in Interspeech 2024, 2024, pp. 4578–4582.

[22] B. T. Ta, M. T. Le, N. M. Le, and V. H. Do, “Probing speech quality information in ASR systems,” in Interspeech 2023, 2023, pp. 541–545.

[23] C.-L. Feng, P.-c. Hsu, and H.-y. Lee, “Silence is sweeter than speech: Self-supervised model using silence to store speaker information,” 2022.

[24] A. Pasad, C.-M. Chien, S. Settle, and K. Livescu, “What do self-supervised speech models know about words?” Transactions of the Association for Computational Linguistics, vol. 12, pp. 372–391, 2024.

[25] G. Alain and Y. Bengio, “Understanding intermediate layers using linear classifier probes,” in International Conference on Learning Representations, 2017.

[26] P. C. English, J. D. Kelleher, and J. Carson-Berndsen, “Searching for structure: Appraising the organisation of speech features in wav2vec 2.0 embeddings,” in Interspeech 2024. ISCA, Sep. 2024, pp. 4613–4617.

[27] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. 86, pp. 2579–2605, 2008.

[28] C. G. Fisher, “Confusions among visually perceived consonants,” Journal of Speech and Hearing Research, vol. 11, no. 4, pp. 796–804, 1968.

[29] J. Jeffers and M. Barley, Speechreading (lipreading). Thomas, 1971.

[30] S. Lee and D. Yook, “Audio-to-visual conversion using hidden Markov models,” in PRICAI 2002: Trends in Artificial Intelligence. Berlin, Heidelberg: Springer, 2002, pp. 563–570.

[31] H. L. Bear and R. Harvey, “Decoding visemes: Improving machine lip-reading,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 2009–2013.

[32] E. Bozkurt, C. Erdem, E. Erzin, T. Erdem, and M. Özkan, “Comparison of phoneme and viseme based acoustic units for speech driven realistic lip animation,” in 2007 IEEE 15th Signal Processing and Communications Applications, Jul. 2007, pp. 1–4.

[33] T. Hazen, “Visual model structures and synchrony constraints for audio-visual speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 3, pp. 1082–1089, 2006.

[34] L. Cappelletta and N. Harte, “Phoneme-to-viseme mapping for visual speech recognition,” in International Conference on Pattern Recognition Applications and Methods, 2012.

[35] S. J. Oh, B. Schiele, and M. Fritz, Towards Reverse-Engineering Black-Box Neural Networks. Springer International Publishing, 2019, pp. 121–144.

[36] T. Stafylakis and G. Tzimiropoulos, “Combining residual networks with LSTMs for lipreading,” in Interspeech 2017, 2017, pp. 3652–3656.

[37] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS ’17, 2017, pp. 6000–6010.

[38] T. Afouras, J. S. Chung, and A. Zisserman, “LRS3-TED: a large-scale dataset for visual speech recognition,” CoRR, vol. abs/1809.00496, 2018.

[39] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, “Montreal Forced Aligner: Trainable text-speech alignment using Kaldi,” in Interspeech 2017, 2017, pp. 498–502.

[40] K. Gorman, J. Howell, and M. Wagner, “Prosodylab-aligner: A tool for forced alignment of laboratory speech,” Canadian Acoustics, vol. 39, no. 3, pp. 192–193, 2011.

[41] D. Snyder, G. Chen, and D. Povey, “MUSAN: A music, speech, and noise corpus,” CoRR, vol. abs/1510.08484, 2015.

[42] B. Shi, W.-N. Hsu, and A. Mohamed, “Robust self-supervised audio-visual speech recognition,” in Interspeech 2022, 2022, pp. 2118–2122.

[43] B. Shi, A. Mohamed, and W.-N. Hsu, “Learning lip-based audio-visual speaker embeddings with AV-HuBERT,” in Interspeech 2022, 2022, pp. 4785–4789.

[44] Z. Lin and N. Harte, “Uncovering the visual contribution in audio-visual speech recognition,” in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5.

[45] M. Wattenberg, F. Viégas, and I. Johnson, “How to use t-SNE effectively,” Distill, 2016. [Online]. Available: http://distill.pub/2016/misread-tsne

[46] S. Raschka, J. Patterson, and C. Nolet, “Machine learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence,” Information, vol. 11, no. 4, 2020.

[47] L. van der Maaten, “Barnes-Hut-SNE,” in International Conference on Learning Representations, 2013.

[48] C. C. Aggarwal, A. Hinneburg, and D. A. Keim, “On the surprising behavior of distance metrics in high dimensional space,” in Database Theory - ICDT 2001. Springer Berlin Heidelberg, 2001, pp. 420–434.

[49] J. Venna and S. Kaski, “Neighborhood preservation in nonlinear projection methods: An experimental study,” in Artificial Neural Networks - ICANN 2001. Springer Berlin Heidelberg, 2001, pp. 485–491.

[50] L. van der Maaten, “t-Distributed Stochastic Neighbor Embedding,” https://lvdmaaten.github.io/tsne/, 2008.

[51] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.