LRS3-TED: a large-scale dataset for visual speech recognition

· 2018 · cs.CV · arXiv 1809.00496

12 Pith papers cite this work. Polarity classification is still indexing.

abstract

This paper introduces a new multi-modal dataset for visual and audio-visual speech recognition. It includes face tracks from over 400 hours of TED and TEDx videos, along with the corresponding subtitles and word alignment boundaries. The new dataset is substantially larger in scale compared to other public datasets that are available for general research.

representative citing papers

Hierarchical Codec Diffusion for Video-to-Speech Generation

cs.SD · 2026-04-17 · unverdicted · novelty 7.0

HiCoDiT generates speech from video by conditioning low-level RVQ tokens on speaker identity and high-level tokens on facial expressions via a dual-scale normalized diffusion transformer.

CoSyncDiT: Cognitive Synchronous Diffusion Transformer for Movie Dubbing

cs.SD · 2026-04-14 · unverdicted · novelty 7.0

CoSyncDiT is a cognitive-inspired diffusion transformer that achieves state-of-the-art lip synchronization and naturalness in movie dubbing by guiding noise-to-speech generation through acoustic, visual, and contextual stages plus joint regularization.

OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text

cs.SD · 2026-04-06 · unverdicted · novelty 7.0

OmniSonic introduces a TriAttn-DiT architecture with MoE gating to jointly generate on-screen, off-screen, and speech audio from video and text, outperforming prior models on a new UniHAGen-Bench.

Mechanisms of Multimodal Synchronization: Insights from Decoder-Based Video-Text-to-Speech Synthesis

cs.MM · 2024-11-26 · unverdicted · novelty 6.0

Experiments with a video-text-to-speech transformer show co-temporal positional indexing enables synchronization without timestamps, text and video supply complementary signals, and modality ordering creates a trade-off between in-domain accuracy and cross-domain generalization.

Orthogonal Subspace Decomposition for Generalizable AI-Generated Image Detection

cs.CV · 2024-11-23 · unverdicted · novelty 6.0

Orthogonal subspace decomposition via SVD on vision foundation model features preserves high-rank pre-trained knowledge by freezing principal components and adapting residuals, reducing overfitting for better generalization in AI-generated image detection.

My lips are concealed: Audio-visual speech enhancement through obstructions

cs.CV · 2019-07-11 · unverdicted · novelty 6.0

A deep audio-visual speech enhancement network separates a speaker's voice using lip movements and voice representations, with self-enrollment to handle visual occlusions.

LRS-VoxMM: A benchmark for in-the-wild audio-visual speech recognition

eess.AS · 2026-04-30 · unverdicted · novelty 6.0

LRS-VoxMM is a new in-the-wild AVSR benchmark that is harder than LRS3 and demonstrates increasing value of visual information under acoustic degradation.

Tracking Listener Attention: Gaze-Guided Audio-Visual Speech Enhancement Framework

eess.AS · 2026-04-09 · unverdicted · novelty 6.0

The GG-AVSE framework uses listener gaze direction combined with YOLO5Face and AVSEMamba to resolve target-speaker ambiguity in audio-visual speech enhancement, yielding gains in PESQ, STOI, and SI-SDR.

HighSync: High-Quality Lip Synchronization via Latent Diffusion Models

cs.CV · 2026-05-16 · unverdicted · novelty 5.0

HighSync is a diffusion-based lip synchronization system that operates natively at 512x512 resolution by eliminating data leakage to enforce genuine audio dependence and reports state-of-the-art results on quality and sync metrics.

Interpreting the Role of Visemes in Audio-Visual Speech Recognition

eess.AS · 2025-09-19 · unverdicted · novelty 5.0

Interpretability analysis of AV-HuBERT reveals visual-driven clustering of visemes that audio refines, especially for ambiguous cases.

Delayed Commitment for Representation Readiness in Stage-wise Audio-Visual Learning

cs.SD · 2026-05-03 · unverdicted · novelty 5.0

DPC-Net improves stage-wise audio-visual learning by correcting readiness deficiencies in fused representations using cross-layer and cross-modal evidence.

BUT System Description for CHiME-9 MCoRec Challenge

eess.AS · 2026-04-30 · unverdicted · novelty 3.0

BUT's CHiME-9 MCoRec system conditions Parakeet-v2 ASR on AV-HuBERT visuals for 33.7% WER and uses Qwen3.5 LLM for hierarchical clustering to reach 0.97 F1, beating the baseline by 16.2% WER and 0.15 F1 on the development set.

citing papers explorer

Showing 12 of 12 citing papers.

Hierarchical Codec Diffusion for Video-to-Speech Generation
cs.SD · 2026-04-17 · unverdicted · none · ref 1

CoSyncDiT: Cognitive Synchronous Diffusion Transformer for Movie Dubbing
cs.SD · 2026-04-14 · unverdicted · none · ref 1

OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text
cs.SD · 2026-04-06 · unverdicted · none · ref 1

Mechanisms of Multimodal Synchronization: Insights from Decoder-Based Video-Text-to-Speech Synthesis
cs.MM · 2024-11-26 · unverdicted · none · ref 1 · internal anchor

Orthogonal Subspace Decomposition for Generalizable AI-Generated Image Detection
cs.CV · 2024-11-23 · unverdicted · none · ref 175 · internal anchor

My lips are concealed: Audio-visual speech enhancement through obstructions
cs.CV · 2019-07-11 · unverdicted · none · ref 27 · internal anchor

LRS-VoxMM: A benchmark for in-the-wild audio-visual speech recognition
eess.AS · 2026-04-30 · unverdicted · none · ref 10

Tracking Listener Attention: Gaze-Guided Audio-Visual Speech Enhancement Framework
eess.AS · 2026-04-09 · unverdicted · none · ref 30

HighSync: High-Quality Lip Synchronization via Latent Diffusion Models
cs.CV · 2026-05-16 · unverdicted · none · ref 20 · internal anchor

Interpreting the Role of Visemes in Audio-Visual Speech Recognition
eess.AS · 2025-09-19 · unverdicted · none · ref 38 · internal anchor

Delayed Commitment for Representation Readiness in Stage-wise Audio-Visual Learning
cs.SD · 2026-05-03 · unverdicted · none · ref 59

BUT System Description for CHiME-9 MCoRec Challenge
eess.AS · 2026-04-30 · unverdicted · none · ref 18
