LRS3-TED: a large-scale dataset for visual speech recognition

12 Pith papers cite this work.

abstract

This paper introduces a new multi-modal dataset for visual and audio-visual speech recognition. It includes face tracks from over 400 hours of TED and TEDx videos, along with the corresponding subtitles and word alignment boundaries. The new dataset is substantially larger in scale compared to other public datasets that are available for general research.

representative citing papers

Hierarchical Codec Diffusion for Video-to-Speech Generation

cs.SD · 2026-04-17 · unverdicted · novelty 7.0

HiCoDiT generates speech from video by conditioning low-level RVQ tokens on speaker identity and high-level tokens on facial expressions via a dual-scale normalized diffusion transformer.

CoSyncDiT: Cognitive Synchronous Diffusion Transformer for Movie Dubbing

cs.SD · 2026-04-14 · unverdicted · novelty 7.0

CoSyncDiT is a cognitive-inspired diffusion transformer that achieves state-of-the-art lip synchronization and naturalness in movie dubbing by guiding noise-to-speech generation through acoustic, visual, and contextual stages plus joint regularization.

HighSync: High-Quality Lip Synchronization via Latent Diffusion Models

cs.CV · 2026-05-16 · unverdicted · novelty 5.0

HighSync is a diffusion-based lip synchronization system that operates natively at 512x512 resolution by eliminating data leakage to enforce genuine audio dependence and reports state-of-the-art results on quality and sync metrics.

BUT System Description for CHiME-9 MCoRec Challenge

eess.AS · 2026-04-30 · unverdicted · novelty 3.0

BUT's CHiME-9 MCoRec system conditions Parakeet-v2 ASR on AV-HuBERT visuals for 33.7% WER and uses Qwen3.5 LLM for hierarchical clustering to reach 0.97 F1, beating the baseline by 16.2% WER and 0.15 F1 on the development set.
