MuseTalk: Real-time high-fidelity video dubbing via spatio-temporal sampling
8 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.CV (8)
Roles: background (1)
Polarities: background (1)
Representative citing papers
Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x higher throughput.
Phoneme-guided autoregressive framework for talking-head animation that reduces inter-frame flicker via causal keyframe generation and timestamp-aware interpolation, outperforming diffusion baselines on FVD and a new BG-Flicker metric.
EAD-Net uses a diffusion model with new spatio-temporal attention, graph-based temporal reasoning, and LLM-derived semantic descriptions to generate emotionally expressive talking head videos with improved lip-sync and coherence over prior methods.
KM-Speaker introduces a keypoint-conditioned generative framework for speech-driven 3D facial animation offering global style guidance and frame-level temporal control via disentangled lip and upper-face dynamics.
MindFlow presents a neuroscience-inspired dual-stream generative model that uses chunk-state emotional modeling and conditional flow matching to produce facial animations with improved semantic fit and motion realism in dyadic conversations.
HighSync is a diffusion-based lip synchronization system that operates natively at 512x512 resolution by eliminating data leakage to enforce genuine audio dependence and reports state-of-the-art results on quality and sync metrics.
Fre-Res compresses video tokens by preserving spatial anchors and representing temporal dynamics with low-frequency residual tokens derived from 1D-DCT on inter-frame residuals, plus a Spatial-Guided Absorber to reinject the information.
citing papers explorer
- Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization
Lip Forcing distills a 14B bidirectional video diffusion teacher into autoregressive students that achieve real-time lip synchronization at 31 FPS using two denoising steps without CFG.