MuseTalk: Real-time high-fidelity video dubbing via spatio-temporal sampling
4 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CV 4
verdicts
UNVERDICTED 4
roles
background 1
polarities
background 1
representative citing papers
Phoneme-guided autoregressive framework for talking-head animation that reduces inter-frame flicker via causal keyframe generation and timestamp-aware interpolation, outperforming diffusion baselines on FVD and a new BG-Flicker metric.
EAD-Net uses a diffusion model with new spatio-temporal attention, graph-based temporal reasoning, and LLM-derived semantic descriptions to generate emotionally expressive talking head videos with improved lip-sync and coherence over prior methods.
HighSync is a diffusion-based lip synchronization system that operates natively at 512x512 resolution by eliminating data leakage to enforce genuine audio dependence and reports state-of-the-art results on quality and sync metrics.
citing papers explorer
- Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation
Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x higher throughput.
- FluentAvatar: Flicker-Free Talking-Head Animation via Phoneme-Guided Autoregressive Modeling
Phoneme-guided autoregressive framework for talking-head animation that reduces inter-frame flicker via causal keyframe generation and timestamp-aware interpolation, outperforming diffusion baselines on FVD and a new BG-Flicker metric.
- EAD-Net: Emotion-Aware Talking Head Generation with Spatial Refinement and Temporal Coherence
EAD-Net uses a diffusion model with new spatio-temporal attention, graph-based temporal reasoning, and LLM-derived semantic descriptions to generate emotionally expressive talking head videos with improved lip-sync and coherence over prior methods.
- HighSync: High-Quality Lip Synchronization via Latent Diffusion Models
HighSync is a diffusion-based lip synchronization system that operates natively at 512x512 resolution by eliminating data leakage to enforce genuine audio dependence and reports state-of-the-art results on quality and sync metrics.