MuseTalk: Real-time high-fidelity video dubbing via spatio-temporal sampling

Yue Zhang, Zhizhou Zhong, Minhao Liu, Zhaokang Chen, Bin Wu, Yubin Zeng, Chao Zhan, Yingjie He, Junxin Huang, Wenjiang Zhou · 2025 · arXiv 2410.10122

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

read on arXiv browse 8 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization

cs.CV · 2026-06-09 · conditional · novelty 8.0

Lip Forcing distills a 14B bidirectional video diffusion teacher into autoregressive students that achieve real-time lip synchronization at 31 FPS using two denoising steps without CFG.

Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation

cs.CV · 2026-04-26 · unverdicted · novelty 7.0

Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x higher throughput.

FluentAvatar: Flicker-Free Talking-Head Animation via Phoneme-Guided Autoregressive Modeling

cs.CV · 2025-09-15 · unverdicted · novelty 7.0

Phoneme-guided autoregressive framework for talking-head animation that reduces inter-frame flicker via causal keyframe generation and timestamp-aware interpolation, outperforming diffusion baselines on FVD and a new BG-Flicker metric.

EAD-Net: Emotion-Aware Talking Head Generation with Spatial Refinement and Temporal Coherence

cs.CV · 2026-04-25 · unverdicted · novelty 6.0

EAD-Net uses a diffusion model with new spatio-temporal attention, graph-based temporal reasoning, and LLM-derived semantic descriptions to generate emotionally expressive talking head videos with improved lip-sync and coherence over prior methods.

KM-Speaker: Keypoint-Based Style Control for High-Quality Speech-Driven 3D Facial Animation and Dialogue Localization

cs.CV · 2026-06-26 · unverdicted · novelty 5.0

KM-Speaker introduces a keypoint-conditioned generative framework for speech-driven 3D facial animation offering global style guidance and frame-level temporal control via disentangled lip and upper-face dynamics.

MindFlow: Harmonizing Cognitive Semantics and Acoustic Dynamics for Facial Animation Generation in Dyadic Conversations

cs.CV · 2026-06-26 · unverdicted · novelty 5.0

MindFlow presents a neuroscience-inspired dual-stream generative model that uses chunk-state emotional modeling and conditional flow matching to produce facial animations with improved semantic fit and motion realism in dyadic conversations.

HighSync: High-Quality Lip Synchronization via Latent Diffusion Models

cs.CV · 2026-05-16 · unverdicted · novelty 5.0

HighSync is a diffusion-based lip synchronization system that operates natively at 512x512 resolution by eliminating data leakage to enforce genuine audio dependence and reports state-of-the-art results on quality and sync metrics.

Fre-Res: Frequency-Residual Video Token Compression for Efficient Video MLLMs

cs.CV · 2026-05-10 · unverdicted · novelty 5.0

Fre-Res compresses video tokens by preserving spatial anchors and representing temporal dynamics with low-frequency residual tokens derived from 1D-DCT on inter-frame residuals, plus a Spatial-Guided Absorber to reinject the information.

citing papers explorer

Showing 8 of 8 citing papers.

Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization cs.CV · 2026-06-09 · conditional · none · ref 51
Lip Forcing distills a 14B bidirectional video diffusion teacher into autoregressive students that achieve real-time lip synchronization at 31 FPS using two denoising steps without CFG.
Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation cs.CV · 2026-04-26 · unverdicted · none · ref 52
Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x higher throughput.
FluentAvatar: Flicker-Free Talking-Head Animation via Phoneme-Guided Autoregressive Modeling cs.CV · 2025-09-15 · unverdicted · none · ref 29
Phoneme-guided autoregressive framework for talking-head animation that reduces inter-frame flicker via causal keyframe generation and timestamp-aware interpolation, outperforming diffusion baselines on FVD and a new BG-Flicker metric.
EAD-Net: Emotion-Aware Talking Head Generation with Spatial Refinement and Temporal Coherence cs.CV · 2026-04-25 · unverdicted · none · ref 55
EAD-Net uses a diffusion model with new spatio-temporal attention, graph-based temporal reasoning, and LLM-derived semantic descriptions to generate emotionally expressive talking head videos with improved lip-sync and coherence over prior methods.
KM-Speaker: Keypoint-Based Style Control for High-Quality Speech-Driven 3D Facial Animation and Dialogue Localization cs.CV · 2026-06-26 · unverdicted · none · ref 39
KM-Speaker introduces a keypoint-conditioned generative framework for speech-driven 3D facial animation offering global style guidance and frame-level temporal control via disentangled lip and upper-face dynamics.
MindFlow: Harmonizing Cognitive Semantics and Acoustic Dynamics for Facial Animation Generation in Dyadic Conversations cs.CV · 2026-06-26 · unverdicted · none · ref 47
MindFlow presents a neuroscience-inspired dual-stream generative model that uses chunk-state emotional modeling and conditional flow matching to produce facial animations with improved semantic fit and motion realism in dyadic conversations.
HighSync: High-Quality Lip Synchronization via Latent Diffusion Models cs.CV · 2026-05-16 · unverdicted · none · ref 5
HighSync is a diffusion-based lip synchronization system that operates natively at 512x512 resolution by eliminating data leakage to enforce genuine audio dependence and reports state-of-the-art results on quality and sync metrics.
Fre-Res: Frequency-Residual Video Token Compression for Efficient Video MLLMs cs.CV · 2026-05-10 · unverdicted · none · ref 33
Fre-Res compresses video tokens by preserving spatial anchors and representing temporal dynamics with low-frequency residual tokens derived from 1D-DCT on inter-frame residuals, plus a Spatial-Guided Absorber to reinject the information.

MuseTalk: Real-time high-fidelity video dubbing via spatio-temporal sampling

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer