FluentAvatar: Flicker-Free Talking-Head Animation via Phoneme-Guided Autoregressive Modeling
Abstract
Talking-head generation has gradually shifted from GAN-based methods to diffusion-based paradigms, achieving remarkable progress in visual fidelity and temporal consistency. However, inter-frame flicker remains prevalent in existing diffusion-based methods. One important cause is that variation in the denoising trajectory induced by stochastic initialization leaves residual inter-frame inconsistencies, which manifest as short-term, abrupt visual fluctuations between adjacent frames. To verify this, we conduct a controlled study that fixes the input while varying only the random seed. The results show markedly different flicker patterns across samplings, with a mean inter-seed Pearson correlation of only r = 0.15. This motivates us to explore autoregressive generation, which models frames sequentially and provides a more direct prior for temporal continuity. Building on this, we propose FluentAvatar, a two-stage autoregressive framework built on phoneme representations. In the first stage, Facial Keyframe Generation produces phoneme-aligned keyframes under a Phoneme-Frame Causal Attention Mask; in the second, Inter-frame Interpolation synthesizes transition frames via a timestamp-aware adaptive strategy built on selective state space modeling. We further introduce BG-Flicker, a background-isolated metric for talking-head videos that enables more reliable evaluation of inter-frame flicker. Experiments on CMLR and HDTF demonstrate that FluentAvatar achieves strong performance in visual fidelity, lip synchronization, and temporal stability, attaining the best FVD on both datasets and BG-Flicker scores close to ground truth. The code, model, and interface will be released to facilitate further research.
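The seed study and the BG-Flicker metric are both simple enough to sketch. The snippet below is a minimal illustration, not the authors' released implementation: the exact flicker signal, the BG-Flicker formula, and the function names (flicker_signal, bg_flicker, inter_seed_correlation) are all assumptions. It measures flicker as the mean absolute difference between consecutive frames, and assumes per-frame foreground masks are available so the background can be isolated.

```python
# Hypothetical sketch of the two measurements described in the abstract;
# the concrete definitions are assumptions, not taken from the paper.
import numpy as np

def flicker_signal(frames, mask=None):
    """Per-pair inter-frame fluctuation for a video.

    frames: (T, H, W, C) float array in [0, 1].
    mask:   optional (T, H, W) boolean array; True marks pixels to keep
            (e.g. the background region for a BG-Flicker-style score).
    Returns a length T-1 array of mean absolute differences between
    consecutive frames, restricted to the masked region if given.
    """
    diffs = np.abs(frames[1:] - frames[:-1]).mean(axis=-1)  # (T-1, H, W)
    if mask is None:
        return diffs.reshape(len(diffs), -1).mean(axis=1)
    keep = mask[1:] & mask[:-1]  # pixel must be kept in both frames
    return np.array([
        d[k].mean() if k.any() else 0.0 for d, k in zip(diffs, keep)
    ])

def bg_flicker(frames, fg_mask):
    """Background-isolated flicker: average fluctuation outside the
    head/face region, so lip and head motion do not contaminate it."""
    return float(flicker_signal(frames, mask=~fg_mask).mean())

def inter_seed_correlation(frames_a, frames_b):
    """Pearson correlation between the flicker time series of two videos
    generated from the same input under different random seeds."""
    sa, sb = flicker_signal(frames_a), flicker_signal(frames_b)
    return float(np.corrcoef(sa, sb)[0, 1])
```

Averaging inter_seed_correlation over all seed pairs for a fixed input would yield the kind of mean inter-seed r the abstract reports (0.15 in their study), assuming their flicker signal is defined in a comparable way.

The abstract also names a Phoneme-Frame Causal Attention Mask without defining its structure. One plausible reading, sketched below under that assumption, is a mask over a [phoneme tokens | frame tokens] sequence in which phonemes attend causally to earlier phonemes, and each frame attends to its aligned phoneme, all earlier phonemes, and earlier frames. The helper name and the frame_to_phoneme alignment input are hypothetical.

```python
import numpy as np

def phoneme_frame_causal_mask(n_phonemes, n_frames, frame_to_phoneme):
    """Boolean attention mask (True = may attend) over a token sequence
    laid out as [phoneme tokens | frame tokens].

    frame_to_phoneme[i] is the index of the phoneme aligned with frame i.
    """
    n = n_phonemes + n_frames
    mask = np.zeros((n, n), dtype=bool)
    # Phonemes attend causally among themselves (diagonal included).
    mask[:n_phonemes, :n_phonemes] = np.tril(
        np.ones((n_phonemes, n_phonemes), dtype=bool))
    for i in range(n_frames):
        q = n_phonemes + i
        mask[q, :frame_to_phoneme[i] + 1] = True  # aligned + past phonemes
        mask[q, n_phonemes:q + 1] = True          # self + earlier frames
    return mask
```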
Forward citations
Cited by 1 Pith paper
-
AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
AsymTalker uses temporal reference encoding and asymmetric knowledge distillation to produce identity-consistent talking head videos up to 600 seconds long at 66 FPS.