FluentAvatar: Flicker-Free Talking-Head Animation via Phoneme-Guided Autoregressive Modeling

2025 · cs.CV · arXiv 2509.12052

1 Pith paper cites this work.


abstract

Talking-head generation has gradually shifted from GAN-based methods to diffusion-based paradigms, achieving remarkable progress in visual fidelity and temporal consistency. However, inter-frame flicker remains prevalent in existing diffusion-based methods. An important reason is that variation in the denoising trajectory, induced by stochastic initialization, leaves residual inter-frame inconsistencies, which manifest as short-term, abrupt visual fluctuations between adjacent frames. To verify this, we conduct a controlled study that fixes the input and varies only the random seed. The results show markedly different flicker patterns across samplings, with a mean inter-seed Pearson correlation of only r = 0.15. This motivates us to explore autoregressive generation, which models frames sequentially and provides a more direct prior for temporal continuity. Based on this, we propose FluentAvatar, a two-stage autoregressive framework built on phoneme representations. First, Facial Keyframe Generation produces phoneme-aligned keyframes under a Phoneme-Frame Causal Attention Mask. Second, Inter-frame Interpolation synthesizes transition frames via a timestamp-aware adaptive strategy built upon selective state space modeling. Moreover, we introduce BG-Flicker, a background-isolated metric for talking-head videos that enables more reliable evaluation of inter-frame flicker. Experiments on CMLR and HDTF demonstrate that FluentAvatar achieves strong performance in visual fidelity, lip synchronization, and temporal stability, attaining the best FVD on both datasets and BG-Flicker results close to ground truth. The code, the model, and the interface will be released to facilitate further research.
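The seed-variation study in the abstract can be illustrated with a minimal sketch. It assumes that flicker is summarized as one value per frame transition and that the reported figure is the average Pearson correlation over all pairs of seeds. The flicker proxy used here (mean absolute difference between adjacent frames) and the function names `flicker_signal` and `mean_inter_seed_pearson` are illustrative assumptions; they are not the paper's measurement and not the BG-Flicker metric.

```python
import numpy as np


def flicker_signal(frames):
    """Hypothetical flicker proxy: mean absolute difference between adjacent frames.

    frames: array of shape (T, H, W) or (T, H, W, C).
    Returns an array of length T - 1, one value per frame transition.
    """
    frames = np.asarray(frames, dtype=np.float64)
    diffs = np.abs(np.diff(frames, axis=0))
    return diffs.reshape(diffs.shape[0], -1).mean(axis=1)


def mean_inter_seed_pearson(signals):
    """Average Pearson correlation over all unordered pairs of seeds.

    signals: array of shape (S, L), one flicker signal per seed, with S >= 2.
    """
    signals = np.asarray(signals, dtype=np.float64)
    corr = np.corrcoef(signals)                     # (S, S) correlation matrix
    upper = np.triu_indices(corr.shape[0], k=1)     # each seed pair once
    return float(corr[upper].mean())


if __name__ == "__main__":
    # Stand-in for videos generated from the same input under different seeds.
    videos = [np.random.default_rng(seed).random((16, 8, 8)) for seed in range(4)]
    signals = np.stack([flicker_signal(v) for v in videos])
    print(mean_inter_seed_pearson(signals))
```

Under these assumptions, a value near 1 would mean that flicker occurs at the same frame transitions regardless of the seed, while a value near 0 would mean that the flicker pattern is largely determined by the stochastic initialization.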

representative citing papers

AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation

cs.LG · 2026-05-01 · unverdicted · novelty 6.0 · 2 refs

AsymTalker uses temporal reference encoding and asymmetric knowledge distillation to produce identity-consistent talking head videos up to 600 seconds long at 66 FPS.

