KSDiff: Keyframe-Augmented Speech-Aware Dual-Path Diffusion for Facial Animation

2025 · cs.GR · arXiv 2509.20128

1 Pith paper cites this work. Polarity classification is still being indexed.

abstract

Audio-driven facial animation has made significant progress in multimedia applications, with diffusion models showing strong potential for talking-face synthesis. However, most existing works treat speech features as a monolithic representation and fail to capture their fine-grained roles in driving different facial motions, while also overlooking the importance of modeling keyframes with intense dynamics. To address these limitations, we propose KSDiff, a Keyframe-Augmented Speech-Aware Dual-Path Diffusion framework. Specifically, the raw audio and transcript are processed by a Dual-Path Speech Encoder (DPSE) to disentangle expression-related and head-pose-related features, while an autoregressive Keyframe Establishment Learning (KEL) module predicts the most salient motion frames. These components are integrated into a Dual-path Motion generator to synthesize coherent and realistic facial motions. Extensive experiments on HDTF and VoxCeleb demonstrate that KSDiff achieves state-of-the-art performance, with improvements in both lip synchronization accuracy and head-pose naturalness. Our results highlight the effectiveness of combining speech disentanglement with keyframe-aware diffusion for talking-head generation. The demo page is available at: https://kincin.github.io/KSDiff/.

representative citing papers

KSDiff: Keyframe-Augmented Speech-Aware Dual-Path Diffusion for Facial Animation

cs.GR · 2025-09-24 · unverdicted · novelty 6.0

KSDiff introduces dual-path speech disentanglement and autoregressive keyframe prediction inside a diffusion model to improve lip synchronization and head-pose realism in audio-driven facial animation.