Vasa-1: Lifelike audio-driven talking faces generated in real time. Advances in Neural Information Processing Systems, 37:660–684, 2024.
2 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CV 2
verdicts
UNVERDICTED 2
representative citing papers
FlashLips delivers 100+ FPS mask-free lip-sync by reconstructing target frames in latent space from an audio-predicted lips-pose vector using a compact U-Net trained solely on reconstruction losses and self-supervised mask removal.
citing papers explorer
- Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video
C-MET transfers emotions from speech to facial video by learning cross-modal semantic vectors with pretrained audio and disentangled expression encoders, yielding 14% higher emotion accuracy on MEAD and CREMA-D even for unseen emotions.
- FlashLips: 100-FPS Mask-Free Latent Lip-Sync using Reconstruction Instead of Diffusion or GANs
FlashLips delivers 100+ FPS mask-free lip-sync by reconstructing target frames in latent space from an audio-predicted lips-pose vector using a compact U-Net trained solely on reconstruction losses and self-supervised mask removal.