FLOAT: generative motion latent flow matching for audio-driven talking portrait. CoRR, abs/2412.01064
3 Pith papers cite this work.
Fields: cs.CV (3)
Citing papers
- Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video
C-MET transfers emotions from speech to facial video by learning cross-modal semantic vectors with pretrained audio and disentangled expression encoders, yielding 14% higher emotion accuracy on MEAD and CREMA-D even for unseen emotions.
- THEval. Evaluation Framework for Talking Head Video Generation
THEval proposes eight metrics for evaluating talking head videos on quality, naturalness, and synchronization, tested on 85,000 videos from 17 models with a new curated dataset.
- PortraitDirector: A Hierarchical Disentanglement Framework for Controllable and Real-time Facial Reenactment
PortraitDirector uses hierarchical disentanglement of spatial physical motions and semantic emotions to deliver controllable, high-fidelity real-time facial reenactment at 20 FPS.