pith. machine review for the scientific record.

arXiv:2604.07786 · v2 · submitted 2026-04-09 · 💻 cs.CV · cs.LG

Recognition: no theorem link

Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 17:33 UTC · model grok-4.3

classification 💻 cs.CV · cs.LG
keywords cross-modal emotion transfer · talking face generation · emotion editing · facial expression synthesis · video generation · multimodal learning · generative models

The pith

A method that learns emotion as the difference between audio and visual embeddings generates more accurate talking face videos with a wider range of expressions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Cross-Modal Emotion Transfer to generate facial expressions in talking videos by learning vectors that capture emotional differences across the speech and face modalities. It uses a pretrained audio encoder and a disentangled facial expression encoder to compute these vectors as embedding differences, then applies them to edit emotions. This addresses the limits of label-based, audio-based, and reference-image methods. Sympathetic readers would care if this leads to more natural and varied emotional expression in AI-generated videos without needing dedicated training data for each emotion. The approach reports 14% higher emotion accuracy on the MEAD and CREMA-D benchmarks.

Core claim

C-MET learns emotion semantic vectors as the difference between embeddings of two different emotional states across modalities, using a large-scale pretrained audio encoder and a disentangled facial expression encoder. These vectors are taken to represent pure emotional content and are used to generate expressive talking face videos from speech, including for unseen extended emotions.
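
One way to write that claim down, as a formalization by this review rather than notation from the paper (the encoder symbols, pairing, and sign convention are assumptions consistent with the Figure 3 caption):

```latex
\[
  \mathbf{v}_{\mathrm{emo}}
    = E_a\!\bigl(x_a^{(\mathrm{emo}_1)}\bigr) - E_v\!\bigl(x_v^{(\mathrm{emo}_2)}\bigr),
  \qquad
  \hat{\mathbf{e}}_t = \mathbf{e}_t + \mathbf{v}_{\mathrm{emo}},
\]
```

where \(E_a\) is the pretrained audio encoder, \(E_v\) the disentangled facial expression encoder, and \(\mathbf{e}_t\) the expression embedding of a frame being edited. If identity and linguistic content cancel in the subtraction, \(\mathbf{v}_{\mathrm{emo}}\) is the shift from emotion 2 to emotion 1, and adding it moves a frame's expression toward the target emotion.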

What carries the argument

Emotion semantic vectors, computed as the difference between embeddings from a pretrained audio encoder and a disentangled facial expression encoder, which isolate emotional content for transfer.
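
A minimal sketch of that machinery, assuming a shared (or aligned) embedding space; every name here (`audio_encoder`, `expr_encoder`, `decoder`) is a placeholder chosen by this review, not a module from the paper's code:

```python
import torch

def emotion_semantic_vector(audio_encoder, expr_encoder, speech, ref_frames):
    """Emotion vector as a cross-modal embedding difference (a sketch).

    `audio_encoder` stands in for the large-scale pretrained speech model,
    `expr_encoder` for the disentangled facial expression encoder.
    """
    with torch.no_grad():                   # both pretrained encoders stay frozen here
        e_audio = audio_encoder(speech)     # (B, D) emotion-bearing speech embedding
        e_face = expr_encoder(ref_frames)   # (B, D) expression embedding of reference frames
    return e_audio - e_face                 # the "emotion semantic vector"

def edit_expression(expr_encoder, decoder, frames, v_emo, strength=1.0):
    """Shift the expression of the frames being edited toward the speech's emotion.

    Applied to frames other than the reference used to compute `v_emo`;
    `strength` gestures at the continuous editing of Figure 7, and whether
    the paper scales the vector this way is an assumption of this sketch.
    """
    e_frames = expr_encoder(frames)
    return decoder(e_frames + strength * v_emo)  # decode edited embeddings to video frames
```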

If this is right

  • The generated videos achieve 14% higher emotion accuracy than state-of-the-art methods on MEAD and CREMA-D datasets.
  • Extended emotions not seen during training can still be rendered in the output videos.
  • The method works with speech input alone without requiring reference face images for the target emotion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the vectors successfully disentangle emotion, the method could generalize to editing emotions in real videos or other media like animations.
  • Pretrained encoders allow the approach to benefit from advances in audio and vision models without full retraining.
  • Testing on diverse speech styles could reveal if linguistic content is fully removed from the vectors.

Load-bearing premise

The differences between audio and visual embeddings capture only emotion and exclude any remaining linguistic or identity information.

What would settle it

Observing whether generated face videos correctly express the target emotion when the driving speech contains mismatched linguistic content or when applying emotions absent from the training set.
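
A concrete version of that test, sketched with scikit-learn (this is the review's suggestion, not an experiment the paper reports): train a linear probe to recover a nuisance factor, such as sentence identity or speaker identity, from the difference vectors. Chance-level accuracy supports the load-bearing premise; accuracy well above chance undercuts it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def residual_information_probe(diff_vectors, nuisance_labels, n_folds=5):
    """Probe emotion semantic vectors for information that should be absent.

    diff_vectors: (N, D) array of cross-modal difference vectors.
    nuisance_labels: (N,) labels for a factor the vectors should NOT encode,
        e.g. spoken-sentence ID (linguistic content) or speaker ID (identity).
    Returns (mean cross-validated probe accuracy, chance level).
    """
    probe = LogisticRegression(max_iter=1000)
    scores = cross_val_score(probe, diff_vectors, nuisance_labels, cv=n_folds)
    chance = 1.0 / len(np.unique(nuisance_labels))
    return scores.mean(), chance

# Hypothetical usage on vectors extracted from MEAD utterances:
# acc, chance = residual_information_probe(vectors, sentence_ids)
# acc near chance is evidence for the premise; acc well above chance is evidence against.
```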

Figures

Figures reproduced from arXiv: 2604.07786 by Chanhyuk Choi, Donggyu Lee, Siyeol Jung, Taehwan Kim, Taesoo Kim.

Figure 1.
Figure 2. The comparison of emotion condition modules in the…
Figure 3. Overview of the proposed Cross-Modal Emotion Transfer (C-MET). (a) We extract input and target embeddings using pretrained audio and visual encoders, and compute the semantic vectors by subtracting the target embeddings from the inputs. (b) During training, we apply contrastive learning between multimodal tokens—both from visual to audio and audio to visual—to align the representation spaces. (c) A multi…
Figure 4. Qualitative results for angry (left) and sarcastic (right), respectively.
Figure 6. Qualitative analysis of integrating C-MET into disentan…
Figure 7. The results of continuous emotion editing.
Figure 8. Trend of emotion accuracy as the number of emotional…
Figure 10. Emotion editing results on extended emotions.
Figure 11. Confusion matrices of input (x-axis) and predicted emotions (y-axis) on MEAD across models.
Figure 12. Example interface used in our user study comparing…
read the original abstract

Talking face generation has gained significant attention as a core application of generative models. To enhance the expressiveness and realism of synthesized videos, emotion editing in talking face video plays a crucial role. However, existing approaches often limit expressive flexibility and struggle to generate extended emotions. Label-based methods represent emotions with discrete categories, which fail to capture a wide range of emotions. Audio-based methods can leverage emotionally rich speech signals - and even benefit from expressive text-to-speech (TTS) synthesis - but they fail to express the target emotions because emotion and linguistic content are entangled in emotional speech. Image-based methods, on the other hand, rely on target reference images to guide emotion transfer, yet they require high-quality frontal views and face challenges in acquiring reference data for extended emotions (e.g., sarcasm). To address these limitations, we propose Cross-Modal Emotion Transfer (C-MET), a novel approach that generates facial expressions from speech by modeling emotion semantic vectors between the speech and visual feature spaces. C-MET leverages a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn emotion semantic vectors that represent the difference between two different emotional embeddings across modalities. Extensive experiments on the MEAD and CREMA-D datasets demonstrate that our method improves emotion accuracy by 14% over state-of-the-art methods, while generating expressive talking face videos - even for unseen extended emotions. Code, checkpoint, and demo are available at https://chanhyeok-choi.github.io/C-MET/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Cross-Modal Emotion Transfer (C-MET) for emotion editing in talking face videos. It computes emotion semantic vectors as the difference between embeddings from a large-scale pretrained audio encoder and a disentangled facial expression encoder to transfer emotional content across modalities. The approach is claimed to overcome limitations of label-based, audio-based, and image-based methods, with experiments on MEAD and CREMA-D datasets reporting a 14% improvement in emotion accuracy over state-of-the-art methods and successful generalization to unseen extended emotions.

Significance. If the central assumption holds—that simple subtraction of embeddings isolates transferable pure emotion without residual linguistic or identity entanglement—the method could enable more flexible and expressive talking-face generation, particularly for emotions lacking discrete labels or suitable reference images. Public release of code, checkpoints, and demos is a clear strength that supports reproducibility and extension.

major comments (2)
  1. [Abstract (method description)] The central claim of a 14% emotion accuracy gain and generalization to unseen emotions rests on emotion semantic vectors faithfully capturing pure emotional content. The abstract provides no description of auxiliary losses, adversarial objectives, or post-hoc verification (e.g., probing for residual linguistic or speaker-identity factors in the difference vectors) that would enforce or measure this isolation.
  2. [Abstract (experiments)] No information is given on the specific baselines, exact emotion accuracy metric, statistical significance testing, or ablation studies that would substantiate the reported 14% improvement on MEAD and CREMA-D. Without these, the performance gain cannot be assessed as load-bearing evidence for the cross-modal transfer approach.
minor comments (2)
  1. [Abstract] The abstract states that the method 'generates expressive talking face videos - even for unseen extended emotions' but does not clarify how extended emotions are defined or sampled in the evaluation protocol.
  2. [Abstract] Notation for the emotion semantic vector (difference between audio and visual embeddings) is introduced without an accompanying equation or diagram in the provided abstract, which reduces clarity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity on the method's isolation mechanism and experimental details.

read point-by-point responses
  1. Referee: [Abstract (method description)] The central claim of a 14% emotion accuracy gain and generalization to unseen emotions rests on emotion semantic vectors faithfully capturing pure emotional content. The abstract provides no description of auxiliary losses, adversarial objectives, or post-hoc verification (e.g., probing for residual linguistic or speaker-identity factors in the difference vectors) that would enforce or measure this isolation.

    Authors: We agree the abstract is concise and could better signal the isolation approach. The full manuscript (Section 3) describes the disentangled facial expression encoder trained with reconstruction and contrastive auxiliary losses to separate emotion from linguistic content and identity, enabling the cross-modal subtraction. Generalization results on unseen emotions and ablation studies removing these losses provide supporting evidence for effective isolation. We will revise the abstract to briefly note the role of the disentangled encoder and auxiliary objectives in achieving pure emotion transfer. revision: yes

  2. Referee: [Abstract (experiments)] No information is given on the specific baselines, exact emotion accuracy metric, statistical significance testing, or ablation studies that would substantiate the reported 14% improvement on MEAD and CREMA-D. Without these, the performance gain cannot be assessed as load-bearing evidence for the cross-modal transfer approach.

    Authors: The abstract summarizes the headline result, while Section 4 provides the full experimental details: comparisons against multiple SOTA baselines (both audio-driven and image-driven methods), emotion accuracy computed via a pretrained classifier on generated frames, mean accuracies with standard deviations, paired t-tests for significance, and dedicated ablation tables. We will revise the abstract to specify the evaluation metric and note that comprehensive baselines, significance testing, and ablations appear in the experiments section. revision: yes
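
The rebuttal leans on contrastive auxiliary losses, and the Figure 3 caption describes contrastive learning between multimodal tokens "both from visual to audio and audio to visual." A minimal sketch of a symmetric InfoNCE objective consistent with that description follows; the temperature, batching, and token pooling are assumptions, since the paper's exact loss is not given in the extracted text.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(audio_tokens, visual_tokens, temperature=0.07):
    """Symmetric InfoNCE over paired audio/visual tokens (a sketch, not the
    paper's verbatim objective). Matching pairs sit on the diagonal of the
    similarity matrix; each direction treats the other modality's pair as the
    positive and the rest of the batch as negatives.
    """
    a = F.normalize(audio_tokens, dim=-1)             # (B, D)
    v = F.normalize(visual_tokens, dim=-1)            # (B, D)
    logits = a @ v.t() / temperature                  # (B, B) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    loss_a2v = F.cross_entropy(logits, targets)       # audio -> visual direction
    loss_v2a = F.cross_entropy(logits.t(), targets)   # visual -> audio direction
    return 0.5 * (loss_a2v + loss_v2a)
```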

Circularity Check

0 steps flagged

No circularity: empirical validation on external datasets

full rationale

The paper defines emotion semantic vectors explicitly as the difference between embeddings from a large-scale pretrained audio encoder and a disentangled facial expression encoder, then uses this construction to drive cross-modal transfer. However, the central performance claim (14% emotion accuracy gain and generalization to unseen emotions) is not derived by construction from this definition; it is instead reported from separate experiments on the public MEAD and CREMA-D datasets. No equations, self-citations, or fitted parameters are shown to reduce the accuracy improvement to a tautology or to prior author work. The method therefore remains self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim depends on the existence of transferable emotion semantics that can be isolated as vector differences across modalities using off-the-shelf pretrained models.

axioms (2)
  • domain assumption A large-scale pretrained audio encoder extracts emotion information that is sufficiently disentangled from linguistic content.
    Invoked to justify learning semantic vectors from speech embeddings.
  • domain assumption A disentangled facial expression encoder can produce embeddings whose differences correspond to pure emotion changes.
    Required for the visual side of the cross-modal transfer.
invented entities (1)
  • emotion semantic vector no independent evidence
    purpose: Represents the difference between two emotional embeddings across speech and visual modalities to enable transfer.
    Core new construct introduced to bridge the modalities without labels or reference images.
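
One cheap way to put independent evidence behind the invented entity and the second axiom, again as this review's suggestion rather than a reported experiment: if the difference vectors encode emotion and little else, vectors for the same emotion should align across speakers and sentences far more than vectors for different emotions.

```python
import torch
import torch.nn.functional as F

def emotion_vector_consistency(vectors, emotion_ids):
    """Mean cosine similarity of difference vectors within vs. across emotions.

    vectors: (N, D) tensor of emotion semantic vectors from varied speakers
        and sentences; emotion_ids: (N,) long tensor of emotion labels.
    A large within-minus-across gap supports the axiom that the vectors
    capture emotion rather than speaker or linguistic content.
    """
    v = F.normalize(vectors, dim=-1)
    sims = v @ v.t()                                        # (N, N) cosine similarities
    same = emotion_ids.unsqueeze(0) == emotion_ids.unsqueeze(1)
    diag = torch.eye(len(v), dtype=torch.bool, device=v.device)
    within = sims[same & ~diag].mean()                      # same emotion, different clips
    across = sims[~same].mean()                             # different emotions
    return within.item(), across.item()
```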

pith-pipeline@v0.9.0 · 5581 in / 1423 out tokens · 50708 ms · 2026-05-10T17:33:10.448373+00:00 · methodology


Reference graph

Works this paper leans on

74 extracted references · 9 canonical work pages · 4 internal anchors

  1. [1]

    Video rewrite: Driving visual speech with audio

    Christoph Bregler, Michele Covell, and Malcolm Slaney. Video rewrite: Driving visual speech with audio. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 715–722, 2023.

  2. [2]

    IEMOCAP: Interactive emotional dyadic motion capture database

    Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower Provost, Samuel Kim, Jeannette Chang, Sungbok Lee, and Shrikanth Narayanan. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42:335–359, 2008.

  3. [3]

    CREMA-D: Crowd-sourced emotional multimodal actors dataset

    Houwei Cao, David G Cooper, Michael K Keutmann, Ruben C Gur, Ani Nenkova, and Ragini Verma. CREMA-D: Crowd-sourced emotional multimodal actors dataset. IEEE Transactions on Affective Computing, 5(4):377–390, 2014.

  4. [4]

    EmoKnob: Enhance voice cloning with fine-grained emotion control

    Haozhe Chen, Run Chen, and Julia Hirschberg. EmoKnob: Enhance voice cloning with fine-grained emotion control. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8170–8180, Miami, Florida, USA, 2024. Association for Computational Linguistics.

  5. [5]

    Lip movements generation at a glance

    Lele Chen, Zhiheng Li, Ross K Maddox, Zhiyao Duan, and Chenliang Xu. Lip movements generation at a glance. In Proceedings of the European Conference on Computer Vision (ECCV), pages 520–535, 2018.

  6. [6]

    Hierarchical cross-modal talking face generation with dynamic pixel-wise loss

    Lele Chen, Ross K Maddox, Zhiyao Duan, and Chenliang Xu. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7832–7841, 2019.

  7. [7]

    Talking-head generation with rhythmic head motion

    Lele Chen, Guofeng Cui, Celong Liu, Zhong Li, Ziyi Kou, Yi Xu, and Chenliang Xu. Talking-head generation with rhythmic head motion. In European Conference on Computer Vision, pages 35–51. Springer, 2020.

  8. [8]

    VAST: Vivify your talking avatar via zero-shot expressive facial style transfer

    Liyang Chen, Zhiyong Wu, Runnan Li, Weihong Bao, Jun Ling, Xu Tan, and Sheng Zhao. VAST: Vivify your talking avatar via zero-shot expressive facial style transfer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2977–2987, 2023.

  9. [9]

    Executing your commands via motion diffusion in latent space

    Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18000–18010, 2023.

  10. [10]

    EchoMimic: Lifelike audio-driven portrait animations through editable landmark conditions

    Zhiyuan Chen, Jiajiong Cao, Zhiquan Chen, Yuming Li, and Chenguang Ma. EchoMimic: Lifelike audio-driven portrait animations through editable landmark conditions. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2403–2410, 2025.

  11. [11]

    Out of time: Automated lip sync in the wild

    Joon Son Chung and Andrew Zisserman. Out of time: Automated lip sync in the wild. In Asian Conference on Computer Vision, pages 251–263. Springer, 2016.

  12. [12]

    Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.

  13. [13]

    EmoHuman: Fine-grained emotion-controlled talking head generation via audio-text multimodal detangling

    Qifeng Dai, Huidong Feng, Wendi Cui, Xinqi Cai, Yinglin Zheng, and Ming Zeng. EmoHuman: Fine-grained emotion-controlled talking head generation via audio-text multimodal detangling. In Proceedings of the 2025 International Conference on Multimedia Retrieval, pages 145–154, 2025.

  14. [14]

    MotionLCM: Real-time controllable motion generation via latent consistency model

    Wenxun Dai, Ling-Hao Chen, Jingbo Wang, Jinpeng Liu, Bo Dai, and Yansong Tang. MotionLCM: Real-time controllable motion generation via latent consistency model. In European Conference on Computer Vision, pages 390–408. Springer.

  15. [15]

    Speech-driven facial animation using cascaded GANs for learning of motion and texture

    Dipanjan Das, Sandika Biswas, Sanjana Sinha, and Brojeshwar Bhowmick. Speech-driven facial animation using cascaded GANs for learning of motion and texture. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX, pages 408–424. Springer, 2020.

  16. [16]

    An argument for basic emotions

    Paul Ekman. An argument for basic emotions. Cognition & Emotion, 6(3-4):169–200, 1992.

  17. [17]

    SPACE: Speech-driven portrait animation with controllable expression

    Gururani et al. SPACE: Speech-driven portrait animation with controllable expression. In ICCV, 2023.

  18. [18]

    Qwen2.5-Omni technical report

    Xu et al. Qwen2.5-Omni technical report. arXiv, 2025.

  19. [19]

    Efficient emotional adaptation for audio-driven talking-head generation

    Yuan Gan, Zongxin Yang, Xihang Yue, Lingyun Sun, and Yi Yang. Efficient emotional adaptation for audio-driven talking-head generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22634–22645, 2023.

  20. [20]

    Emotionally enhanced talking face generation

    Sahil Goyal, Sarthak Bhagat, Shagun Uppal, Hitkul Jangra, Yi Yu, Yifang Yin, and Rajiv Ratn Shah. Emotionally enhanced talking face generation. In Proceedings of the 1st International Workshop on Multimedia Content Generation and Evaluation: New Methods and Practice, pages 81–90.

  21. [21]

    Affective conversational agents: Understanding expectations and personal influences

    Javier Hernandez, Jina Suh, Judith Amores, Kael Rowan, Gonzalo Ramos, and Mary Czerwinski. Affective conversational agents: Understanding expectations and personal influences. arXiv preprint arXiv:2310.12459, 2023.

  22. [22]

    GANs trained by a two time-scale update rule converge to a local Nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.

  23. [23]

    GPT-4o system card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.

  24. [24]

    EAMM: One-shot emotional talking face via audio-based emotion-aware motion model

    Xinya Ji, Hang Zhou, Kaisiyuan Wang, Qianyi Wu, Wayne Wu, Feng Xu, and Xun Cao. EAMM: One-shot emotional talking face via audio-based emotion-aware motion model. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–10, 2022.

  25. [25]

    FLOAT: Generative motion latent flow matching for audio-driven talking portrait

    Taekyung Ki, Dongchan Min, and Gyeongsu Chae. FLOAT: Generative motion latent flow matching for audio-driven talking portrait. arXiv preprint arXiv:2412.01064, 2024.

  26. [26]

    Expressive talking head generation with granular audio-visual control

    Borong Liang, Yan Pan, Zhizhi Guo, Hang Zhou, Zhibin Hong, Xiaoguang Han, Junyu Han, Jingtuo Liu, Errui Ding, and Jingdong Wang. Expressive talking head generation with granular audio-visual control. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3387–3396, 2022.

  27. [27]

    MVPortrait: Text-guided motion and emotion control for multi-view vivid portrait animation

    Yukang Lin, Hokit Fung, Jianjin Xu, Zeping Ren, Adela SM Lau, Guosheng Yin, and Xiu Li. MVPortrait: Text-guided motion and emotion control for multi-view vivid portrait animation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26242–26252, 2025.

  28. [28]

    MoEE: Mixture of emotion experts for audio-driven portrait animation

    Huaize Liu, Wenzhang Sun, Donglin Di, Shibo Sun, Jiahui Yang, Changqing Zou, and Hujun Bao. MoEE: Mixture of emotion experts for audio-driven portrait animation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26222–26231, 2025.

  29. [29]

    Semantic-aware implicit neural audio-driven video portrait generation

    Xian Liu, Yinghao Xu, Qianyi Wu, Hang Zhou, Wayne Wu, and Bolei Zhou. Semantic-aware implicit neural audio-driven video portrait generation. In European Conference on Computer Vision, pages 106–125. Springer, 2022.

  30. [30]

    Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings

    Reza Lotfian and Carlos Busso. Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings. IEEE Transactions on Affective Computing, PP:1–1, 2017.

  31. [31]

    StyleTalk: One-shot talking head generation with controllable speaking styles

    Yifeng Ma, Suzhen Wang, Zhipeng Hu, Changjie Fan, Tangjie Lv, Yu Ding, Zhidong Deng, and Xin Yu. StyleTalk: One-shot talking head generation with controllable speaking styles. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1896–1904, 2023.

  32. [32]

    emotion2vec: Self-supervised pre-training for speech emotion representation

    Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, Shiliang Zhang, and Xie Chen. emotion2vec: Self-supervised pre-training for speech emotion representation. Proc. ACL 2024 Findings, 2024.

  33. [33]

    MA-AVT: Modality alignment for parameter-efficient audio-visual transformers

    Tanvir Mahmud, Shentong Mo, Yapeng Tian, and Diana Marculescu. MA-AVT: Modality alignment for parameter-efficient audio-visual transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7996–8005, 2024.

  34. [34]

    Frame attention networks for facial expression recognition in videos

    Debin Meng, Xiaojiang Peng, Kai Wang, and Yu Qiao. Frame attention networks for facial expression recognition in videos. In 2019 IEEE International Conference on Image Processing (ICIP), pages 3866–3870. IEEE, 2019.

  35. [35]

    DPE: Disentanglement of pose and expression for general video portrait editing

    Youxin Pang, Yong Zhang, Weize Quan, Yanbo Fan, Xiaodong Cun, Ying Shan, and Dong-ming Yan. DPE: Disentanglement of pose and expression for general video portrait editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 427–436.

  36. [36]

    AI-generated characters for supporting personalized learning and well-being

    Pat Pataranutaporn, Valdemar Danry, Joanne Leong, Parinya Punpongsanon, Dan Novy, Pattie Maes, and Misha Sra. AI-generated characters for supporting personalized learning and well-being. Nature Machine Intelligence, 3(12):1013–1022, 2021.

  37. [37]

    EmoTalk: Speech-driven emotional disentanglement for 3D face animation

    Ziqiao Peng, Haoyu Wu, Zhenbo Song, Hao Xu, Xiangyu Zhu, Jun He, Hongyan Liu, and Zhaoxin Fan. EmoTalk: Speech-driven emotional disentanglement for 3D face animation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20687–20697, 2023.

  38. [38]

    MELD: A multimodal multi-party dataset for emotion recognition in conversations

    Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. MELD: A multimodal multi-party dataset for emotion recognition in conversations. arXiv preprint arXiv:1810.02508, 2018.

  39. [39]

    A lip sync expert is all you need for speech to lip generation in the wild

    KR Prajwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, and CV Jawahar. A lip sync expert is all you need for speech to lip generation in the wild. In Proceedings of the 28th ACM International Conference on Multimedia, pages 484–492, 2020.

  40. [40]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  41. [41]

    Empathy in virtual agents: How emotional expressions can influence user perception

    Sebastian Rings, Susanne Schmidt, Julia Janßen, Nale Lehmann-Willenbrock, Frank Steinicke, S Hasegawa, N Sakata, and V Sundstedt. Empathy in virtual agents: How emotional expressions can influence user perception. ICAT-EGVE, 2024.

  42. [42]

    Empathetic conversational agents: Utilizing neural and physiological signals for enhanced empathetic interactions

    Nastaran Saffaryazdi, Tamil Selvan Gunasekaran, Kate Loveys, Elizabeth Broadbent, and Mark Billinghurst. Empathetic conversational agents: Utilizing neural and physiological signals for enhanced empathetic interactions. International Journal of Human–Computer Interaction, pages 1–25, 2025.

  43. [43]

    Learning dynamic facial radiance fields for few-shot talking head synthesis

    Shuai Shen, Wanhua Li, Zheng Zhu, Yueqi Duan, Jie Zhou, and Jiwen Lu. Learning dynamic facial radiance fields for few-shot talking head synthesis. In European Conference on Computer Vision, pages 666–682. Springer, 2022.

  44. [44]

    DiffTalk: Crafting diffusion models for generalized audio-driven portraits animation

    Shuai Shen, Wenliang Zhao, Zibin Meng, Wanhua Li, Zheng Zhu, Jie Zhou, and Jiwen Lu. DiffTalk: Crafting diffusion models for generalized audio-driven portraits animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1982–1991, 2023.

  45. [45]

    EmoTalker: Audio driven emotion aware talking head generation

    Xiaoqian Shen, Faizan Farooq Khan, and Mohamed Elhoseiny. EmoTalker: Audio driven emotion aware talking head generation. In Proceedings of the Asian Conference on Computer Vision, pages 1900–1917, 2024.

  46. [46]

    EmoHead: Emotional talking head via manipulating semantic expression parameters

    Xuli Shen, Hua Cai, Dingding Yu, Weilin Shen, Qing Xu, and Xiangyang Xue. EmoHead: Emotional talking head via manipulating semantic expression parameters. arXiv preprint arXiv:2503.19416, 2025.

  47. [47]

    Continuously controllable facial expression editing in talking face videos

    Zhiyao Sun, Yu-Hui Wen, Tian Lv, Yanan Sun, Ziyang Zhang, Yaoyuan Wang, and Yong-Jin Liu. Continuously controllable facial expression editing in talking face videos. IEEE Transactions on Affective Computing, 15(3):1400–1413, 2023.

  48. [48]

    FG-EmoTalk: Talking head video generation with fine-grained controllable facial expressions

    Zhaoxu Sun, Yuze Xuan, Fang Liu, and Yang Xiang. FG-EmoTalk: Talking head video generation with fine-grained controllable facial expressions. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5043–5051.

  49. [49]

    EMMN: Emotional motion memory network for audio-driven emotional talking face generation

    Shuai Tan, Bin Ji, and Ye Pan. EMMN: Emotional motion memory network for audio-driven emotional talking face generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22146–22156, 2023.

  50. [50]

    EDTalk: Efficient disentanglement for emotional talking head synthesis

    Shuai Tan, Bin Ji, Mengxiao Bi, and Ye Pan. EDTalk: Efficient disentanglement for emotional talking head synthesis. In European Conference on Computer Vision, pages 398–. Springer, 2024.

  52. [52]

    Style2Talker: High-resolution talking head generation with emotion style and art style

    Shuai Tan, Bin Ji, and Ye Pan. Style2Talker: High-resolution talking head generation with emotion style and art style. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5079–5087, 2024.

  53. [53]

    Gemini: A family of highly capable multimodal models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

  54. [54]

    Neural voice puppetry: Audio-driven facial reenactment

    Justus Thies, Mohamed Elgharib, Ayush Tewari, Christian Theobalt, and Matthias Nießner. Neural voice puppetry: Audio-driven facial reenactment. In European Conference on Computer Vision, pages 716–731. Springer, 2020.

  55. [55]

    EMO: Emote portrait alive - generating expressive portrait videos with Audio2Video diffusion model under weak conditions

    Linrui Tian, Qi Wang, Bang Zhang, and Liefeng Bo. EMO: Emote portrait alive - generating expressive portrait videos with Audio2Video diffusion model under weak conditions. In European Conference on Computer Vision, pages 244–260. Springer, 2024.

  56. [56]

    Towards accurate generative models of video: A new metric & challenges

    Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.

  57. [57]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

  58. [58]

    Progressive disentangled representation learning for fine-grained controllable talking head synthesis

    Duomin Wang, Yu Deng, Zixin Yin, Heung-Yeung Shum, and Baoyuan Wang. Progressive disentangled representation learning for fine-grained controllable talking head synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17979–17989, 2023.

  59. [59]

    EAT-Face: Emotion-controllable audio-driven talking face generation via diffusion model

    Haodi Wang, Xiaojun Jia, and Xiaochun Cao. EAT-Face: Emotion-controllable audio-driven talking face generation via diffusion model. In 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG), pages 1–10. IEEE, 2024.

  60. [60]

    EmotiveTalk: Expressive talking head generation through audio information decoupling and emotional video diffusion

    Haotian Wang, Yuzhe Weng, Yueyan Li, Zilu Guo, Jun Du, Shutong Niu, Jiefeng Ma, Shan He, Xiaoyan Wu, Qiming Hu, et al. EmotiveTalk: Expressive talking head generation through audio information decoupling and emotional video diffusion. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26212–26221, 2025.

  61. [61]

    LipFormer: High-fidelity and generalizable talking face generation with a pre-learned facial codebook

    Jiayu Wang, Kang Zhao, Shiwei Zhang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. LipFormer: High-fidelity and generalizable talking face generation with a pre-learned facial codebook. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13844–13853, 2023.

  62. [62]

    MEAD: A large-scale audio-visual dataset for emotional talking-face generation

    Kaisiyuan Wang, Qianyi Wu, Linsen Song, Zhuoqian Yang, Wayne Wu, Chen Qian, Ran He, Yu Qiao, and Chen Change Loy. MEAD: A large-scale audio-visual dataset for emotional talking-face generation. In European Conference on Computer Vision, pages 700–717. Springer, 2020.

  63. [63]

    Audio2Head: Audio-driven one-shot talking-head generation with natural head motion

    S Wang, L Li, Y Ding, C Fan, and X Yu. Audio2Head: Audio-driven one-shot talking-head generation with natural head motion. In International Joint Conference on Artificial Intelligence. IJCAI, 2021.

  64. [64]

    One-shot talking face generation from single-speaker audio-visual correlation learning

    Suzhen Wang, Lincheng Li, Yu Ding, and Xin Yu. One-shot talking face generation from single-speaker audio-visual correlation learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2531–2539, 2022.

  65. [65]

    VASA-1: Lifelike audio-driven talking faces generated in real time

    Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, and Baining Guo. VASA-1: Lifelike audio-driven talking faces generated in real time. Advances in Neural Information Processing Systems, 37:660–684, 2024.

  66. [66]

    Face2Faceρ: Real-time high-resolution one-shot face reenactment

    Kewei Yang, Kang Chen, Daoliang Guo, Song-Hai Zhang, Yuan-Chen Guo, and Weidong Zhang. Face2Faceρ: Real-time high-resolution one-shot face reenactment. In European Conference on Computer Vision, pages 55–71. Springer, 2022.

  67. [67]

    StyleHEAT: One-shot high-resolution editable talking face generation via pre-trained StyleGAN

    Fei Yin, Yong Zhang, Xiaodong Cun, Mingdeng Cao, Yanbo Fan, Xuan Wang, Qingyan Bai, Baoyuan Wu, Jue Wang, and Yujiu Yang. StyleHEAT: One-shot high-resolution editable talking face generation via pre-trained StyleGAN. In European Conference on Computer Vision, pages 85–101. Springer, 2022.

  68. [68]

    Talking head generation with probabilistic audio-to-visual diffusion priors

    Zhentao Yu, Zixin Yin, Deyu Zhou, Duomin Wang, Finn Wong, and Baoyuan Wang. Talking head generation with probabilistic audio-to-visual diffusion priors. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7645–7655, 2023.

  69. [69]

    Few-shot adversarial learning of realistic neural talking head models

    Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, and Victor Lempitsky. Few-shot adversarial learning of realistic neural talking head models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9459–9468, 2019.

  70. [70]

    Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset

    Zhimeng Zhang, Lincheng Li, Yu Ding, and Changjie Fan. Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3661–3670, 2021.

  71. [71]

    Identity-preserving talking face generation with landmark and appearance priors

    Weizhi Zhong, Chaowei Fang, Yinqi Cai, Pengxu Wei, Gangming Zhao, Liang Lin, and Guanbin Li. Identity-preserving talking face generation with landmark and appearance priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2023.

  72. [72]

    Talking face generation by adversarially disentangled audio-visual representation

    Hang Zhou, Yu Liu, Ziwei Liu, Ping Luo, and Xiaogang Wang. Talking face generation by adversarially disentangled audio-visual representation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 9299–9306, 2019.

  73. [73]

    Pose-controllable talking face generation by implicitly modularized audio-visual representation

    Hang Zhou, Yasheng Sun, Wayne Wu, Chen Change Loy, Xiaogang Wang, and Ziwei Liu. Pose-controllable talking face generation by implicitly modularized audio-visual representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4176–4186.

  74. [74]

    MakeItTalk: Speaker-aware talking-head animation

    Yang Zhou, Xintong Han, Eli Shechtman, Jose Echevarria, Evangelos Kalogerakis, and Dingzeyu Li. MakeItTalk: Speaker-aware talking-head animation. ACM Transactions on Graphics (TOG), 39(6):1–15, 2020.