pith. machine review for the scientific record.

arXiv:2604.07786 · v2 · submitted 2026-04-09 · 💻 cs.CV · cs.LG

Recognition: no theorem link

Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 17:33 UTC · model grok-4.3

classification 💻 cs.CV · cs.LG
keywords cross-modal emotion transfer · talking face generation · emotion editing · facial expression synthesis · video generation · multimodal learning · generative models

The pith

A method that learns emotion as the difference between audio and visual embeddings generates more accurate talking face videos with a wider range of expressions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Cross-Modal Emotion Transfer to generate facial expressions in talking videos by learning vectors that capture emotional differences across the speech and face modalities. It uses a pretrained audio encoder and a disentangled facial expression encoder to compute these vectors as embedding differences, then applies them to edit emotions. This addresses the limits of label-based, audio-based, and reference-image methods. Sympathetic readers would care if this leads to more natural and varied emotional expression in AI-generated videos without needing dedicated training data for each emotion. The approach reports 14% higher emotion accuracy on the MEAD and CREMA-D benchmarks.

Core claim

C-MET learns emotion semantic vectors as the difference between embeddings of two different emotional states across modalities, using a large-scale pretrained audio encoder and a disentangled facial expression encoder. These vectors are taken to represent pure emotional content and are used to generate expressive talking face videos from speech, including for unseen extended emotions.
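
One way to write that claim down, as a formalization by this review rather than notation from the paper (the encoder symbols, pairing, and sign convention are assumptions consistent with the Figure 3 caption):

```latex
\[
  \mathbf{v}_{\mathrm{emo}}
    = E_a\!\bigl(x_a^{(\mathrm{emo}_1)}\bigr) - E_v\!\bigl(x_v^{(\mathrm{emo}_2)}\bigr),
  \qquad
  \hat{\mathbf{e}}_t = \mathbf{e}_t + \mathbf{v}_{\mathrm{emo}},
\]
```

where \(E_a\) is the pretrained audio encoder, \(E_v\) the disentangled facial expression encoder, and \(\mathbf{e}_t\) the expression embedding of a frame being edited. If identity and linguistic content cancel in the subtraction, \(\mathbf{v}_{\mathrm{emo}}\) is the shift from emotion 2 to emotion 1, and adding it moves a frame's expression toward the target emotion.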

What carries the argument

Emotion semantic vectors, computed as the difference between embeddings from a pretrained audio encoder and a disentangled facial expression encoder, which isolate emotional content for transfer.
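
A minimal sketch of that machinery, assuming a shared (or aligned) embedding space; every name here (`audio_encoder`, `expr_encoder`, `decoder`) is a placeholder chosen by this review, not a module from the paper's code:

```python
import torch

def emotion_semantic_vector(audio_encoder, expr_encoder, speech, ref_frames):
    """Emotion vector as a cross-modal embedding difference (a sketch).

    `audio_encoder` stands in for the large-scale pretrained speech model,
    `expr_encoder` for the disentangled facial expression encoder.
    """
    with torch.no_grad():                   # both pretrained encoders stay frozen here
        e_audio = audio_encoder(speech)     # (B, D) emotion-bearing speech embedding
        e_face = expr_encoder(ref_frames)   # (B, D) expression embedding of reference frames
    return e_audio - e_face                 # the "emotion semantic vector"

def edit_expression(expr_encoder, decoder, frames, v_emo, strength=1.0):
    """Shift the expression of the frames being edited toward the speech's emotion.

    Applied to frames other than the reference used to compute `v_emo`;
    `strength` gestures at the continuous editing of Figure 7, and whether
    the paper scales the vector this way is an assumption of this sketch.
    """
    e_frames = expr_encoder(frames)
    return decoder(e_frames + strength * v_emo)  # decode edited embeddings to video frames
```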

If this is right

  • The generated videos achieve 14% higher emotion accuracy than state-of-the-art methods on MEAD and CREMA-D datasets.
  • Extended emotions not seen during training can still be rendered in the output videos.
  • The method works with speech input alone without requiring reference face images for the target emotion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the vectors successfully disentangle emotion, the method could generalize to editing emotions in real videos or other media like animations.
  • Pretrained encoders allow the approach to benefit from advances in audio and vision models without full retraining.
  • Testing on diverse speech styles could reveal if linguistic content is fully removed from the vectors.

Load-bearing premise

The differences between audio and visual embeddings capture only emotion and exclude any remaining linguistic or identity information.

What would settle it

Observing whether generated face videos correctly express the target emotion when the driving speech contains mismatched linguistic content or when applying emotions absent from the training set.
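
A concrete version of that test, sketched with scikit-learn (this is the review's suggestion, not an experiment the paper reports): train a linear probe to recover a nuisance factor, such as sentence identity or speaker identity, from the difference vectors. Chance-level accuracy supports the load-bearing premise; accuracy well above chance undercuts it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def residual_information_probe(diff_vectors, nuisance_labels, n_folds=5):
    """Probe emotion semantic vectors for information that should be absent.

    diff_vectors: (N, D) array of cross-modal difference vectors.
    nuisance_labels: (N,) labels for a factor the vectors should NOT encode,
        e.g. spoken-sentence ID (linguistic content) or speaker ID (identity).
    Returns (mean cross-validated probe accuracy, chance level).
    """
    probe = LogisticRegression(max_iter=1000)
    scores = cross_val_score(probe, diff_vectors, nuisance_labels, cv=n_folds)
    chance = 1.0 / len(np.unique(nuisance_labels))
    return scores.mean(), chance

# Hypothetical usage on vectors extracted from MEAD utterances:
# acc, chance = residual_information_probe(vectors, sentence_ids)
# acc near chance is evidence for the premise; acc well above chance is evidence against.
```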

Figures

Figures reproduced from arXiv: 2604.07786 by Chanhyuk Choi, Donggyu Lee, Siyeol Jung, Taehwan Kim, Taesoo Kim.

Figure 1.
Figure 2. The comparison of emotion condition modules in the…
Figure 3. Overview of the proposed Cross-Modal Emotion Transfer (C-MET). (a) We extract input and target embeddings using pretrained audio and visual encoders, and compute the semantic vectors by subtracting the target embeddings from the inputs. (b) During training, we apply contrastive learning between multimodal tokens—both from visual to audio and audio to visual—to align the representation spaces. (c) A multi…
Figure 4. Qualitative results for angry (left) and sarcastic (right), respectively.
Figure 6. Qualitative analysis of integrating C-MET into disentan…
Figure 7. The results of continuous emotion editing.
Figure 8. Trend of emotion accuracy as the number of emotional…
Figure 10. Emotion editing results on extended emotions.
Figure 11. Confusion matrices of input (x-axis) and predicted emotions (y-axis) on MEAD across models.
Figure 12. Example interface used in our user study comparing…
read the original abstract

Talking face generation has gained significant attention as a core application of generative models. To enhance the expressiveness and realism of synthesized videos, emotion editing in talking face video plays a crucial role. However, existing approaches often limit expressive flexibility and struggle to generate extended emotions. Label-based methods represent emotions with discrete categories, which fail to capture a wide range of emotions. Audio-based methods can leverage emotionally rich speech signals - and even benefit from expressive text-to-speech (TTS) synthesis - but they fail to express the target emotions because emotion and linguistic content are entangled in emotional speech. Image-based methods, on the other hand, rely on target reference images to guide emotion transfer, yet they require high-quality frontal views and face challenges in acquiring reference data for extended emotions (e.g., sarcasm). To address these limitations, we propose Cross-Modal Emotion Transfer (C-MET), a novel approach that generates facial expressions from speech by modeling emotion semantic vectors between the speech and visual feature spaces. C-MET leverages a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn emotion semantic vectors that represent the difference between two different emotional embeddings across modalities. Extensive experiments on the MEAD and CREMA-D datasets demonstrate that our method improves emotion accuracy by 14% over state-of-the-art methods, while generating expressive talking face videos - even for unseen extended emotions. Code, checkpoint, and demo are available at https://chanhyeok-choi.github.io/C-MET/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Cross-Modal Emotion Transfer (C-MET) for emotion editing in talking face videos. It computes emotion semantic vectors as the difference between embeddings from a large-scale pretrained audio encoder and a disentangled facial expression encoder to transfer emotional content across modalities. The approach is claimed to overcome limitations of label-based, audio-based, and image-based methods, with experiments on MEAD and CREMA-D datasets reporting a 14% improvement in emotion accuracy over state-of-the-art methods and successful generalization to unseen extended emotions.

Significance. If the central assumption holds—that simple subtraction of embeddings isolates transferable pure emotion without residual linguistic or identity entanglement—the method could enable more flexible and expressive talking-face generation, particularly for emotions lacking discrete labels or suitable reference images. Public release of code, checkpoints, and demos is a clear strength that supports reproducibility and extension.

major comments (2)
  1. [Abstract (method description)] The central claim of a 14% emotion accuracy gain and generalization to unseen emotions rests on emotion semantic vectors faithfully capturing pure emotional content. The abstract provides no description of auxiliary losses, adversarial objectives, or post-hoc verification (e.g., probing for residual linguistic or speaker-identity factors in the difference vectors) that would enforce or measure this isolation.
  2. [Abstract (experiments)] No information is given on the specific baselines, exact emotion accuracy metric, statistical significance testing, or ablation studies that would substantiate the reported 14% improvement on MEAD and CREMA-D. Without these, the performance gain cannot be assessed as load-bearing evidence for the cross-modal transfer approach.
minor comments (2)
  1. [Abstract] The abstract states that the method 'generates expressive talking face videos - even for unseen extended emotions' but does not clarify how extended emotions are defined or sampled in the evaluation protocol.
  2. [Abstract] Notation for the emotion semantic vector (difference between audio and visual embeddings) is introduced without an accompanying equation or diagram in the provided abstract, which reduces clarity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity on the method's isolation mechanism and experimental details.

read point-by-point responses
  1. Referee: [Abstract (method description)] The central claim of a 14% emotion accuracy gain and generalization to unseen emotions rests on emotion semantic vectors faithfully capturing pure emotional content. The abstract provides no description of auxiliary losses, adversarial objectives, or post-hoc verification (e.g., probing for residual linguistic or speaker-identity factors in the difference vectors) that would enforce or measure this isolation.

    Authors: We agree the abstract is concise and could better signal the isolation approach. The full manuscript (Section 3) describes the disentangled facial expression encoder trained with reconstruction and contrastive auxiliary losses to separate emotion from linguistic content and identity, enabling the cross-modal subtraction. Generalization results on unseen emotions and ablation studies removing these losses provide supporting evidence for effective isolation. We will revise the abstract to briefly note the role of the disentangled encoder and auxiliary objectives in achieving pure emotion transfer. revision: yes

  2. Referee: [Abstract (experiments)] No information is given on the specific baselines, exact emotion accuracy metric, statistical significance testing, or ablation studies that would substantiate the reported 14% improvement on MEAD and CREMA-D. Without these, the performance gain cannot be assessed as load-bearing evidence for the cross-modal transfer approach.

    Authors: The abstract summarizes the headline result, while Section 4 provides the full experimental details: comparisons against multiple SOTA baselines (both audio-driven and image-driven methods), emotion accuracy computed via a pretrained classifier on generated frames, mean accuracies with standard deviations, paired t-tests for significance, and dedicated ablation tables. We will revise the abstract to specify the evaluation metric and note that comprehensive baselines, significance testing, and ablations appear in the experiments section. revision: yes
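
The rebuttal leans on contrastive auxiliary losses, and the Figure 3 caption describes contrastive learning between multimodal tokens "both from visual to audio and audio to visual." A minimal sketch of a symmetric InfoNCE objective consistent with that description follows; the temperature, batching, and token pooling are assumptions, since the paper's exact loss is not given in the extracted text.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(audio_tokens, visual_tokens, temperature=0.07):
    """Symmetric InfoNCE over paired audio/visual tokens (a sketch, not the
    paper's verbatim objective). Matching pairs sit on the diagonal of the
    similarity matrix; each direction treats the other modality's pair as the
    positive and the rest of the batch as negatives.
    """
    a = F.normalize(audio_tokens, dim=-1)             # (B, D)
    v = F.normalize(visual_tokens, dim=-1)            # (B, D)
    logits = a @ v.t() / temperature                  # (B, B) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    loss_a2v = F.cross_entropy(logits, targets)       # audio -> visual direction
    loss_v2a = F.cross_entropy(logits.t(), targets)   # visual -> audio direction
    return 0.5 * (loss_a2v + loss_v2a)
```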

Circularity Check

0 steps flagged

No circularity: empirical validation on external datasets

full rationale

The paper defines emotion semantic vectors explicitly as the difference between embeddings from a large-scale pretrained audio encoder and a disentangled facial expression encoder, then uses this construction to drive cross-modal transfer. However, the central performance claim (14% emotion accuracy gain and generalization to unseen emotions) is not derived by construction from this definition; it is instead reported from separate experiments on the public MEAD and CREMA-D datasets. No equations, self-citations, or fitted parameters are shown to reduce the accuracy improvement to a tautology or to prior author work. The method therefore remains self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim depends on the existence of transferable emotion semantics that can be isolated as vector differences across modalities using off-the-shelf pretrained models.

axioms (2)
  • domain assumption A large-scale pretrained audio encoder extracts emotion information that is sufficiently disentangled from linguistic content.
    Invoked to justify learning semantic vectors from speech embeddings.
  • domain assumption A disentangled facial expression encoder can produce embeddings whose differences correspond to pure emotion changes.
    Required for the visual side of the cross-modal transfer.
invented entities (1)
  • emotion semantic vector no independent evidence
    purpose: Represents the difference between two emotional embeddings across speech and visual modalities to enable transfer.
    Core new construct introduced to bridge the modalities without labels or reference images.
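
One cheap way to put independent evidence behind the invented entity and the second axiom, again as this review's suggestion rather than a reported experiment: if the difference vectors encode emotion and little else, vectors for the same emotion should align across speakers and sentences far more than vectors for different emotions.

```python
import torch
import torch.nn.functional as F

def emotion_vector_consistency(vectors, emotion_ids):
    """Mean cosine similarity of difference vectors within vs. across emotions.

    vectors: (N, D) tensor of emotion semantic vectors from varied speakers
        and sentences; emotion_ids: (N,) long tensor of emotion labels.
    A large within-minus-across gap supports the axiom that the vectors
    capture emotion rather than speaker or linguistic content.
    """
    v = F.normalize(vectors, dim=-1)
    sims = v @ v.t()                                        # (N, N) cosine similarities
    same = emotion_ids.unsqueeze(0) == emotion_ids.unsqueeze(1)
    diag = torch.eye(len(v), dtype=torch.bool, device=v.device)
    within = sims[same & ~diag].mean()                      # same emotion, different clips
    across = sims[~same].mean()                             # different emotions
    return within.item(), across.item()
```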

pith-pipeline@v0.9.0 · 5581 in / 1423 out tokens · 50708 ms · 2026-05-10T17:33:10.448373+00:00 · methodology


Reference graph

Works this paper leans on

74 extracted references · 9 canonical work pages · 4 internal anchors

  1. [1]

    Video rewrite: Driving visual speech with audio

    Christoph Bregler, Michele Covell, and Malcolm Slaney. Video rewrite: Driving visual speech with audio. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 715–722, 2023.

  2. [2]

    IEMOCAP: Interactive emotional dyadic motion capture database

    Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower Provost, Samuel Kim, Jeannette Chang, Sungbok Lee, and Shrikanth Narayanan. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42:335–359, 2008.

  3. [3]

    CREMA-D: Crowd-sourced emotional multimodal actors dataset

    Houwei Cao, David G Cooper, Michael K Keutmann, Ruben C Gur, Ani Nenkova, and Ragini Verma. CREMA-D: Crowd-sourced emotional multimodal actors dataset. IEEE Transactions on Affective Computing, 5(4):377–390, 2014.

  4. [4]

    EmoKnob: Enhance voice cloning with fine-grained emotion control

    Haozhe Chen, Run Chen, and Julia Hirschberg. EmoKnob: Enhance voice cloning with fine-grained emotion control. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8170–8180, Miami, Florida, USA, 2024. Association for Computational Linguistics.

  5. [5]

    Lip movements generation at a glance

    Lele Chen, Zhiheng Li, Ross K Maddox, Zhiyao Duan, and Chenliang Xu. Lip movements generation at a glance. In Proceedings of the European Conference on Computer Vision (ECCV), pages 520–535, 2018.

  6. [6]

    Hierarchical cross-modal talking face generation with dynamic pixel-wise loss

    Lele Chen, Ross K Maddox, Zhiyao Duan, and Chenliang Xu. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7832–7841, 2019.

  7. [7]

    Talking-head generation with rhythmic head motion

    Lele Chen, Guofeng Cui, Celong Liu, Zhong Li, Ziyi Kou, Yi Xu, and Chenliang Xu. Talking-head generation with rhythmic head motion. In European Conference on Computer Vision, pages 35–51. Springer, 2020.

  8. [8]

    VAST: Vivify your talking avatar via zero-shot expressive facial style transfer

    Liyang Chen, Zhiyong Wu, Runnan Li, Weihong Bao, Jun Ling, Xu Tan, and Sheng Zhao. VAST: Vivify your talking avatar via zero-shot expressive facial style transfer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2977–2987, 2023.

  9. [9]

    Executing your commands via motion diffusion in latent space

    Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18000–18010, 2023.

  10. [10]

    EchoMimic: Lifelike audio-driven portrait animations through editable landmark conditions

    Zhiyuan Chen, Jiajiong Cao, Zhiquan Chen, Yuming Li, and Chenguang Ma. EchoMimic: Lifelike audio-driven portrait animations through editable landmark conditions. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2403–2410, 2025.

  11. [11]

    Out of time: Automated lip sync in the wild

    Joon Son Chung and Andrew Zisserman. Out of time: Automated lip sync in the wild. In Asian Conference on Computer Vision, pages 251–263. Springer, 2016.

  12. [12]

    Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.

  13. [13]

    EmoHuman: Fine-grained emotion-controlled talking head generation via audio-text multimodal detangling

    Qifeng Dai, Huidong Feng, Wendi Cui, Xinqi Cai, Yinglin Zheng, and Ming Zeng. EmoHuman: Fine-grained emotion-controlled talking head generation via audio-text multimodal detangling. In Proceedings of the 2025 International Conference on Multimedia Retrieval, pages 145–154, 2025.

  14. [14]

    MotionLCM: Real-time controllable motion generation via latent consistency model

    Wenxun Dai, Ling-Hao Chen, Jingbo Wang, Jinpeng Liu, Bo Dai, and Yansong Tang. MotionLCM: Real-time controllable motion generation via latent consistency model. In European Conference on Computer Vision, pages 390–408. Springer.

  15. [15]

    Speech-driven facial animation using cascaded GANs for learning of motion and texture

    Dipanjan Das, Sandika Biswas, Sanjana Sinha, and Brojeshwar Bhowmick. Speech-driven facial animation using cascaded GANs for learning of motion and texture. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX, pages 408–424. Springer, 2020.

  16. [16]

    An argument for basic emotions

    Paul Ekman. An argument for basic emotions. Cognition & Emotion, 6(3-4):169–200, 1992.

  17. [17]

    SPACE: Speech-driven portrait animation with controllable expression

    Gururani et al. SPACE: Speech-driven portrait animation with controllable expression. In ICCV, 2023.

  18. [18]

    Qwen2.5-Omni technical report

    Xu et al. Qwen2.5-Omni technical report. arXiv, 2025.

  19. [19]

    Efficient emotional adaptation for audio-driven talking-head generation

    Yuan Gan, Zongxin Yang, Xihang Yue, Lingyun Sun, and Yi Yang. Efficient emotional adaptation for audio-driven talking-head generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22634–22645, 2023.

  20. [20]

    Emotionally enhanced talking face generation

    Sahil Goyal, Sarthak Bhagat, Shagun Uppal, Hitkul Jangra, Yi Yu, Yifang Yin, and Rajiv Ratn Shah. Emotionally enhanced talking face generation. In Proceedings of the 1st International Workshop on Multimedia Content Generation and Evaluation: New Methods and Practice, pages 81–90.

  21. [21]

    Affective conversational agents: Understanding expectations and personal influences

    Javier Hernandez, Jina Suh, Judith Amores, Kael Rowan, Gonzalo Ramos, and Mary Czerwinski. Affective conversational agents: Understanding expectations and personal influences. arXiv preprint arXiv:2310.12459, 2023.

  22. [22]

    GANs trained by a two time-scale update rule converge to a local Nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.

  23. [23]

    GPT-4o system card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.

  24. [24]

    EAMM: One-shot emotional talking face via audio-based emotion-aware motion model

    Xinya Ji, Hang Zhou, Kaisiyuan Wang, Qianyi Wu, Wayne Wu, Feng Xu, and Xun Cao. EAMM: One-shot emotional talking face via audio-based emotion-aware motion model. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–10, 2022.

  25. [25]

    FLOAT: Generative motion latent flow matching for audio-driven talking portrait

    Taekyung Ki, Dongchan Min, and Gyeongsu Chae. FLOAT: Generative motion latent flow matching for audio-driven talking portrait. arXiv preprint arXiv:2412.01064, 2024.

  26. [26]

    Expressive talking head generation with granular audio-visual control

    Borong Liang, Yan Pan, Zhizhi Guo, Hang Zhou, Zhibin Hong, Xiaoguang Han, Junyu Han, Jingtuo Liu, Errui Ding, and Jingdong Wang. Expressive talking head generation with granular audio-visual control. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3387–3396, 2022.

  27. [27]

    MVPortrait: Text-guided motion and emotion control for multi-view vivid portrait animation

    Yukang Lin, Hokit Fung, Jianjin Xu, Zeping Ren, Adela SM Lau, Guosheng Yin, and Xiu Li. MVPortrait: Text-guided motion and emotion control for multi-view vivid portrait animation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26242–26252, 2025.

  28. [28]

    MoEE: Mixture of emotion experts for audio-driven portrait animation

    Huaize Liu, Wenzhang Sun, Donglin Di, Shibo Sun, Jiahui Yang, Changqing Zou, and Hujun Bao. MoEE: Mixture of emotion experts for audio-driven portrait animation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26222–26231, 2025.

  29. [29]

    Semantic-aware implicit neural audio-driven video portrait generation

    Xian Liu, Yinghao Xu, Qianyi Wu, Hang Zhou, Wayne Wu, and Bolei Zhou. Semantic-aware implicit neural audio-driven video portrait generation. In European Conference on Computer Vision, pages 106–125. Springer, 2022.

  30. [30]

    Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings

    Reza Lotfian and Carlos Busso. Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings. IEEE Transactions on Affective Computing, PP:1–1, 2017.

  31. [31]

    StyleTalk: One-shot talking head generation with controllable speaking styles

    Yifeng Ma, Suzhen Wang, Zhipeng Hu, Changjie Fan, Tangjie Lv, Yu Ding, Zhidong Deng, and Xin Yu. StyleTalk: One-shot talking head generation with controllable speaking styles. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1896–1904, 2023.

  32. [32]

    emotion2vec: Self-supervised pre-training for speech emotion representation

    Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, Shiliang Zhang, and Xie Chen. emotion2vec: Self-supervised pre-training for speech emotion representation. Proc. ACL 2024 Findings, 2024.

  33. [33]

    MA-AVT: Modality alignment for parameter-efficient audio-visual transformers

    Tanvir Mahmud, Shentong Mo, Yapeng Tian, and Diana Marculescu. MA-AVT: Modality alignment for parameter-efficient audio-visual transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7996–8005, 2024.

  34. [34]

    Frame attention networks for facial expression recognition in videos

    Debin Meng, Xiaojiang Peng, Kai Wang, and Yu Qiao. Frame attention networks for facial expression recognition in videos. In 2019 IEEE International Conference on Image Processing (ICIP), pages 3866–3870. IEEE, 2019.

  35. [35]

    DPE: Disentanglement of pose and expression for general video portrait editing

    Youxin Pang, Yong Zhang, Weize Quan, Yanbo Fan, Xiaodong Cun, Ying Shan, and Dong-ming Yan. DPE: Disentanglement of pose and expression for general video portrait editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 427–436.

  36. [36]

    AI-generated characters for supporting personalized learning and well-being

    Pat Pataranutaporn, Valdemar Danry, Joanne Leong, Parinya Punpongsanon, Dan Novy, Pattie Maes, and Misha Sra. AI-generated characters for supporting personalized learning and well-being. Nature Machine Intelligence, 3(12):1013–1022, 2021.

  37. [37]

    EmoTalk: Speech-driven emotional disentanglement for 3D face animation

    Ziqiao Peng, Haoyu Wu, Zhenbo Song, Hao Xu, Xiangyu Zhu, Jun He, Hongyan Liu, and Zhaoxin Fan. EmoTalk: Speech-driven emotional disentanglement for 3D face animation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20687–20697, 2023.

  38. [38]

    MELD: A multimodal multi-party dataset for emotion recognition in conversations

    Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. MELD: A multimodal multi-party dataset for emotion recognition in conversations. arXiv preprint arXiv:1810.02508, 2018.

  39. [39]

    A lip sync expert is all you need for speech to lip generation in the wild

    KR Prajwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, and CV Jawahar. A lip sync expert is all you need for speech to lip generation in the wild. In Proceedings of the 28th ACM International Conference on Multimedia, pages 484–492, 2020.

  40. [40]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  41. [41]

    Empathy in virtual agents: How emotional expressions can influence user perception

    Sebastian Rings, Susanne Schmidt, Julia Janßen, Nale Lehmann-Willenbrock, Frank Steinicke, S Hasegawa, N Sakata, and V Sundstedt. Empathy in virtual agents: How emotional expressions can influence user perception. ICAT-EGVE, 2024.

  42. [42]

    Empathetic conversational agents: Utilizing neural and physiological signals for enhanced empathetic interactions

    Nastaran Saffaryazdi, Tamil Selvan Gunasekaran, Kate Loveys, Elizabeth Broadbent, and Mark Billinghurst. Empathetic conversational agents: Utilizing neural and physiological signals for enhanced empathetic interactions. International Journal of Human–Computer Interaction, pages 1–25, 2025.

  43. [43]

    Learning dynamic facial radiance fields for few-shot talking head synthesis

    Shuai Shen, Wanhua Li, Zheng Zhu, Yueqi Duan, Jie Zhou, and Jiwen Lu. Learning dynamic facial radiance fields for few-shot talking head synthesis. In European Conference on Computer Vision, pages 666–682. Springer, 2022.

  44. [44]

    DiffTalk: Crafting diffusion models for generalized audio-driven portraits animation

    Shuai Shen, Wenliang Zhao, Zibin Meng, Wanhua Li, Zheng Zhu, Jie Zhou, and Jiwen Lu. DiffTalk: Crafting diffusion models for generalized audio-driven portraits animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1982–1991, 2023.

  45. [45]

    EmoTalker: Audio driven emotion aware talking head generation

    Xiaoqian Shen, Faizan Farooq Khan, and Mohamed Elhoseiny. EmoTalker: Audio driven emotion aware talking head generation. In Proceedings of the Asian Conference on Computer Vision, pages 1900–1917, 2024.

  46. [46]

    EmoHead: Emotional talking head via manipulating semantic expression parameters

    Xuli Shen, Hua Cai, Dingding Yu, Weilin Shen, Qing Xu, and Xiangyang Xue. EmoHead: Emotional talking head via manipulating semantic expression parameters. arXiv preprint arXiv:2503.19416, 2025.

  47. [47]

    Continuously controllable facial expression editing in talking face videos

    Zhiyao Sun, Yu-Hui Wen, Tian Lv, Yanan Sun, Ziyang Zhang, Yaoyuan Wang, and Yong-Jin Liu. Continuously controllable facial expression editing in talking face videos. IEEE Transactions on Affective Computing, 15(3):1400–1413, 2023.

  48. [48]

    FG-EmoTalk: Talking head video generation with fine-grained controllable facial expressions

    Zhaoxu Sun, Yuze Xuan, Fang Liu, and Yang Xiang. FG-EmoTalk: Talking head video generation with fine-grained controllable facial expressions. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5043–5051.

  49. [49]

    EMMN: Emotional motion memory network for audio-driven emotional talking face generation

    Shuai Tan, Bin Ji, and Ye Pan. EMMN: Emotional motion memory network for audio-driven emotional talking face generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22146–22156, 2023.

  50. [50]

    EDTalk: Efficient disentanglement for emotional talking head synthesis

    Shuai Tan, Bin Ji, Mengxiao Bi, and Ye Pan. EDTalk: Efficient disentanglement for emotional talking head synthesis. In European Conference on Computer Vision, pages 398–. Springer, 2024.

  52. [52]

    Style2Talker: High-resolution talking head generation with emotion style and art style

    Shuai Tan, Bin Ji, and Ye Pan. Style2Talker: High-resolution talking head generation with emotion style and art style. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5079–5087, 2024.

  53. [53]

    Gemini: A family of highly capable multimodal models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

  54. [54]

    Neural voice puppetry: Audio-driven facial reenactment

    Justus Thies, Mohamed Elgharib, Ayush Tewari, Christian Theobalt, and Matthias Nießner. Neural voice puppetry: Audio-driven facial reenactment. In European Conference on Computer Vision, pages 716–731. Springer, 2020.

  55. [55]

    EMO: Emote portrait alive - generating expressive portrait videos with Audio2Video diffusion model under weak conditions

    Linrui Tian, Qi Wang, Bang Zhang, and Liefeng Bo. EMO: Emote portrait alive - generating expressive portrait videos with Audio2Video diffusion model under weak conditions. In European Conference on Computer Vision, pages 244–260. Springer, 2024.

  56. [56]

    Towards accurate generative models of video: A new metric & challenges

    Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.

  57. [57]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

  58. [58]

    Progressive disentangled representation learning for fine-grained controllable talking head synthesis

    Duomin Wang, Yu Deng, Zixin Yin, Heung-Yeung Shum, and Baoyuan Wang. Progressive disentangled representation learning for fine-grained controllable talking head synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17979–17989, 2023.

  59. [59]

    EAT-Face: Emotion-controllable audio-driven talking face generation via diffusion model

    Haodi Wang, Xiaojun Jia, and Xiaochun Cao. EAT-Face: Emotion-controllable audio-driven talking face generation via diffusion model. In 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG), pages 1–10. IEEE, 2024.

  60. [60]

    EmotiveTalk: Expressive talking head generation through audio information decoupling and emotional video diffusion

    Haotian Wang, Yuzhe Weng, Yueyan Li, Zilu Guo, Jun Du, Shutong Niu, Jiefeng Ma, Shan He, Xiaoyan Wu, Qiming Hu, et al. EmotiveTalk: Expressive talking head generation through audio information decoupling and emotional video diffusion. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26212–26221, 2025.

  61. [61]

    LipFormer: High-fidelity and generalizable talking face generation with a pre-learned facial codebook

    Jiayu Wang, Kang Zhao, Shiwei Zhang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. LipFormer: High-fidelity and generalizable talking face generation with a pre-learned facial codebook. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13844–13853, 2023.

  62. [62]

    MEAD: A large-scale audio-visual dataset for emotional talking-face generation

    Kaisiyuan Wang, Qianyi Wu, Linsen Song, Zhuoqian Yang, Wayne Wu, Chen Qian, Ran He, Yu Qiao, and Chen Change Loy. MEAD: A large-scale audio-visual dataset for emotional talking-face generation. In European Conference on Computer Vision, pages 700–717. Springer, 2020.

  63. [63]

    Audio2Head: Audio-driven one-shot talking-head generation with natural head motion

    S Wang, L Li, Y Ding, C Fan, and X Yu. Audio2Head: Audio-driven one-shot talking-head generation with natural head motion. In International Joint Conference on Artificial Intelligence. IJCAI, 2021.

  64. [64]

    One-shot talking face generation from single-speaker audio-visual correlation learning

    Suzhen Wang, Lincheng Li, Yu Ding, and Xin Yu. One-shot talking face generation from single-speaker audio-visual correlation learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2531–2539, 2022.

  65. [65]

    VASA-1: Lifelike audio-driven talking faces generated in real time

    Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, and Baining Guo. VASA-1: Lifelike audio-driven talking faces generated in real time. Advances in Neural Information Processing Systems, 37:660–684, 2024.

  66. [66]

    Face2Faceρ: Real-time high-resolution one-shot face reenactment

    Kewei Yang, Kang Chen, Daoliang Guo, Song-Hai Zhang, Yuan-Chen Guo, and Weidong Zhang. Face2Faceρ: Real-time high-resolution one-shot face reenactment. In European Conference on Computer Vision, pages 55–71. Springer, 2022.

  67. [67]

    StyleHEAT: One-shot high-resolution editable talking face generation via pre-trained StyleGAN

    Fei Yin, Yong Zhang, Xiaodong Cun, Mingdeng Cao, Yanbo Fan, Xuan Wang, Qingyan Bai, Baoyuan Wu, Jue Wang, and Yujiu Yang. StyleHEAT: One-shot high-resolution editable talking face generation via pre-trained StyleGAN. In European Conference on Computer Vision, pages 85–101. Springer, 2022.

  68. [68]

    Talking head generation with probabilistic audio-to-visual diffusion priors

    Zhentao Yu, Zixin Yin, Deyu Zhou, Duomin Wang, Finn Wong, and Baoyuan Wang. Talking head generation with probabilistic audio-to-visual diffusion priors. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7645–7655, 2023.

  69. [69]

    Few-shot adversarial learning of realistic neural talking head models

    Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, and Victor Lempitsky. Few-shot adversarial learning of realistic neural talking head models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9459–9468, 2019.

  70. [70]

    Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset

    Zhimeng Zhang, Lincheng Li, Yu Ding, and Changjie Fan. Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3661–3670, 2021.

  71. [71]

    Identity-preserving talking face generation with landmark and appearance priors

    Weizhi Zhong, Chaowei Fang, Yinqi Cai, Pengxu Wei, Gangming Zhao, Liang Lin, and Guanbin Li. Identity-preserving talking face generation with landmark and appearance priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2023.

  72. [72]

    Talking face generation by adversarially disentangled audio-visual representation

    Hang Zhou, Yu Liu, Ziwei Liu, Ping Luo, and Xiaogang Wang. Talking face generation by adversarially disentangled audio-visual representation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 9299–9306, 2019.

  73. [73]

    Pose-controllable talking face generation by implicitly modularized audio-visual representation

    Hang Zhou, Yasheng Sun, Wayne Wu, Chen Change Loy, Xiaogang Wang, and Ziwei Liu. Pose-controllable talking face generation by implicitly modularized audio-visual representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4176–4186.

  74. [74]

    MakeItTalk: Speaker-aware talking-head animation

    Yang Zhou, Xintong Han, Eli Shechtman, Jose Echevarria, Evangelos Kalogerakis, and Dingzeyu Li. MakeItTalk: Speaker-aware talking-head animation. ACM Transactions on Graphics (TOG), 39(6):1–15, 2020.