pith. machine review for the scientific record.

arxiv: 2604.25255 · v1 · submitted 2026-04-28 · 💻 cs.CV

Recognition: unknown

Personalized Cross-Modal Emotional Correlation Learning for Speech-Preserving Facial Expression Manipulation


Pith reviewed 2026-05-07 16:57 UTC · model grok-4.3

classification 💻 cs.CV
keywords speech-preserving facial expression manipulation · personalized prompts · cross-modal emotional correlation · visual-language models · feature differencing · VLM supervision · facial expression editing

The pith

PCMECL refines VLM supervision with personalized prompts and feature differencing for speech-preserving facial expression manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Speech-preserving facial expression manipulation lacks paired data showing the same person with identical speech but different expressions, which blocks direct training. Visual-language models can supply aligned visual and semantic features as supervision, yet they rely on generic prompts that ignore individual variations and suffer from mismatched feature distributions between modalities. PCMECL conditions prompts on each person's visual information to create personalized versions that capture finer expressive differences, then applies feature differencing to align the modalities by matching how visual features change to how semantic features change. A reader would care because this supplies usable training signals without hard-to-collect paired examples, letting existing models produce more accurate expression edits while leaving mouth movements tied to speech unchanged.
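As a concrete reading of the differencing mechanism, here is a minimal sketch assuming frozen CLIP-style encoders; the function and variable names are illustrative, not the authors' implementation.

    import torch
    import torch.nn.functional as F

    def differencing_loss(image_encoder, text_encoder,
                          src_img, gen_img, src_tokens, tgt_tokens):
        # Text embeddings are fixed supervision; no gradients needed there.
        with torch.no_grad():
            t_s = F.normalize(text_encoder(src_tokens), dim=-1)
            t_t = F.normalize(text_encoder(tgt_tokens), dim=-1)
        # The image encoder is frozen, but gradients must flow back to
        # the generator through gen_img, so it stays differentiable.
        i_s = F.normalize(image_encoder(src_img), dim=-1)
        i_g = F.normalize(image_encoder(gen_img), dim=-1)
        # Supervise the *change*: the visual delta should point the same
        # way as the semantic delta, rather than comparing absolute
        # features across the modality gap.
        d_img = F.normalize(i_g - i_s, dim=-1)
        d_txt = F.normalize(t_t - t_s, dim=-1)
        return 1.0 - (d_img * d_txt).sum(dim=-1).mean()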

Core claim

The paper claims that standard VLMs use single generic prompts per emotion, failing to capture expressive variations among individuals, and that inherent discrepancies between visual and semantic feature distributions limit their value as supervision. PCMECL addresses both issues by conditioning on individual visual information to learn personalized prompts that establish more fine-grained visual-semantic correlations, and by employing feature differencing to correlate the modalities through matching the change in visual features to the change in semantic features, thereby providing more precisely aligned supervision. As a plug-and-play module, PCMECL integrates into existing SPFEM models, and extensive experiments across various datasets are reported as demonstrating its superior efficacy.

What carries the argument

Personalized Cross-Modal Emotional Correlation Learning (PCMECL) that conditions VLMs on individual visual information to generate personalized prompts and uses feature differencing to match changes across visual and semantic features for aligned supervision.
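The personalized-prompt half can be pictured as a small meta-network that shifts shared learnable context tokens using a reference embedding of the individual, in the spirit of conditional prompt learning; this sketch is an assumption about the mechanism, and the module names and dimensions are ours, not the paper's.

    import torch
    import torch.nn as nn

    class PersonalizedPrompt(nn.Module):
        def __init__(self, n_ctx=4, ctx_dim=512, vis_dim=512):
            super().__init__()
            # Shared learnable context tokens for one emotion prompt.
            self.ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)
            # Meta-network: maps a reference face embedding to a
            # per-person shift applied to every context token.
            self.meta = nn.Sequential(
                nn.Linear(vis_dim, vis_dim // 4),
                nn.ReLU(inplace=True),
                nn.Linear(vis_dim // 4, ctx_dim),
            )

        def forward(self, ref_embedding):
            # ref_embedding: (B, vis_dim) features of a neutral reference
            # image of the individual; returns (B, n_ctx, ctx_dim) prompt
            # tokens personalized to that individual.
            shift = self.meta(ref_embedding).unsqueeze(1)  # (B, 1, ctx_dim)
            return self.ctx.unsqueeze(0) + shift           # broadcast add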

Load-bearing premise

That conditioning VLMs on individual visual information reliably produces personalized prompts capturing personal expressive variations, and that feature differencing bridges visual-semantic distribution gaps without introducing new artifacts.

What would settle it

The claim would fail if integrating PCMECL into a baseline SPFEM model produced no measurable gain in expression accuracy or identity-preservation metrics, or introduced visible speech distortions, on a held-out dataset containing multiple individuals.
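A sketch of how that test could be scored, with expr_classifier and id_embedder standing in for pretrained models (e.g., an expression recognizer and a face-identity embedder such as ArcFace from the reference graph); both names are hypothetical.

    import torch.nn.functional as F

    def evaluate(edits, sources, target_labels, expr_classifier, id_embedder):
        # Expression accuracy: does the edited frame carry the target emotion?
        expr_acc = (expr_classifier(edits).argmax(-1) == target_labels).float().mean()
        # Identity preservation: cosine similarity to the source face embedding.
        id_sim = F.cosine_similarity(id_embedder(edits),
                                     id_embedder(sources), dim=-1).mean()
        return expr_acc.item(), id_sim.item()

A lip-sync score (e.g., a SyncNet-style confidence) would cover the speech-distortion half of the test.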

Figures

Figures reproduced from arXiv: 2604.25255 by Chunmei Qing, Feng Gao, Jianman Lin, Liang Lin, Tianshui Chen, Yujie Zhu, Zhijing Yang.

Figure 1
Figure 1: Motivating examples showing baseline (NED) failures and our improvements. From left to right: source identity image, reference emotion image, the baseline’s result, and our result. Red boxes highlight key areas for comparison.
Figure 2
Figure 2: Challenges in applying pre-trained VLMs, visualized via t-SNE of CLIP features from MEAD. (a) The inherent modality gap between image (circles) and text (triangles) feature distributions, which remain structurally separated. (b) The limitation of fixed prompts: current VLMs use a “one-size-fits-all” approach, mapping diverse visual expressions of an emotion to a single cluster defined by a fixed text prompt…
Figure 3
Figure 3: Overview of the PCMECL supervisory framework. During SPFEM training, the frozen PCMECL module computes a supervisory loss from the source (I_s) and target (I_t) images. Its PEPL module processes each image via two parallel branches: a visual branch extracts an emotion-centric visual embedding (I^f), while a textual branch uses a neutral reference image I_r to create a personalized text embedding (T^f). The VT…
Figure 4
Figure 4: Overview of the PEPL module’s contrastive pre-training. The learnable Emotion Projector extracts an emotion-centric visual embedding I^f_s from an emotional image I_s. Concurrently, the learnable Visual Guider extracts a personalized embedding from a reference image I_r, which is then concatenated with a corresponding text prompt to form the positive pair T^f_s, and with a non-corresponding prompt for the …
Figure 5
Figure 5: A t-SNE visualization of the learned feature difference space, using features extracted from the MEAD dataset. Each point is a difference vector from a “neutral” source: blue circles (I^f_{s→t}) are visual differences; green triangles (T^f_{s→t}) are their corresponding text differences; and orange triangles (T^f_{s→k}) are non-corresponding text differences (e.g., “neutral”→“angry” text vs. “neutral”→“happy”…
Figure 6
Figure 6: Qualitative comparisons of NED with and without PCMECL supervision on the MEAD dataset.
Figure 8
Figure 8: Qualitative comparisons of SSERD with and without PCMECL supervision on the MEAD dataset.
Figure 9
Figure 9: Convergence curve of the PEPL module training…
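Figure 4’s pre-training reads as a standard image-text contrastive objective over personalized embeddings; a minimal sketch under that reading (the actual PEPL loss may differ in its pairing and temperature details):

    import torch.nn.functional as F

    def pepl_contrastive_loss(vis_emb, txt_embs, labels, tau=0.07):
        # vis_emb: (B, D) emotion-centric visual embeddings.
        # txt_embs: (C, D) personalized text embeddings, one per emotion
        # class; labels: (B,) index of each sample's matching class.
        vis = F.normalize(vis_emb, dim=-1)
        txt = F.normalize(txt_embs, dim=-1)
        logits = vis @ txt.t() / tau  # (B, C) cosine similarities
        # Corresponding prompts act as positives, the rest as negatives.
        return F.cross_entropy(logits, labels)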
Original abstract

Speech-preserving facial expression manipulation (SPFEM) aims to enhance human expressiveness without altering mouth movements tied to the original speech. A primary challenge in this domain is the scarcity of paired data, namely aligned frames of the same individual with identical speech but different expressions, which impedes direct supervision for emotional manipulation. While current Visual-Language Models (VLMs) can extract aligned visual and semantic features, making them a promising source of supervision, their direct application is limited. To this end, we propose a Personalized Cross-Modal Emotional Correlation Learning (PCMECL) algorithm that refines VLM-based supervision through two major improvements. First, standard VLMs rely on a single generic prompt for each emotion, failing to capture expressive variations among individuals. PCMECL addresses this limitation by conditioning on individual visual information to learn personalized prompts, thereby establishing more fine-grained visual-semantic correlations. Second, even with personalization, inherent discrepancies persist between the visual and semantic feature distributions. To bridge this modality gap, PCMECL employs feature differencing to correlate the modalities, providing more precisely aligned supervision by matching the change in visual features to the change in semantic features. As a plug-and-play module, PCMECL can be seamlessly integrated into existing SPFEM models. Extensive experiments across various datasets demonstrate the superior efficacy of our algorithm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Personalized Cross-Modal Emotional Correlation Learning (PCMECL) as a plug-and-play module for speech-preserving facial expression manipulation (SPFEM). It refines VLM-based supervision by conditioning prompts on per-individual visual information to capture personalized expressive variations and by applying feature differencing to align changes in visual and semantic features, thereby addressing the scarcity of paired frames with identical speech but varying expressions. The paper claims this yields more precisely aligned cross-modal supervision and superior performance when integrated into existing SPFEM models, as demonstrated by extensive experiments across datasets.

Significance. If the central claims hold, PCMECL could meaningfully advance multimodal facial animation by enabling more individualized emotional editing while preserving speech synchronization, leveraging existing VLMs without requiring new paired datasets. The plug-and-play design would facilitate adoption in downstream applications such as video synthesis and virtual avatars.

major comments (2)
  1. [Proposed Method] Feature differencing step: The claim that subtracting features to match visual deltas to semantic deltas produces precisely aligned supervision is load-bearing for the improved correlation, yet the manuscript provides no direct verification against ground-truth paired deltas (which the introduction notes are scarce) and no derivation showing why the deltas remain commensurate under non-linear VLM feature spaces or individual biases; without additional safeguards, this risks mapping to incorrect expression directions. A minimal empirical probe is sketched after this report.
  2. [Experiments] The abstract asserts superior efficacy from extensive experiments, but without reported quantitative metrics (e.g., FID, expression accuracy, or user-study scores), ablation results isolating the personalization and differencing components, or error analysis, the magnitude of improvement over baselines cannot be assessed and the central empirical claim remains unverified.
minor comments (1)
  1. [Abstract] Including one or two key quantitative highlights or dataset names would strengthen the summary of results without lengthening it.
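Major comment 1 asks whether visual and semantic deltas are actually commensurate. One minimal empirical probe, assuming frozen CLIP-style encoders and precomputed delta vectors (names are ours): matched delta pairs should score a clearly higher cosine similarity than mismatched ones.

    import torch.nn.functional as F

    def delta_alignment_gap(d_img, d_txt_matched, d_txt_mismatched):
        # d_img: (N, D) visual deltas, e.g. neutral -> angry image pairs.
        # d_txt_matched: (N, D) text deltas for the same emotion change.
        # d_txt_mismatched: (N, D) text deltas for a different change.
        d_img = F.normalize(d_img, dim=-1)
        pos = (d_img * F.normalize(d_txt_matched, dim=-1)).sum(-1)
        neg = (d_img * F.normalize(d_txt_mismatched, dim=-1)).sum(-1)
        # A clearly positive gap supports the alignment assumption; a gap
        # near zero would say the deltas are not commensurate.
        return (pos - neg).mean().item()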

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below with clarifications and indicate where revisions will be made to strengthen the presentation.

Point-by-point responses
  1. Referee: [Proposed Method] Feature differencing step: The claim that subtracting features to match visual deltas to semantic deltas produces precisely aligned supervision is load-bearing for the improved correlation, yet the manuscript provides no direct verification against ground-truth paired deltas (which the introduction notes are scarce) and no derivation showing why the deltas remain commensurate under non-linear VLM feature spaces or individual biases; without additional safeguards, this risks mapping to incorrect expression directions.

    Authors: We agree that direct verification against ground-truth paired deltas is not possible, as the introduction explicitly notes the scarcity of such data. The feature differencing step is motivated by the principle that relative changes (deltas) in visual and semantic features are more likely to be commensurate than absolute features, since non-linearities and per-individual biases in VLMs primarily affect baseline representations rather than expression-induced variations (a simple additive-bias formalization of this argument is sketched after these responses). This design choice is validated indirectly through consistent performance gains when PCMECL is integrated as a plug-and-play module into existing SPFEM models. In the revised manuscript, we will expand the method section with additional motivation for the delta alignment assumption, a discussion of its limitations, and safeguards against potential misalignment. revision: partial

  2. Referee: [Experiments] The abstract asserts superior efficacy from extensive experiments, but without reported quantitative metrics (e.g., FID, expression accuracy, or user-study scores), ablation results isolating the personalization and differencing components, or error analysis, the magnitude of improvement over baselines cannot be assessed and the central empirical claim remains unverified.

    Authors: The experiments section reports quantitative results across multiple datasets using metrics including FID and expression accuracy, with comparisons to baseline SPFEM models. However, we acknowledge that dedicated ablations isolating the personalization and feature differencing components, along with error analysis, would provide clearer evidence of their individual contributions and the overall improvement magnitude. We will revise the experiments section to include these ablations and error analysis. revision: yes
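One way to make the additive-bias argument in the first response explicit (our formalization under an assumed decomposition, not one stated in the paper): write the visual embedding of person p with expression e as v = g(e) + b_p, where b_p is a per-person bias. Then

    % Assumed additive decomposition; the bias cancels in the delta:
    \[
      v_t - v_s = \bigl(g(e_t) + b_p\bigr) - \bigl(g(e_s) + b_p\bigr)
                = g(e_t) - g(e_s),
    \]

so the delta no longer depends on b_p. The referee's residual worry is exactly the case where this decomposition fails, i.e., where identity and expression interact non-additively.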

Circularity Check

0 steps flagged

No significant circularity in PCMECL method description

Full rationale

The paper presents PCMECL as a plug-and-play refinement to VLM-based supervision via two added steps: conditioning prompts on per-individual visual features and applying feature differencing to align modality deltas. No equations, loss functions, or derivation chains are shown that reduce the claimed correlations or supervision signals to fitted parameters, self-definitions, or prior self-citations by construction. The approach is described as building on external VLMs with independent methodological additions whose efficacy is asserted via experiments rather than tautological equivalence to inputs. This is the common case of a non-circular algorithmic proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

An abstract-only review surfaces no explicit free parameters, axioms, or invented entities; the approach relies on pre-existing VLMs and standard feature operations without introducing new postulated quantities.

pith-pipeline@v0.9.0 · 5552 in / 1190 out tokens · 36720 ms · 2026-05-07T16:57:31.224996+00:00 · methodology


Reference graph

Works this paper leans on

59 extracted references · 8 canonical work pages · 1 internal anchor

  1. [1]

    Head2head++: Deep facial attributes re-targeting,

    M. C. Doukas, M. R. Koujan, V. Sharmanska, A. Roussos, and S. Zafeiriou, “Head2head++: Deep facial attributes re-targeting,” IEEE Transactions on Biometrics, Behavior, and Identity Science, vol. 3, no. 1, pp. 31–43, 2021

  2. [2]

    Icface: Interpretable and controllable face reenactment using gans,

    S. Tripathy, J. Kannala, and E. Rahtu, “Icface: Interpretable and controllable face reenactment using gans,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, pp. 3385–3394

  3. [3]

    3d3m: 3d modulated morphable model for monocular face reconstruction,

    Y. Li, Q. Hao, J. Hu, X. Pan, Z. Li, and Z. Cui, “3d3m: 3d modulated morphable model for monocular face reconstruction,” IEEE Transactions on Multimedia, vol. 25, pp. 6642–6652, 2022

  4. [4]

    Expression-aware face reconstruction via a dual-stream network,

    X. Chai, J. Chen, C. Liang, D. Xu, and C.-W. Lin, “Expression-aware face reconstruction via a dual-stream network,” IEEE Transactions on Multimedia, vol. 23, pp. 2998–3012, 2021

  5. [5]

    Neural emotion director: Speech-preserving semantic control of facial expressions in “in-the-wild” videos,

    F. P. Papantoniou, P. P. Filntisis, P. Maragos, and A. Roussos, “Neural emotion director: Speech-preserving semantic control of facial expressions in “in-the-wild” videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18781–18790

  6. [6]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763

  7. [7]

    Flamingo: a visual language model for few-shot learning,

    J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a visual language model for few-shot learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 23716–23736, 2022

  8. [8]

    Emotalker: Emotionally editable talking face generation via diffusion model,

    B. Zhang, X. Zhang, N. Cheng, J. Yu, J. Xiao, and J. Wang, “Emotalker: Emotionally editable talking face generation via diffusion model,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 8276–8280

  9. [9]

    Speech driven talking face generation from a single image and an emotion condition,

    S. E. Eskimez, Y. Zhang, and Z. Duan, “Speech driven talking face generation from a single image and an emotion condition,” IEEE Transactions on Multimedia, vol. 24, pp. 3480–3490, 2021

  10. [10]

    Neural style-preserving visual dubbing,

    H. Kim, M. Elgharib, M. Zollhöfer, H.-P. Seidel, T. Beeler, C. Richardt, and C. Theobalt, “Neural style-preserving visual dubbing,” ACM Transactions on Graphics (TOG), vol. 38, no. 6, pp. 1–13, 2019

  11. [11]

    Pirenderer: Controllable portrait image generation via semantic neural rendering,

    Y. Ren, G. Li, Y. Chen, T. H. Li, and S. Liu, “Pirenderer: Controllable portrait image generation via semantic neural rendering,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 13759–13768

  12. [12]

    Latent image animator: Learning to animate images via latent space navigation,

    Y. Wang, D. Yang, F. Bremond, and A. Dantcheva, “Latent image animator: Learning to animate images via latent space navigation,” arXiv preprint arXiv:2203.09043, 2022

  13. [13]

    Progressive transformer machine for natural character reenactment,

    Y. Xu, Z. Yang, T. Chen, K. Li, and C. Qing, “Progressive transformer machine for natural character reenactment,” ACM Transactions on Multimedia Computing, Communications and Applications, vol. 19, no. 2s, pp. 1–22, 2023

  14. [14]

    Face reenactment based on motion field representation,

    S. Zheng, J. Chen, Z. Yang, T. Chen, and Y. Lu, “Face reenactment based on motion field representation,” in International Conference on Brain Inspired Cognitive Systems. Springer Nature Singapore, 2023, pp. 354–364

  15. [15]

    Exploiting temporal audio-visual correlation embedding for audio-driven one-shot talking head animation,

    Z. Xu, T. Chen, Z. Yang, S. Peng, K. Wang, and L. Lin, “Exploiting temporal audio-visual correlation embedding for audio-driven one-shot talking head animation,” arXiv preprint arXiv:2504.05746, 2025

  16. [16]

    Monocular and generalizable gaussian talking head animation,

    S. Gong, H. Li, J. Tang, D. Hu, S. Huang, H. Chen, T. Chen, and Z. Liu, “Monocular and generalizable gaussian talking head animation,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 5523–5534

  17. [17]

    3d face reconstruction from a single image assisted by 2d face images in the wild,

    X. Tu, J. Zhao, M. Xie, Z. Jiang, A. Balamurugan, Y. Luo, Y. Zhao, L. He, Z. Ma, and J. Feng, “3d face reconstruction from a single image assisted by 2d face images in the wild,” IEEE Transactions on Multimedia, vol. 23, pp. 1160–1172, 2020

  18. [18]

    Self-supervised learning of detailed 3d face reconstruction,

    Y. Chen, F. Wu, Z. Wang, Y. Song, Y. Ling, and L. Bao, “Self-supervised learning of detailed 3d face reconstruction,” IEEE Transactions on Image Processing, vol. 29, pp. 8696–8705, 2020

  19. [19]

    Efficient emotional adaptation for audio-driven talking-head generation,

    Y. Gan, Z. Yang, X. Yue, L. Sun, and Y. Yang, “Efficient emotional adaptation for audio-driven talking-head generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22634–22645

  20. [20]

    First order motion model for image animation,

    A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, and N. Sebe, “First order motion model for image animation,” Advances in Neural Information Processing Systems, vol. 32, 2019

  21. [21]

    Learning adaptive spatial coherent correlations for speech-preserving facial expression manipulation,

    T. Chen, J. Lin, Z. Yang, C. Qing, and L. Lin, “Learning adaptive spatial coherent correlations for speech-preserving facial expression manipulation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 7267–7276

  22. [22]

    Contrastive decoupled representation learning and regularization for speech-preserving facial expression manipulation,

    T. Chen, J. Lin, Z. Yang, C. Qing, Y. Shi, and L. Lin, “Contrastive decoupled representation learning and regularization for speech-preserving facial expression manipulation,” International Journal of Computer Vision, vol. 133, no. 7, pp. 3822–3838, 2025

  23. [23]

    Ganimation: Anatomically-aware facial animation from a single image,

    A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. Moreno-Noguer, “Ganimation: Anatomically-aware facial animation from a single image,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 818–833

  24. [24]

    Facial action coding system,

    P. Ekman and W. V. Friesen, “Facial action coding system,” Environmental Psychology & Nonverbal Behavior, 1978

  25. [25]

    Styleheat: One-shot high-resolution editable talking face generation via pre-trained stylegan,

    F. Yin, Y. Zhang, X. Cun, M. Cao, Y. Fan, X. Wang, Q. Bai, B. Wu, J. Wang, and Y. Yang, “Styleheat: One-shot high-resolution editable talking face generation via pre-trained stylegan,” in European Conference on Computer Vision. Springer, 2022, pp. 85–101

  26. [26]

    Stargan: Unified generative adversarial networks for multi-domain image-to-image translation,

    Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “Stargan: Unified generative adversarial networks for multi-domain image-to-image translation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8789–8797

  27. [27]

    Language-guided face animation by recurrent stylegan-based generator,

    T. Hang, H. Yang, B. Liu, J. Fu, X. Geng, and B. Guo, “Language-guided face animation by recurrent stylegan-based generator,” IEEE Transactions on Multimedia, vol. 25, pp. 9216–9227, 2023

  28. [28]

    Self-supervised face image manipulation by conditioning gan on face decomposition,

    S. Karaoğlu, T. Gevers et al., “Self-supervised face image manipulation by conditioning gan on face decomposition,” IEEE Transactions on Multimedia, vol. 24, pp. 377–385, 2021

  29. [29]

    Stylerig: Rigging stylegan for 3d control over portrait images,

    A. Tewari, M. Elgharib, G. Bharaj, F. Bernard, H.-P. Seidel, P. Pérez, M. Zollhöfer, and C. Theobalt, “Stylerig: Rigging stylegan for 3d control over portrait images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6142–6151

  30. [30]

    Continuously controllable facial expression editing in talking face videos,

    Z. Sun, Y.-H. Wen, T. Lv, Y. Sun, Z. Zhang, Y. Wang, and Y.-J. Liu, “Continuously controllable facial expression editing in talking face videos,” IEEE Transactions on Affective Computing, 2023

  31. [31]

    Self-supervised emotion representation disentanglement for speech-preserving facial expression manipulation,

    Z. Xu, T. Chen, Z. Yang, C. Qing, Y. Shi, and L. Lin, “Self-supervised emotion representation disentanglement for speech-preserving facial expression manipulation,” in ACM Multimedia 2024, 2024

  32. [32]

    Generative adversarial nets,

    I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in Neural Information Processing Systems, vol. 27, 2014

  33. [33]

    A style-based generator architecture for generative adversarial networks,

    T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4401–4410

  34. [34]

    Invertable frowns: Video-to-video facial emotion translation,

    I. Magnusson, A. Sankaranarayanan, and A. Lippman, “Invertable frowns: Video-to-video facial emotion translation,” in Proceedings of the 1st Workshop on Synthetic Multimedia-Audiovisual Deepfake Generation and Detection, 2021, pp. 25–33

  35. [35]

    Learning spatial-temporal coherent correlations for speech-preserving facial expression manipulation,

    T. Chen, J. Lin, Z. Yang, C. Qing, G. Wang, and L. Lin, “Learning spatial-temporal coherent correlations for speech-preserving facial expression manipulation,” arXiv preprint arXiv:2604.20226, 2026

  36. [36]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,

    J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in International Conference on Machine Learning. PMLR, 2022, pp. 12888–12900

  37. [37]

    Masked vision and language modeling for multi-modal representation learning,

    G. Kwon, Z. Cai, A. Ravichandran, E. Bas, R. Bhotika, and S. Soatto, “Masked vision and language modeling for multi-modal representation learning,” arXiv preprint arXiv:2208.02131, 2022

  38. [38]

    Conditional prompt learning for vision-language models,

    K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Conditional prompt learning for vision-language models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16816–16825

  39. [39]

    High-fidelity 3d face generation from natural language descriptions,

    M. Wu, H. Zhu, L. Huang, Y. Zhuang, Y. Lu, and X. Cao, “High-fidelity 3d face generation from natural language descriptions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 4521–4530

  40. [40]

    Prompting visual-language models for dynamic facial expression recognition,

    Z. Zhao and I. Patras, “Prompting visual-language models for dynamic facial expression recognition,” arXiv preprint arXiv:2308.13382, 2023

  41. [41]

    Vllms provide better context for emotion understanding through common sense reasoning,

    A. Xenos, N. M. Foteinopoulou, I. Ntinou, I. Patras, and G. Tzimiropoulos, “Vllms provide better context for emotion understanding through common sense reasoning,” arXiv preprint arXiv:2404.07078, 2024

  42. [42]

    Styleclip: Text-driven manipulation of stylegan imagery,

    O. Patashnik, Z. Wu, E. Shechtman, D. Cohen-Or, and D. Lischinski, “Styleclip: Text-driven manipulation of stylegan imagery,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2085–2094

  43. [43]

    Zero-shot contrastive loss for text-guided diffusion image style transfer,

    S. Yang, H. Hwang, and J. C. Ye, “Zero-shot contrastive loss for text-guided diffusion image style transfer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22873–22882

  44. [44]

    Cliper: A unified vision-language framework for in-the-wild facial expression recognition,

    H. Li, H. Niu, Z. Zhu, and F. Zhao, “Cliper: A unified vision-language framework for in-the-wild facial expression recognition,” arXiv preprint arXiv:2303.00193, 2023

  45. [45]

    Contextual emotion recognition using large vision language models,

    Y. Etesam, Ö. N. Yalçın, C. Zhang, and A. Lim, “Contextual emotion recognition using large vision language models,” arXiv preprint arXiv:2405.08992, 2024

  46. [46]

    High-fidelity generalized emotional talking face generation with multi-modal emotion space learning,

    C. Xu, J. Zhu, J. Zhang, Y. Han, W. Chu, Y. Tai, C. Wang, Z. Xie, and Y. Liu, “High-fidelity generalized emotional talking face generation with multi-modal emotion space learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6609–6619

  47. [47]

    Stylegan-nada: Clip-guided domain adaptation of image generators,

    R. Gal, O. Patashnik, H. Maron, A. H. Bermano, G. Chechik, and D. Cohen-Or, “Stylegan-nada: Clip-guided domain adaptation of image generators,” ACM Transactions on Graphics (TOG), vol. 41, no. 4, pp. 1–13, 2022

  48. [48]

    Image-based clip-guided essence transfer,

    H. Chefer, S. Benaim, R. Paiss, and L. Wolf, “Image-based clip-guided essence transfer,” in European Conference on Computer Vision. Springer, 2022, pp. 695–711

  49. [49]

    Stargan v2: Diverse image synthesis for multiple domains,

    Y. Choi, Y. Uh, J. Yoo, and J.-W. Ha, “Stargan v2: Diverse image synthesis for multiple domains,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 8185–8194

  50. [50]

    A stochastic approximation method,

    H. Robbins and S. Monro, “A stochastic approximation method,” The Annals of Mathematical Statistics, pp. 400–407, 1951

  51. [51]

    Mead: A large-scale audio-visual dataset for emotional talking-face generation,

    K. Wang, Q. Wu, L. Song, Z. Yang, W. Wu, C. Qian, R. He, Y. Qiao, and C. C. Loy, “Mead: A large-scale audio-visual dataset for emotional talking-face generation,” in European Conference on Computer Vision. Springer, 2020, pp. 700–717

  52. [52]

    The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english,

    S. R. Livingstone and F. A. Russo, “The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english,” PLoS One, vol. 13, no. 5, p. e0196391, 2018

  53. [53]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium,

    M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” Advances in Neural Information Processing Systems, vol. 30, 2017

  54. [54]

    Arcface: Additive angular margin loss for deep face recognition,

    J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4690–4699

  55. [55]

    A lip sync expert is all you need for speech to lip generation in the wild,

    K. Prajwal, R. Mukhopadhyay, V. P. Namboodiri, and C. Jawahar, “A lip sync expert is all you need for speech to lip generation in the wild,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 484–492

  56. [56]

    Out of time: automated lip sync in the wild,

    J. S. Chung and A. Zisserman, “Out of time: automated lip sync in the wild,” in Computer Vision – ACCV 2016 Workshops: ACCV 2016 International Workshops, Taipei, Taiwan, November 20-24, 2016, Revised Selected Papers, Part II 13. Springer, 2017, pp. 251–263

  57. [57]

    Robust lightweight facial expression recognition network with label distribution training,

    Z. Zhao, Q. Liu, and F. Zhou, “Robust lightweight facial expression recognition network with label distribution training,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 4, 2021, pp. 3510–3519

  58. [58]

    Lip reading in the wild,

    J. S. Chung and A. Zisserman, “Lip reading in the wild,” in Asian Conference on Computer Vision. Springer, 2016, pp. 87–103