Personalized Cross-Modal Emotional Correlation Learning for Speech-Preserving Facial Expression Manipulation
Pith reviewed 2026-05-07 16:57 UTC · model grok-4.3
The pith
PCMECL refines VLM supervision with personalized prompts and feature differencing for speech-preserving facial expression manipulation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that standard VLMs use single generic prompts per emotion, failing to capture expressive variations among individuals, and that inherent discrepancies between visual and semantic feature distributions limit their value as supervision. PCMECL addresses both issues by conditioning on individual visual information to learn personalized prompts that establish more fine-grained visual-semantic correlations, and by employing feature differencing to correlate the modalities through matching the change in visual features to the change in semantic features, thereby providing more precisely aligned supervision. As a plug-and-play module, PCMECL integrates into existing SPFEM models, and extensive experiments across various datasets are reported as demonstrating its superior efficacy.
What carries the argument
Personalized Cross-Modal Emotional Correlation Learning (PCMECL) that conditions VLMs on individual visual information to generate personalized prompts and uses feature differencing to match changes across visual and semantic features for aligned supervision.
Load-bearing premise
That conditioning VLMs on individual visual information reliably produces personalized prompts capturing personal expressive variations and that feature differencing bridges visual-semantic distribution gaps without new artifacts.
What would settle it
The claim would be undercut if integrating PCMECL into a baseline SPFEM model produced no measurable gain in expression-accuracy or identity-preservation metrics, or introduced visible speech distortions, on a held-out dataset containing multiple individuals.
Original abstract
Speech-preserving facial expression manipulation (SPFEM) aims to enhance human expressiveness without altering mouth movements tied to the original speech. A primary challenge in this domain is the scarcity of paired data, namely aligned frames of the same individual with identical speech but different expressions, which impedes direct supervision for emotional manipulation. While current Visual-Language Models (VLMs) can extract aligned visual and semantic features, making them a promising source of supervision, their direct application is limited. To this end, we propose a Personalized Cross-Modal Emotional Correlation Learning (PCMECL) algorithm that refines VLM-based supervision through two major improvements. First, standard VLMs rely on a single generic prompt for each emotion, failing to capture expressive variations among individuals. PCMECL addresses this limitation by conditioning on individual visual information to learn personalized prompts, thereby establishing more fine-grained visual-semantic correlations. Second, even with personalization, inherent discrepancies persist between the visual and semantic feature distributions. To bridge this modality gap, PCMECL employs feature differencing to correlate the modalities, providing more precisely aligned supervision by matching the change in visual features to the change in semantic features. As a plug-and-play module, PCMECL can be seamlessly integrated into existing SPFEM models. Extensive experiments across various datasets demonstrate the superior efficacy of our algorithm.
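To make the first component concrete, here is a minimal PyTorch sketch of visual-conditioned prompt learning in the spirit of the abstract's description (and of conditional prompt learning generally). The class and parameter names (PersonalizedPromptLearner, meta_net, n_ctx) are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class PersonalizedPromptLearner(nn.Module):
    """Sketch: emotion prompts conditioned on an individual's visual features.

    Assumes a CLIP-like pipeline; all names here are illustrative, not the
    paper's actual interface.
    """

    def __init__(self, n_emotions: int, n_ctx: int = 4, ctx_dim: int = 512, vis_dim: int = 512):
        super().__init__()
        # Shared learnable context tokens, one set per emotion class.
        self.ctx = nn.Parameter(torch.randn(n_emotions, n_ctx, ctx_dim) * 0.02)
        # Small meta-network mapping an individual's visual feature to a prompt
        # offset -- this is what makes the prompt "personalized".
        self.meta_net = nn.Sequential(
            nn.Linear(vis_dim, vis_dim // 4),
            nn.ReLU(inplace=True),
            nn.Linear(vis_dim // 4, ctx_dim),
        )

    def forward(self, visual_feat: torch.Tensor) -> torch.Tensor:
        # visual_feat: (B, vis_dim) pooled feature of the individual's face frames.
        bias = self.meta_net(visual_feat)        # (B, ctx_dim)
        bias = bias.unsqueeze(1).unsqueeze(1)    # (B, 1, 1, ctx_dim)
        ctx = self.ctx.unsqueeze(0)              # (1, E, n_ctx, ctx_dim)
        # Each individual gets its own shifted copy of every emotion prompt.
        return ctx + bias                        # (B, E, n_ctx, ctx_dim)
```

In use, the personalized context tokens would be concatenated with the embedded emotion word (e.g., "happy") and passed through a frozen text encoder, so the resulting semantic feature varies with the individual rather than being one generic vector per emotion.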
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Personalized Cross-Modal Emotional Correlation Learning (PCMECL) as a plug-and-play module for speech-preserving facial expression manipulation (SPFEM). It refines VLM-based supervision by conditioning prompts on per-individual visual information to capture personalized expressive variations and by applying feature differencing to align changes in visual and semantic features, thereby addressing the scarcity of paired frames with identical speech but varying expressions. The paper claims this yields more precisely aligned cross-modal supervision and superior performance when integrated into existing SPFEM models, as demonstrated by extensive experiments across datasets.
Significance. If the central claims hold, PCMECL could meaningfully advance multimodal facial animation by enabling more individualized emotional editing while preserving speech synchronization, leveraging existing VLMs without requiring new paired datasets. The plug-and-play design would facilitate adoption in downstream applications such as video synthesis and virtual avatars.
Major comments (2)
- [Proposed Method] Proposed Method (feature differencing step): The claim that matching visual deltas to semantic deltas yields precisely aligned supervision is load-bearing for the improved correlation. Yet the manuscript provides neither direct verification against ground-truth paired deltas (which the introduction notes are scarce) nor a derivation showing why the deltas remain commensurate under non-linear VLM feature spaces or individual biases; without additional safeguards this risks mapping to incorrect expression directions (see the notation sketch after this list).
- [Experiments] Experiments section: The abstract asserts superior efficacy from extensive experiments, but without reported quantitative metrics (e.g., FID, expression accuracy, or user-study scores), ablation results isolating the personalization and differencing components, or error analysis, the magnitude of improvement over baselines cannot be assessed and the central empirical claim remains unverified.
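To pin down notation for the feature-differencing objection above, one plausible reading of the delta-matching supervision described in the abstract (our notation, not necessarily the paper's formulation) is:

```latex
% Sketch of a delta-alignment objective consistent with the abstract's wording;
% E_v, E_t denote the VLM's visual and text encoders (notation assumed, not the paper's).
\begin{align*}
\Delta v &= E_v(I_{\mathrm{edit}}) - E_v(I_{\mathrm{src}}),
  & \Delta s &= E_t(p_{\mathrm{tgt}}) - E_t(p_{\mathrm{src}}), \\
\mathcal{L}_{\mathrm{diff}} &= 1 - \cos(\Delta v, \Delta s)
  = 1 - \frac{\langle \Delta v, \Delta s \rangle}{\lVert \Delta v \rVert \, \lVert \Delta s \rVert}.
\end{align*}
```

Here the I terms are the source and manipulated frames and the p terms are the (personalized) prompts for the source and target emotions. The objection is that nothing guarantees the visual and semantic deltas are commensurate in scale or direction once the encoders are non-linear and identity-dependent, which is what a derivation or empirical check would need to establish.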
Minor comments (1)
- [Abstract] Abstract: Including one or two key quantitative highlights or dataset names would strengthen the summary of results without altering length.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below with clarifications and indicate where revisions will be made to strengthen the presentation.
Point-by-point responses
- Referee: [Proposed Method] Proposed Method (feature differencing step): The claim that matching visual deltas to semantic deltas yields precisely aligned supervision is load-bearing for the improved correlation. Yet the manuscript provides neither direct verification against ground-truth paired deltas (which the introduction notes are scarce) nor a derivation showing why the deltas remain commensurate under non-linear VLM feature spaces or individual biases; without additional safeguards this risks mapping to incorrect expression directions.
Authors: We agree that direct verification against ground-truth paired deltas is not possible, as the introduction explicitly notes the scarcity of such data. The feature differencing step is motivated by the principle that relative changes (deltas) in visual and semantic features are more likely to be commensurate than absolute features, since non-linearities and per-individual biases in VLMs primarily affect baseline representations rather than expression-induced variations. This design choice is validated indirectly through consistent performance gains when PCMECL is integrated as a plug-and-play module into existing SPFEM models. In the revised manuscript, we will expand the method section with additional motivation for the delta alignment assumption, a discussion of its limitations, and safeguards against potential misalignment. revision: partial
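One concrete form such a safeguard could take, offered purely as an illustrative sketch rather than the authors' method, is to anchor both deltas to the same source frame and skip samples whose semantic delta is too small to define a reliable direction:

```python
import torch
import torch.nn.functional as F

def delta_alignment_loss(v_src: torch.Tensor,
                         v_edit: torch.Tensor,
                         s_src: torch.Tensor,
                         s_tgt: torch.Tensor,
                         min_delta_norm: float = 1e-3) -> torch.Tensor:
    """Hedged sketch of a gated delta-alignment loss (hypothetical helper).

    v_*: visual features of the source / edited frames, shape (B, D).
    s_*: semantic features of the source / target emotion prompts, shape (B, D).
    """
    dv = v_edit - v_src
    ds = s_tgt - s_src
    # Skip samples whose semantic delta is near zero: no reliable target direction.
    valid = ds.norm(dim=-1) > min_delta_norm
    if valid.sum() == 0:
        return v_edit.sum() * 0.0  # keeps the graph connected, contributes nothing
    cos = F.cosine_similarity(dv[valid], ds[valid], dim=-1)
    return (1.0 - cos).mean()
```

Because both differences share the same source anchor, an identity-specific baseline offset appears in both terms and largely cancels, which is the intuition behind the rebuttal's claim that deltas are more commensurate than absolute features.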
- Referee: [Experiments] Experiments section: The abstract asserts superior efficacy from extensive experiments, but without reported quantitative metrics (e.g., FID, expression accuracy, or user-study scores), ablation results isolating the personalization and differencing components, or error analysis, the magnitude of improvement over baselines cannot be assessed and the central empirical claim remains unverified.
Authors: The experiments section reports quantitative results across multiple datasets using metrics including FID and expression accuracy, with comparisons to baseline SPFEM models. However, we acknowledge that dedicated ablations isolating the personalization and feature differencing components, along with error analysis, would provide clearer evidence of their individual contributions and the overall improvement magnitude. We will revise the experiments section to include these ablations and error analysis. revision: yes
Circularity Check
No significant circularity in PCMECL method description
Full rationale
The paper presents PCMECL as a plug-and-play refinement to VLM-based supervision via two added steps: conditioning prompts on per-individual visual features and applying feature differencing to align modality deltas. No equations, loss functions, or derivation chains are shown that reduce the claimed correlations or supervision signals to fitted parameters, self-definitions, or prior self-citations by construction. The approach is described as building on external VLMs with independent methodological additions whose efficacy is asserted via experiments rather than tautological equivalence to inputs. This is the common case of a non-circular algorithmic proposal.