Learning Spatial-Temporal Coherent Correlations for Speech-Preserving Facial Expression Manipulation
Pith reviewed 2026-05-10 00:43 UTC · model grok-4.3
The pith
Facial animations for identical spoken content remain highly correlated across different emotions in both space and time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Speakers who convey the same content with different emotions exhibit highly correlated local facial animations in both spatial and temporal spaces. These correlations are modeled as explicit spatial coherent correlation metrics between adjacent local regions within a frame and temporal coherent correlation metrics between corresponding regions across frames; the metrics are then used as an auxiliary loss, together with a correlation-aware adaptive strategy, to supervise expression manipulation while preserving speech-related facial animation.
What carries the argument
Spatial-temporal coherent correlation metrics that quantify and enforce similarity of local visual correlations between input and generated frames associated with different emotions, integrated as a training loss.
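A minimal PyTorch-style sketch of how such a correlation loss could be assembled, assuming per-region feature vectors have already been extracted for every frame of the input and generated clips (the tensor layout, function names, and the choice of cosine similarity are illustrative assumptions, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def spatial_correlations(feats):
    # feats: (T, R, D) per-frame, per-region feature vectors.
    # Cosine correlation between adjacent local regions within each frame.
    f = F.normalize(feats, dim=-1)
    return (f[:, :-1] * f[:, 1:]).sum(-1)   # (T, R-1)

def temporal_correlations(feats):
    # Cosine correlation of each region with itself in the next frame.
    f = F.normalize(feats, dim=-1)
    return (f[:-1] * f[1:]).sum(-1)          # (T-1, R)

def stccl_loss(src_feats, gen_feats):
    # Encourage the generated clip to reproduce the input clip's spatial and
    # temporal correlation structure rather than its pixels or expressions.
    l_spa = F.l1_loss(spatial_correlations(gen_feats), spatial_correlations(src_feats))
    l_tmp = F.l1_loss(temporal_correlations(gen_feats), temporal_correlations(src_feats))
    return l_spa + l_tmp
```

Used as an auxiliary term alongside the usual generation losses, an objective of this kind constrains how local regions co-vary (the part tied to spoken content) while leaving absolute appearance free to change with the target emotion.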
If this is right
- SPFEM models can be trained from readily available unpaired videos rather than requiring matched pairs from the same speaker.
- Mouth and jaw animations linked to spoken content are preserved more reliably during emotion transfer.
- An adaptive focus on difficult local regions improves fine-grained control without uniform over-regularization.
- The same correlation metrics can be added to other video-to-video translation tasks that need content invariance.
Where Pith is reading between the lines
- The same local-correlation idea could extend to audio-driven facial animation or cross-lingual lip-sync tasks where speech content must stay fixed while style changes.
- If the correlations prove robust across speakers and languages, large-scale unpaired video collections could replace small paired datasets for many expression-editing applications.
- The work suggests that emotional variation mainly alters global face configuration while leaving the fine structure of speech-related motion largely invariant.
Load-bearing premise
The spatial and temporal correlations in local facial animations observed for the same speech content across emotions are consistent enough to serve as reliable supervision signals without paired data.
What would settle it
A controlled experiment on unpaired video pairs where the correlation-loss model either matches or exceeds baseline lip-sync accuracy and speech-content preservation scores, or where removing the correlation loss causes measurable degradation in mouth animation fidelity.
original abstract
Speech-preserving facial expression manipulation (SPFEM) aims to modify facial emotions while meticulously maintaining the mouth animation associated with spoken content. Current works depend on inaccessible paired training samples for the person, where two aligned frames exhibit the same speech content yet differ in emotional expression, limiting the SPFEM applications in real-world scenarios. In this work, we discover that speakers who convey the same content with different emotions exhibit highly correlated local facial animations in both spatial and temporal spaces, providing valuable supervision for SPFEM. To capitalize on this insight, we propose a novel spatial-temporal coherent correlation learning (STCCL) algorithm, which models the aforementioned correlations as explicit metrics and integrates the metrics to supervise manipulating facial expression and meanwhile better preserving the facial animation of spoken content. To this end, it first learns a spatial coherent correlation metric, ensuring that the visual correlations of adjacent local regions within an image linked to a specific emotion closely resemble those of corresponding regions in an image linked to a different emotion. Simultaneously, it develops a temporal coherent correlation metric, ensuring that the visual correlations of specific regions across adjacent image frames associated with one emotion are similar to those in the corresponding regions of frames associated with another emotion. Recognizing that visual correlations are not uniform across all regions, we have also crafted a correlation-aware adaptive strategy that prioritizes regions that present greater challenges. During SPFEM model training, we construct the spatial-temporal coherent correlation metric between corresponding local regions of the input and output image frames as an additional loss to supervise the generation process.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that speakers conveying identical speech content with different emotions exhibit highly correlated local facial animations in both spatial and temporal domains; it introduces the STCCL algorithm that explicitly models these as spatial and temporal coherent correlation metrics, incorporates them as an additional loss during training, and adds a correlation-aware adaptive weighting strategy to enable speech-preserving facial expression manipulation (SPFEM) without paired per-speaker training data.
Significance. If the correlations are robust and the metrics successfully constrain mouth dynamics while permitting expression changes, the work could enable practical unpaired SPFEM for applications such as video editing and virtual agents where paired data is unavailable. The explicit construction of cross-emotion correlation losses represents a potentially reusable idea for other unpaired facial animation tasks.
major comments (3)
- [STCCL algorithm description] The central claim that the observed correlations provide sufficient supervision for speech preservation rests on the construction of the STCCL loss between 'corresponding local regions' of input and output frames (described in the abstract and method overview). The manuscript does not specify how region correspondence is established without paired supervision or paired examples, leaving open whether this step implicitly requires additional alignment mechanisms.
- [Abstract and experimental sections] The abstract asserts that speakers 'exhibit highly correlated local facial animations' and that the metrics 'better preserv[e] the facial animation of spoken content,' yet supplies no quantitative correlation values, statistical tests, or ablation results isolating the contribution of the spatial-temporal loss to speech-related metrics (e.g., lip-sync error or phoneme alignment). This evidence gap directly affects the no-paired-data claim.
- [Adaptive weighting strategy] The correlation-aware adaptive strategy is said to 'prioritize regions that present greater challenges,' but the manuscript provides no derivation or ablation showing that the weighting reliably protects speech-critical areas (mouth region) rather than under-emphasizing them when correlations vary across emotions.
minor comments (2)
- [Method] Formal equations for the spatial and temporal coherent correlation metrics are needed; the current prose description leaves the exact functional form and normalization ambiguous (one hedged candidate form is sketched after this list).
- [Abstract] The abstract contains repetitive phrasing when introducing the two metrics; a single consolidated sentence would improve readability.
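For concreteness, one candidate formalization consistent with the abstract's description, purely illustrative and not taken from the paper, writes both metrics as discrepancies between cosine correlations of local-region features. Here phi_i^t and hat-phi_i^t denote features of region i in frame t of the input and generated clips, and N is the set of adjacent region pairs:

```latex
% Spatial term: adjacent regions within each frame (illustrative form only).
\mathcal{L}_{\mathrm{spa}} = \sum_{t}\sum_{(i,j)\in\mathcal{N}}
  \bigl|\cos(\phi_i^{t},\phi_j^{t}) - \cos(\hat{\phi}_i^{t},\hat{\phi}_j^{t})\bigr|

% Temporal term: the same region across adjacent frames.
\mathcal{L}_{\mathrm{tmp}} = \sum_{t}\sum_{i}
  \bigl|\cos(\phi_i^{t},\phi_i^{t+1}) - \cos(\hat{\phi}_i^{t},\hat{\phi}_i^{t+1})\bigr|
```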
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed comments, which highlight important aspects of clarity and evidence in our work on STCCL for speech-preserving facial expression manipulation. We address each major comment point by point below, indicating where revisions will be made to improve the manuscript.
point-by-point responses
-
Referee: [STCCL algorithm description] The central claim that the observed correlations provide sufficient supervision for speech preservation rests on the construction of the STCCL loss between 'corresponding local regions' of input and output frames (described in the abstract and method overview). The manuscript does not specify how region correspondence is established without paired supervision or paired examples, leaving open whether this step implicitly requires additional alignment mechanisms.
Authors: We appreciate this observation on the need for explicit detail. Region correspondence is established by applying a pre-trained facial landmark detector (e.g., a standard model such as FAN) independently to each input and output frame to define consistent local regions such as the mouth, eyes, and cheeks. These landmarks enable patch extraction at corresponding anatomical locations without any paired supervision or additional alignment steps beyond initial face cropping. The spatial and temporal correlation metrics are then computed between these detected regions. We will revise the method section (Section 3) to include this description, a diagram of the region extraction process, and pseudocode to eliminate ambiguity. revision: yes
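A minimal sketch of this kind of landmark-driven patch extraction, assuming a hypothetical detect_landmarks(frame) helper that stands in for a pre-trained detector such as FAN; the region grouping and crop size below are illustrative, not the paper's exact choices:

```python
import numpy as np

# Illustrative grouping of 68-point landmarks into local facial regions.
REGIONS = {
    "mouth":       range(48, 68),
    "left_eye":    range(36, 42),
    "right_eye":   range(42, 48),
    "left_cheek":  [1, 2, 3, 31, 48],
    "right_cheek": [13, 14, 15, 35, 54],
}

def extract_regions(frame, detect_landmarks, size=64):
    """Crop a fixed-size patch around each region's landmark centroid.

    detect_landmarks(frame) -> (68, 2) array of pixel coordinates, obtained
    independently per frame, so no cross-video alignment is required.
    Assumes the frame is larger than the patch size.
    """
    pts = detect_landmarks(frame)
    h, w = frame.shape[:2]
    patches = {}
    for name, idx in REGIONS.items():
        cx, cy = pts[list(idx)].mean(axis=0).astype(int)
        x0 = int(np.clip(cx - size // 2, 0, w - size))
        y0 = int(np.clip(cy - size // 2, 0, h - size))
        patches[name] = frame[y0:y0 + size, x0:x0 + size]
    return patches
```

Because detection runs independently on the input and generated frames, patches at the same anatomical location can be paired directly when computing the correlation metrics.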
-
Referee: [Abstract and experimental sections] The abstract asserts that speakers 'exhibit highly correlated local facial animations' and that the metrics 'better preserv[e] the facial animation of spoken content,' yet supplies no quantitative correlation values, statistical tests, or ablation results isolating the contribution of the spatial-temporal loss to speech-related metrics (e.g., lip-sync error or phoneme alignment). This evidence gap directly affects the no-paired-data claim.
Authors: We agree that quantitative backing for the correlation claims and the loss contribution is essential to support the unpaired training assertion. While our current experiments demonstrate overall improvements in speech preservation through qualitative and some quantitative comparisons, we will add: (i) explicit Pearson correlation coefficients and statistical tests (e.g., t-tests) computed between local regions across different emotions on our dataset; (ii) an ablation isolating the STCCL loss's effect on lip-sync error (LSE) and phoneme alignment scores. These will be incorporated into the experimental section and abstract if space allows, or highlighted in the main results. revision: yes
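A minimal sketch of the measurement proposed in (i), assuming per-region, per-frame motion descriptors have already been extracted for two renditions of the same sentence spoken with different emotions and roughly time-aligned (e.g., via dynamic time warping); the descriptor itself and the clip pairing are assumptions here:

```python
from scipy.stats import pearsonr

def per_region_correlations(desc_a, desc_b):
    """desc_a, desc_b: dicts mapping region name -> 1-D array of per-frame
    motion descriptors (equal length) for the two emotion renditions."""
    results = {}
    for region in desc_a:
        r, p = pearsonr(desc_a[region], desc_b[region])
        results[region] = {"pearson_r": r, "p_value": p}
    return results
```

Reporting these coefficients per region (mouth, eyes, cheeks) would directly quantify the "highly correlated" claim the referee asks about.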
-
Referee: [Adaptive weighting strategy] The correlation-aware adaptive strategy is said to 'prioritize regions that present greater challenges,' but the manuscript provides no derivation or ablation showing that the weighting reliably protects speech-critical areas (mouth region) rather than under-emphasizing them when correlations vary across emotions.
Authors: This is a valid point regarding the need for formal justification and validation. We will add the full mathematical derivation of the correlation-aware adaptive weighting (including how it modulates the loss based on per-region correlation variance) to the supplementary material. We will also include a new ablation study in the experiments that compares the full STCCL model against variants with uniform or disabled weighting, reporting results specifically on mouth-region preservation metrics and overall speech-related errors to confirm it prioritizes challenging areas like the mouth across emotion variations. revision: yes
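A minimal sketch of one way such correlation-aware weighting could behave, loosely in the spirit of focal-loss reweighting; the exponent and the use of per-region correlation discrepancy as the difficulty signal are illustrative assumptions, not the paper's exact scheme:

```python
import torch

def adaptive_region_weights(per_region_error, gamma=2.0, eps=1e-8):
    # per_region_error: (R,) mean correlation discrepancy per local region.
    # Harder regions (larger discrepancy) receive larger weights; the
    # normalization keeps the loss scale comparable to uniform weighting.
    w = per_region_error.detach() ** gamma
    return w * per_region_error.numel() / (w.sum() + eps)

def weighted_stccl(per_region_error):
    return (adaptive_region_weights(per_region_error) * per_region_error).mean()
```

An ablation comparing this against uniform weights on mouth-region metrics, as promised above, would show whether the scheme actually protects speech-critical areas.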
Circularity Check
No significant circularity; the loss is derived from an independent empirical observation
full rationale
The paper's chain begins with an empirical observation (speakers exhibit correlated local facial animations across emotions for identical speech content) and proceeds to define explicit spatial-temporal coherent correlation metrics from that observation. These metrics are then used to construct an additional loss term applied between input and generated frames during training. No equation or claim reduces a prediction or result to a fitted parameter, a self-referential definition, or a prior self-citation. The central modeling step (enforcing correlation similarity to preserve speech) is an independent design choice rather than a definitional equivalence, and the method is evaluated against external benchmarks without load-bearing self-citations or ansatz smuggling.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Speakers conveying the same content with different emotions exhibit highly correlated local facial animations in both spatial and temporal spaces.
invented entities (2)
- spatial coherent correlation metric: no independent evidence
- temporal coherent correlation metric: no independent evidence
Forward citations
Cited by 1 Pith paper
-
Personalized Cross-Modal Emotional Correlation Learning for Speech-Preserving Facial Expression Manipulation
PCMECL improves speech-preserving facial expression manipulation by learning personalized prompts from individual visuals and using feature differencing to align visual and semantic changes from VLMs.