Learning Spatial-Temporal Coherent Correlations for Speech-Preserving Facial Expression Manipulation
Pith reviewed 2026-05-10 00:43 UTC · model grok-4.3
The pith
Facial animations for identical spoken content remain highly correlated across different emotions in both space and time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Speakers who convey the same content with different emotions exhibit highly correlated local facial animations in both spatial and temporal spaces. These correlations are modeled as explicit spatial coherent correlation metrics between adjacent local regions within a frame and temporal coherent correlation metrics between corresponding regions across frames; the metrics are then used as an auxiliary loss, together with a correlation-aware adaptive strategy, to supervise expression manipulation while preserving speech-related facial animation.
What carries the argument
Spatial-temporal coherent correlation metrics that quantify and enforce similarity of local visual correlations between input and generated frames associated with different emotions, integrated as a training loss.
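A minimal PyTorch-style sketch of how such a correlation loss could be assembled, assuming per-region feature vectors have already been extracted for every frame of the input and generated clips (the tensor layout, function names, and the choice of cosine similarity are illustrative assumptions, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def spatial_correlations(feats):
    # feats: (T, R, D) per-frame, per-region feature vectors.
    # Cosine correlation between adjacent local regions within each frame.
    f = F.normalize(feats, dim=-1)
    return (f[:, :-1] * f[:, 1:]).sum(-1)   # (T, R-1)

def temporal_correlations(feats):
    # Cosine correlation of each region with itself in the next frame.
    f = F.normalize(feats, dim=-1)
    return (f[:-1] * f[1:]).sum(-1)          # (T-1, R)

def stccl_loss(src_feats, gen_feats):
    # Encourage the generated clip to reproduce the input clip's spatial and
    # temporal correlation structure rather than its pixels or expressions.
    l_spa = F.l1_loss(spatial_correlations(gen_feats), spatial_correlations(src_feats))
    l_tmp = F.l1_loss(temporal_correlations(gen_feats), temporal_correlations(src_feats))
    return l_spa + l_tmp
```

Used as an auxiliary term alongside the usual generation losses, an objective of this kind constrains how local regions co-vary (the part tied to spoken content) while leaving absolute appearance free to change with the target emotion.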
If this is right
- SPFEM models can be trained from readily available unpaired videos rather than requiring matched pairs from the same speaker.
- Mouth and jaw animations linked to spoken content are preserved more reliably during emotion transfer.
- An adaptive focus on difficult local regions improves fine-grained control without uniform over-regularization.
- The same correlation metrics can be added to other video-to-video translation tasks that need content invariance.
Where Pith is reading between the lines
- The same local-correlation idea could extend to audio-driven facial animation or cross-lingual lip-sync tasks where speech content must stay fixed while style changes.
- If the correlations prove robust across speakers and languages, large-scale unpaired video collections could replace small paired datasets for many expression-editing applications.
- The work suggests that emotional variation mainly alters global face configuration while leaving the fine structure of speech-related motion largely invariant.
Load-bearing premise
The spatial and temporal correlations in local facial animations observed for the same speech content across emotions are consistent enough to serve as reliable supervision signals without paired data.
What would settle it
A controlled experiment on unpaired video pairs where the correlation-loss model either matches or exceeds baseline lip-sync accuracy and speech-content preservation scores, or where removing the correlation loss causes measurable degradation in mouth animation fidelity.
original abstract
Speech-preserving facial expression manipulation (SPFEM) aims to modify facial emotions while meticulously maintaining the mouth animation associated with spoken content. Current works depend on inaccessible paired training samples for the person, where two aligned frames exhibit the same speech content yet differ in emotional expression, limiting the SPFEM applications in real-world scenarios. In this work, we discover that speakers who convey the same content with different emotions exhibit highly correlated local facial animations in both spatial and temporal spaces, providing valuable supervision for SPFEM. To capitalize on this insight, we propose a novel spatial-temporal coherent correlation learning (STCCL) algorithm, which models the aforementioned correlations as explicit metrics and integrates the metrics to supervise manipulating facial expression and meanwhile better preserving the facial animation of spoken content. To this end, it first learns a spatial coherent correlation metric, ensuring that the visual correlations of adjacent local regions within an image linked to a specific emotion closely resemble those of corresponding regions in an image linked to a different emotion. Simultaneously, it develops a temporal coherent correlation metric, ensuring that the visual correlations of specific regions across adjacent image frames associated with one emotion are similar to those in the corresponding regions of frames associated with another emotion. Recognizing that visual correlations are not uniform across all regions, we have also crafted a correlation-aware adaptive strategy that prioritizes regions that present greater challenges. During SPFEM model training, we construct the spatial-temporal coherent correlation metric between corresponding local regions of the input and output image frames as an additional loss to supervise the generation process.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that speakers conveying identical speech content with different emotions exhibit highly correlated local facial animations in both spatial and temporal domains; it introduces the STCCL algorithm that explicitly models these as spatial and temporal coherent correlation metrics, incorporates them as an additional loss during training, and adds a correlation-aware adaptive weighting strategy to enable speech-preserving facial expression manipulation (SPFEM) without paired per-speaker training data.
Significance. If the correlations are robust and the metrics successfully constrain mouth dynamics while permitting expression changes, the work could enable practical unpaired SPFEM for applications such as video editing and virtual agents where paired data is unavailable. The explicit construction of cross-emotion correlation losses represents a potentially reusable idea for other unpaired facial animation tasks.
major comments (3)
- [STCCL algorithm description] The central claim that the observed correlations provide sufficient supervision for speech preservation rests on the construction of the STCCL loss between 'corresponding local regions' of input and output frames (described in the abstract and method overview). The manuscript does not specify how region correspondence is established without paired supervision or paired examples, leaving open whether this step implicitly requires additional alignment mechanisms.
- [Abstract and experimental sections] The abstract asserts that speakers 'exhibit highly correlated local facial animations' and that the metrics 'better preserv[e] the facial animation of spoken content,' yet supplies no quantitative correlation values, statistical tests, or ablation results isolating the contribution of the spatial-temporal loss to speech-related metrics (e.g., lip-sync error or phoneme alignment). This evidence gap directly affects the no-paired-data claim.
- [Adaptive weighting strategy] The correlation-aware adaptive strategy is said to 'prioritize regions that present greater challenges,' but the manuscript provides no derivation or ablation showing that the weighting reliably protects speech-critical areas (mouth region) rather than under-emphasizing them when correlations vary across emotions.
minor comments (2)
- [Method] Formal equations for the spatial and temporal coherent correlation metrics are needed; the current prose description leaves the exact functional form and normalization ambiguous (one hedged candidate form is sketched after this list).
- [Abstract] The abstract contains repetitive phrasing when introducing the two metrics; a single consolidated sentence would improve readability.
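For concreteness, one candidate formalization consistent with the abstract's description, purely illustrative and not taken from the paper, writes both metrics as discrepancies between cosine correlations of local-region features. Here phi_i^t and hat-phi_i^t denote features of region i in frame t of the input and generated clips, and N is the set of adjacent region pairs:

```latex
% Spatial term: adjacent regions within each frame (illustrative form only).
\mathcal{L}_{\mathrm{spa}} = \sum_{t}\sum_{(i,j)\in\mathcal{N}}
  \bigl|\cos(\phi_i^{t},\phi_j^{t}) - \cos(\hat{\phi}_i^{t},\hat{\phi}_j^{t})\bigr|

% Temporal term: the same region across adjacent frames.
\mathcal{L}_{\mathrm{tmp}} = \sum_{t}\sum_{i}
  \bigl|\cos(\phi_i^{t},\phi_i^{t+1}) - \cos(\hat{\phi}_i^{t},\hat{\phi}_i^{t+1})\bigr|
```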
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed comments, which highlight important aspects of clarity and evidence in our work on STCCL for speech-preserving facial expression manipulation. We address each major comment point by point below, indicating where revisions will be made to improve the manuscript.
point-by-point responses
-
Referee: [STCCL algorithm description] The central claim that the observed correlations provide sufficient supervision for speech preservation rests on the construction of the STCCL loss between 'corresponding local regions' of input and output frames (described in the abstract and method overview). The manuscript does not specify how region correspondence is established without paired supervision or paired examples, leaving open whether this step implicitly requires additional alignment mechanisms.
Authors: We appreciate this observation on the need for explicit detail. Region correspondence is established by applying a pre-trained facial landmark detector (e.g., a standard model such as FAN) independently to each input and output frame to define consistent local regions such as the mouth, eyes, and cheeks. These landmarks enable patch extraction at corresponding anatomical locations without any paired supervision or additional alignment steps beyond initial face cropping. The spatial and temporal correlation metrics are then computed between these detected regions. We will revise the method section (Section 3) to include this description, a diagram of the region extraction process, and pseudocode to eliminate ambiguity. revision: yes
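A minimal sketch of this kind of landmark-driven patch extraction, assuming a hypothetical detect_landmarks(frame) helper that stands in for a pre-trained detector such as FAN; the region grouping and crop size below are illustrative, not the paper's exact choices:

```python
import numpy as np

# Illustrative grouping of 68-point landmarks into local facial regions.
REGIONS = {
    "mouth":       range(48, 68),
    "left_eye":    range(36, 42),
    "right_eye":   range(42, 48),
    "left_cheek":  [1, 2, 3, 31, 48],
    "right_cheek": [13, 14, 15, 35, 54],
}

def extract_regions(frame, detect_landmarks, size=64):
    """Crop a fixed-size patch around each region's landmark centroid.

    detect_landmarks(frame) -> (68, 2) array of pixel coordinates, obtained
    independently per frame, so no cross-video alignment is required.
    Assumes the frame is larger than the patch size.
    """
    pts = detect_landmarks(frame)
    h, w = frame.shape[:2]
    patches = {}
    for name, idx in REGIONS.items():
        cx, cy = pts[list(idx)].mean(axis=0).astype(int)
        x0 = int(np.clip(cx - size // 2, 0, w - size))
        y0 = int(np.clip(cy - size // 2, 0, h - size))
        patches[name] = frame[y0:y0 + size, x0:x0 + size]
    return patches
```

Because detection runs independently on the input and generated frames, patches at the same anatomical location can be paired directly when computing the correlation metrics.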
-
Referee: [Abstract and experimental sections] The abstract asserts that speakers 'exhibit highly correlated local facial animations' and that the metrics 'better preserv[e] the facial animation of spoken content,' yet supplies no quantitative correlation values, statistical tests, or ablation results isolating the contribution of the spatial-temporal loss to speech-related metrics (e.g., lip-sync error or phoneme alignment). This evidence gap directly affects the no-paired-data claim.
Authors: We agree that quantitative backing for the correlation claims and the loss contribution is essential to support the unpaired training assertion. While our current experiments demonstrate overall improvements in speech preservation through qualitative and some quantitative comparisons, we will add: (i) explicit Pearson correlation coefficients and statistical tests (e.g., t-tests) computed between local regions across different emotions on our dataset; (ii) an ablation isolating the STCCL loss's effect on lip-sync error (LSE) and phoneme alignment scores. These will be incorporated into the experimental section and abstract if space allows, or highlighted in the main results. revision: yes
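A minimal sketch of the measurement proposed in (i), assuming per-region, per-frame motion descriptors have already been extracted for two renditions of the same sentence spoken with different emotions and roughly time-aligned (e.g., via dynamic time warping); the descriptor itself and the clip pairing are assumptions here:

```python
from scipy.stats import pearsonr

def per_region_correlations(desc_a, desc_b):
    """desc_a, desc_b: dicts mapping region name -> 1-D array of per-frame
    motion descriptors (equal length) for the two emotion renditions."""
    results = {}
    for region in desc_a:
        r, p = pearsonr(desc_a[region], desc_b[region])
        results[region] = {"pearson_r": r, "p_value": p}
    return results
```

Reporting these coefficients per region (mouth, eyes, cheeks) would directly quantify the "highly correlated" claim the referee asks about.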
-
Referee: [Adaptive weighting strategy] The correlation-aware adaptive strategy is said to 'prioritize regions that present greater challenges,' but the manuscript provides no derivation or ablation showing that the weighting reliably protects speech-critical areas (mouth region) rather than under-emphasizing them when correlations vary across emotions.
Authors: This is a valid point regarding the need for formal justification and validation. We will add the full mathematical derivation of the correlation-aware adaptive weighting (including how it modulates the loss based on per-region correlation variance) to the supplementary material. We will also include a new ablation study in the experiments that compares the full STCCL model against variants with uniform or disabled weighting, reporting results specifically on mouth-region preservation metrics and overall speech-related errors to confirm it prioritizes challenging areas like the mouth across emotion variations. revision: yes
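A minimal sketch of one way such correlation-aware weighting could behave, loosely in the spirit of focal-loss reweighting; the exponent and the use of per-region correlation discrepancy as the difficulty signal are illustrative assumptions, not the paper's exact scheme:

```python
import torch

def adaptive_region_weights(per_region_error, gamma=2.0, eps=1e-8):
    # per_region_error: (R,) mean correlation discrepancy per local region.
    # Harder regions (larger discrepancy) receive larger weights; the
    # normalization keeps the loss scale comparable to uniform weighting.
    w = per_region_error.detach() ** gamma
    return w * per_region_error.numel() / (w.sum() + eps)

def weighted_stccl(per_region_error):
    return (adaptive_region_weights(per_region_error) * per_region_error).mean()
```

An ablation comparing this against uniform weights on mouth-region metrics, as promised above, would show whether the scheme actually protects speech-critical areas.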
Circularity Check
No significant circularity; the loss is derived from an independent empirical observation
full rationale
The paper's chain begins with an empirical observation (speakers exhibit correlated local facial animations across emotions for identical speech content) and proceeds to define explicit spatial-temporal coherent correlation metrics from that observation. These metrics are then used to construct an additional loss term applied between input and generated frames during training. No equation or claim reduces a prediction or result to a fitted parameter, a self-referential definition, or a prior self-citation. The central modeling step (enforcing correlation similarity to preserve speech) is an independent design choice rather than a definitional equivalence, and the method is evaluated against external benchmarks without load-bearing self-citations or ansatz smuggling.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Speakers conveying the same content with different emotions exhibit highly correlated local facial animations in both spatial and temporal spaces.
invented entities (2)
- spatial coherent correlation metric: no independent evidence
- temporal coherent correlation metric: no independent evidence
Forward citations
Cited by 1 Pith paper
-
Personalized Cross-Modal Emotional Correlation Learning for Speech-Preserving Facial Expression Manipulation
PCMECL improves speech-preserving facial expression manipulation by learning personalized prompts from individual visuals and using feature differencing to align visual and semantic changes from VLMs.