Temporally-Aligned Evaluation for Audio-Driven Talking Head Generation

Lei Wang; Yongsheng Gao; Yu Zhang; Zhicheng Zhang

arXiv: 2606.01031 · v1 · pith:LI7Y53MTnew · submitted 2026-05-31 · 💻 cs.GR · cs.AI · cs.CV · cs.LG · cs.MM

Pith reviewed 2026-06-28 16:17 UTC · model grok-4.3

classification 💻 cs.GR · cs.AI · cs.CV · cs.LG · cs.MM

keywords talking head generation · evaluation metrics · temporal alignment · audio-driven animation · video quality assessment · dynamic time warping · synchronization evaluation · facial animation

The pith

Evaluation of audio-driven talking heads requires sequence alignment to handle natural timing variations in speech.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that conventional frame-wise metrics penalize harmless timing differences that arise naturally from varying speech speeds and styles. It reformulates evaluation as a sequence-alignment task by adding Soft Dynamic Time Warping to existing perceptual, identity, and synchronization pipelines. This change treats rigid frame comparison as a special case and yields metrics that remain stable when timing shifts occur. A benchmark across twenty methods and seven datasets shows the aligned approach produces more consistent scores and clearer distinctions between modeling choices such as synchronization versus realism.

Core claim

Reformulating evaluation of audio-driven talking-head videos as a sequence-alignment problem with Soft Dynamic Time Warping supplies robustness to bounded temporal misalignments, leaves the underlying encoders unchanged, treats frame-wise comparison as rigid alignment, and produces more stable results that better expose trade-offs between synchronization, realism, expressiveness, and stability.

What carries the argument

Soft Dynamic Time Warping applied to feature trajectories inside established perceptual, identity, and synchronization evaluation pipelines to align sequences while preserving temporal order.
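The mechanism named above can be sketched in a few lines. This is a minimal illustration, assuming a squared Euclidean frame cost and the standard Soft-DTW recursion of Cuturi and Blondel; the paper's actual cost function, normalization, and smoothing value are not given on this page, so `sq_dist`, `gamma`, and the toy trajectories below are assumptions made for the example.

```python
import math

def soft_min(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    lo = min(values)
    return lo - gamma * math.log(sum(math.exp(-(v - lo) / gamma) for v in values))

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors (assumed frame cost)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def framewise(seq_a, seq_b):
    """Rigid comparison: frame t is compared only with frame t."""
    return sum(sq_dist(a, b) for a, b in zip(seq_a, seq_b))

def soft_dtw(seq_a, seq_b, gamma=1.0):
    """Soft-DTW cost between two feature trajectories (lists of vectors)."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # r[i][j]: cost of aligning the first i frames of seq_a with the first j of seq_b.
    r = [[inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = sq_dist(seq_a[i - 1], seq_b[j - 1])
            # Only monotonic steps are allowed, so temporal order is preserved.
            r[i][j] = cost + soft_min([r[i - 1][j], r[i][j - 1], r[i - 1][j - 1]], gamma)
    return r[n][m]

# Toy 1-D trajectories (not data from the paper): the same motion, one frame late.
ref = [[0.0], [0.0], [1.0], [2.0], [1.0], [0.0]]
gen = [[0.0], [0.0], [0.0], [1.0], [2.0], [1.0]]
print("frame-wise:", framewise(ref, gen))
print("soft-DTW  :", soft_dtw(ref, gen, gamma=0.1))
```

As `gamma` approaches zero the soft minimum approaches the hard minimum and the recursion reduces to classic DTW. For `gamma > 0` the value can be negative, so it behaves as an alignment cost rather than a true distance.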

If this is right

Aligned metrics reduce sensitivity to harmless timing differences that occur in natural speech.
Scores become more consistent when the same methods are tested on different datasets.
Trade-offs between synchronization performance and visual realism appear more clearly.
Differences between expressiveness and temporal stability in the generated motion become easier to observe.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same alignment step could be added to evaluation protocols for other video generation tasks that tolerate small timing freedom.
Training losses for talking-head models might incorporate a similar alignment term to encourage robustness rather than exact frame matching.
Benchmark organizers could adopt sequence alignment to avoid unfairly penalizing methods that produce stylistically varied but still plausible motion.

Load-bearing premise

Bounded temporal misalignments are the primary mismatch between generated and reference videos, and adding Soft DTW leaves the perceptual, identity, and synchronization encoders unchanged.

What would settle it

Run the aligned and frame-wise metrics on the same generated videos after inserting controlled timing shifts of known size; if the two metric families still rank the methods identically, the claimed robustness gain is not observed.
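A controlled-shift probe of this kind could look like the sketch below. It is an assumption-laden toy, not the paper's protocol: the trajectory is a synthetic sine wave standing in for a per-frame feature, `delay` is a hypothetical helper that holds the first frame, and classic hard-minimum DTW is swapped in for Soft DTW to keep the block short. It tests metric sensitivity on a single sequence; comparing method rankings would repeat this per method.

```python
import math

def delay(seq, k):
    """Delay a 1-D trajectory by k frames, holding the first value (length preserved)."""
    return [seq[0]] * k + seq[: len(seq) - k]

def framewise_cost(a, b):
    """Rigid comparison: frame t of a is compared only with frame t of b."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dtw_cost(a, b):
    """Classic DTW cost (hard minimum) under monotonic, order-preserving alignment."""
    inf = float("inf")
    r = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    r[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = min(r[i - 1][j], r[i][j - 1], r[i - 1][j - 1])
            r[i][j] = (a[i - 1] - b[j - 1]) ** 2 + step
    return r[len(a)][len(b)]

# Synthetic stand-in for a feature trajectory (not data from the paper).
reference = [math.sin(2 * math.pi * t / 20) for t in range(60)]

# Sweep known shifts and print both costs side by side.
for k in range(0, 6):
    shifted = delay(reference, k)
    print(k, round(framewise_cost(reference, shifted), 3), round(dtw_cost(reference, shifted), 3))
```

Plotting the two columns against `k` gives a sensitivity curve for each metric family.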

Original abstract

Audio-driven talking-head generation has advanced rapidly, yet existing evaluation protocols mainly rely on frame-wise metrics that assume strict temporal correspondence between generated and reference videos. This assumption does not match speech-driven facial motion, which naturally includes slight timing shifts, different speaking speeds, and stylistic variations. As a result, conventional metrics may treat harmless timing differences as quality errors, making it harder to fairly compare methods and understand their trade-offs. In this work, we argue that evaluation of dynamic generative models should be formulated as a sequence-alignment problem rather than independent frame comparison. We introduce a unified sequence-level reformulation that integrates Soft Dynamic Time Warping into established evaluation pipelines. By aligning feature trajectories while preserving temporal order, the proposed framework provides robustness to bounded temporal misalignments without altering the underlying perceptual, identity, or synchronization encoders. We show that frame-wise evaluation can be viewed as a special case under rigid alignment, while sequence-level alignment provides improved stability, lower sensitivity to timing differences, and clearer separation between modeling paradigms. Building on this principled formulation, we conduct a large-scale benchmark of 20 methods across seven datasets spanning canonical, in-the-wild, and style-diverse scenarios under standardized protocols. Extensive experiments show that temporally aligned metrics are more robust to timing differences, provide more consistent results across datasets, and better reveal systematic trade-offs between modeling paradigms, such as synchronization versus realism and expressiveness versus stability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper reframes evaluation of audio-driven talking heads as a sequence alignment task with Soft DTW and backs the change with a 20-method, 7-dataset benchmark.

The main point is that frame-wise metrics punish harmless timing shifts in speech-driven faces, and the authors fix this by folding Soft DTW into the evaluation pipeline so that sequences can slide a bit while keeping order. Frame-wise comparison becomes the rigid special case. They keep the usual perceptual, identity, and sync encoders untouched.

What is new is the explicit sequence-level formulation for this task and the scale of the follow-up benchmark. They test 20 methods across seven datasets that cover clean, in-the-wild, and style-varying cases under one protocol. The experiments show the aligned metrics are steadier across datasets and surface the expected trade-offs (sync versus realism, expressiveness versus stability) more cleanly than the old numbers.

The work is straightforward and the central claim holds up on the evidence given. The benchmark is the real contribution; without it the reformulation would be just a suggestion. The soft spots are limited. The paper does not explore how sensitive the results are to the DTW smoothing parameter or to the exact feature trajectories chosen, and it would help to see whether any method rankings actually flip once alignment is allowed. Those are normal next-step questions rather than load-bearing problems.

This is for people who build or compare talking-head models and want evaluation that matches how speech actually varies. It is the kind of incremental but concrete methodological paper that deserves referee time because the benchmark is large enough to be informative and the change is easy to adopt. I would send it to review.

Referee Report

2 major / 2 minor

Summary. The paper claims that frame-wise metrics for audio-driven talking-head generation assume strict temporal correspondence that does not hold for natural speech-driven motion (timing shifts, speed variations). It reformulates evaluation as a sequence-alignment problem by integrating Soft Dynamic Time Warping (Soft DTW) into existing perceptual, identity, and synchronization evaluation pipelines, treating rigid frame-wise comparison as the special case of rigid alignment. The authors argue this yields robustness to bounded temporal differences without changing the underlying feature extractors, and they support the claim with a standardized benchmark of 20 methods across 7 datasets (canonical, in-the-wild, style-diverse), reporting improved stability, cross-dataset consistency, and clearer separation of modeling trade-offs such as synchronization vs. realism.

Significance. If the empirical results hold under the reported protocols, the work supplies a principled, encoder-preserving alternative to rigid alignment that directly addresses a known mismatch between evaluation assumptions and the generative process. The large-scale benchmark (20 methods, 7 datasets) and the explicit reduction of frame-wise metrics to a rigid-alignment special case are concrete strengths that could influence standard practice in talking-head and other temporally dynamic generation tasks.

major comments (2)

[§4] §4 (Benchmark setup): the claim that temporally aligned metrics are 'more robust to timing differences' and 'provide more consistent results across datasets' requires explicit quantification (e.g., variance reduction, rank stability, or sensitivity curves) with error bars; the abstract asserts these outcomes but the load-bearing evidence must be shown to be statistically distinguishable from frame-wise baselines.
[§3.2] §3.2 (Soft DTW integration): the statement that the framework 'does not alter the underlying perceptual, identity, or synchronization encoders' is central; the manuscript must demonstrate that the alignment operates strictly on the already-extracted feature trajectories and does not introduce any re-training or re-parameterization of those encoders.
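The rank-stability quantification requested in the first major comment could be computed roughly as follows. This is a sketch under assumptions: Kendall's tau-a is one common choice among several, the function names are hypothetical, and the scores are placeholder numbers chosen only to exercise the code, not results from the paper.

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two score lists over the same methods (ties count as neither)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        s = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / pairs

def mean_pairwise_tau(scores_by_dataset):
    """Average tau over all dataset pairs; 1.0 means identical method rankings everywhere."""
    names = sorted(scores_by_dataset)
    taus = [
        kendall_tau(scores_by_dataset[a], scores_by_dataset[b])
        for a, b in combinations(names, 2)
    ]
    return sum(taus) / len(taus)

# Placeholder scores for four hypothetical methods on three hypothetical datasets.
toy_scores = {
    "dataset_a": [0.30, 0.25, 0.40, 0.35],
    "dataset_b": [0.28, 0.33, 0.31, 0.45],
    "dataset_c": [0.41, 0.22, 0.38, 0.30],
}
print("mean pairwise tau:", mean_pairwise_tau(toy_scores))
```

Computing this once for the frame-wise scores and once for the aligned scores would give two numbers to compare; error bars would still require repeated runs or bootstrapping over clips.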

minor comments (2)

[§3] Notation: the distinction between rigid alignment (frame-wise) and Soft DTW alignment should be formalized with a single equation that shows the former as the limiting case of the latter (e.g., when the warping path is forced to the diagonal).
[Tables/Figures in §4] Table captions and axis labels in the benchmark figures should explicitly state whether reported values are means over multiple seeds or single runs, and whether the same random seeds were used for all compared methods.
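One way to write the single equation asked for in the notation comment is given below. It uses the notation of the Soft-DTW literature rather than the manuscript's own, which this page does not show: $\Delta(X, Y)$ is the $n \times m$ matrix of pairwise frame costs $\delta(x_i, y_j)$, $\mathcal{A}_{n,m}$ is the set of monotonic alignment matrices, and $\gamma \ge 0$ is the smoothing parameter. All of these symbols are assumptions for illustration.

```latex
\begin{align}
\operatorname{dtw}_{\gamma}(X, Y)
  &= {\min}^{\gamma} \bigl\{ \langle A, \Delta(X, Y) \rangle : A \in \mathcal{A}_{n,m} \bigr\}, \\
{\min}^{\gamma} \{ a_1, \dots, a_k \}
  &= \begin{cases}
       \min_{i \le k} a_i, & \gamma = 0, \\
       -\gamma \log \sum_{i=1}^{k} e^{-a_i / \gamma}, & \gamma > 0,
     \end{cases} \\
{\min}^{\gamma} \bigl\{ \langle I_n, \Delta(X, Y) \rangle \bigr\}
  &= \sum_{t=1}^{n} \delta(x_t, y_t)
  \qquad (n = m,\ \text{alignment set restricted to } \{ I_n \}).
\end{align}
```

The last line holds for every $\gamma$ because the soft minimum of a single value is that value. In this form the rigid case arises from restricting the set of admissible paths to the diagonal, not from a limit in $\gamma$; dividing by $n$ recovers the usual per-frame average.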

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation of minor revision. We address each major comment below and will strengthen the manuscript with the requested clarifications and analyses.

Referee: [§4] §4 (Benchmark setup): the claim that temporally aligned metrics are 'more robust to timing differences' and 'provide more consistent results across datasets' requires explicit quantification (e.g., variance reduction, rank stability, or sensitivity curves) with error bars; the abstract asserts these outcomes but the load-bearing evidence must be shown to be statistically distinguishable from frame-wise baselines.

Authors: We agree that the claims of improved robustness and cross-dataset consistency require explicit statistical support. In the revised manuscript we will augment Section 4 with variance-reduction percentages, rank-stability tables across the seven datasets, and sensitivity curves (with error bars from multiple random seeds) that directly compare temporally-aligned versus frame-wise metrics, thereby demonstrating statistical distinguishability.

revision: yes

Referee: [§3.2] §3.2 (Soft DTW integration): the statement that the framework 'does not alter the underlying perceptual, identity, or synchronization encoders' is central; the manuscript must demonstrate that the alignment operates strictly on the already-extracted feature trajectories and does not introduce any re-training or re-parameterization of those encoders.

Authors: The alignment step is applied exclusively to feature sequences that have already been produced by the frozen encoders; no gradients flow back to the encoders and no re-parameterization occurs. We will revise §3.2 to include an explicit pipeline diagram and pseudocode that isolate feature extraction from the subsequent Soft DTW computation, thereby making the separation unambiguous.

revision: yes
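The separation described in this response can be illustrated with a small two-stage sketch. Everything here is hypothetical: `Encoder`, `Scorer`, `extract_trajectory`, `evaluate`, and `toy_encoder` are names invented for the example, and a frame-wise scorer is plugged in only to show where an alignment-based scorer would go.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Encoder = Callable[[object], Vector]  # frozen: called for inference only
Scorer = Callable[[Sequence[Vector], Sequence[Vector]], float]

def extract_trajectory(frames: Sequence[object], encoder: Encoder) -> List[Vector]:
    """Stage 1: run the frozen encoder once per frame and keep only its outputs."""
    return [list(encoder(frame)) for frame in frames]

def evaluate(gen_frames, ref_frames, encoder: Encoder, scorer: Scorer) -> float:
    """Stage 2: score two extracted trajectories; the scorer never sees the encoder."""
    gen_feats = extract_trajectory(gen_frames, encoder)
    ref_feats = extract_trajectory(ref_frames, encoder)
    return scorer(gen_feats, ref_feats)

def rigid_scorer(a, b):
    """Frame-wise baseline: mean squared distance between same-index features."""
    total = sum(sum((x - y) ** 2 for x, y in zip(u, v)) for u, v in zip(a, b))
    return total / min(len(a), len(b))

def toy_encoder(frame):
    """Hypothetical stand-in encoder: maps a 'frame' (here a number) to a 2-D feature."""
    return [float(frame), float(frame) ** 2]

print(evaluate([0, 1, 2], [0, 1, 3], toy_encoder, rigid_scorer))
```

Because the scorer receives only lists of feature vectors, swapping `rigid_scorer` for a Soft-DTW scorer cannot touch encoder weights, which is the property the referee asks the authors to demonstrate.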

Circularity Check

0 steps flagged

No significant circularity; methodological reformulation is self-contained

The paper's central contribution is a methodological reframing of evaluation as sequence alignment via integration of the established Soft DTW technique into existing perceptual/identity/synchronization pipelines, with frame-wise metrics positioned as the rigid-alignment special case. No load-bearing equations, parameters, or uniqueness claims reduce by construction to fitted inputs, self-definitions, or self-citation chains. The large-scale benchmark across 20 methods and 7 datasets supplies independent empirical content. The argument does not rely on any of the enumerated circularity patterns and remains externally falsifiable via the reported consistency and trade-off observations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that timing shifts are the main evaluation mismatch and that DTW alignment preserves metric integrity; no free parameters or invented entities are specified in the abstract.

axioms (1)

domain assumption: Speech-driven facial motion naturally includes slight timing shifts, different speaking speeds, and stylistic variations that should not be treated as quality errors.
Explicitly stated in the abstract as the motivation for moving beyond frame-wise metrics.

pith-pipeline@v0.9.1-grok · 5794 in / 1130 out tokens · 34042 ms · 2026-06-28T16:17:36.777689+00:00 · methodology

[1] [1]

arXiv preprint arXiv:2301.037862(4), 5 (2023)

Shen, S., Zhao, W., Meng, Z., Li, W., Zhu, Z., Zhou, J., Lu, J.: Difftalk: Crafting diffusion models for generalized talking head synthesis. arXiv preprint arXiv:2301.037862(4), 5 (2023)

arXiv 2023

[2] [2]

arXiv preprint arXiv:2403.17694 (2024) 23

Wei, H., Yang, Z., Wang, Z.: Aniportrait: Audio-driven synthesis of photorealistic portrait animation. arXiv preprint arXiv:2403.17694 (2024) 23

arXiv 2024

[3] [3]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

Zhang, W., Cun, X., Wang, X., Zhang, Y., Shen, X., Guo, Y., Shan, Y., Wang, F.: Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8652–8661 (2023)

2023

[4] [4]

IEEE transactions on pattern analysis and machine intelligence45(9), 10850–10869 (2023)

Croitoru, F.-A., Hondru, V., Ionescu, R.T., Shah, M.: Diffusion models in vision: A survey. IEEE transactions on pattern analysis and machine intelligence45(9), 10850–10869 (2023)

2023

[5] [5]

ACM Computing Surveys57(2), 1–42 (2024)

Xing, Z., Feng, Q., Chen, H., Dai, Q., Hu, H., Xu, H., Wu, Z., Jiang, Y.-G.: A survey on video diffusion models. ACM Computing Surveys57(2), 1–42 (2024)

2024

[6] [6]

In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp

Li, J., Zhang, J., Bai, X., Zheng, J., Zhou, J., Gu, L.: Instag: Learning personalized 3d talking head from few-second video. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 10690–10700 (2025)

2025

[7] [7]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

Fan, Y., Lin, Z., Saito, J., Wang, W., Komura, T.: Faceformer: Speech-driven 3d facial animation with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18770–18780 (2022)

2022

[8] [8]

arXiv preprint arXiv:2107.09293 (2021)

Wang, S., Li, L., Ding, Y., Fan, C., Yu, X.: Audio2head: Audio-driven one-shot talking-head generation with natural head motion. arXiv preprint arXiv:2107.09293 (2021)

arXiv 2021

[9] [9]

In: Proceedings of the 28th ACM International Conference on Multimedia, pp

Prajwal, K., Mukhopadhyay, R., Namboodiri, V.P., Jawahar, C.: A lip sync expert is all you need for speech to lip generation in the wild. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 484–492 (2020)

2020

[10] [10]

In: Proceedings of the AAAI Conference on Artificial Intelligence, vol

Zhou, H., Liu, Y., Liu, Z., Luo, P., Wang, X.: Talking face generation by adver- sarially disentangled audio-visual representation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 9299–9306 (2019)

2019

[11] Chung, J.S., Zisserman, A.: Out of time: automated lip sync in the wild. In: Asian Conference on Computer Vision, pp. 251–263 (2016). Springer

[12] Thies, J., Elgharib, M., Tewari, A., Theobalt, C., Nießner, M.: Neural voice puppetry: Audio-driven facial reenactment. In: European Conference on Computer Vision, pp. 716–731 (2020). Springer

[13] Zhou, Y., Han, X., Shechtman, E., Echevarria, J., Kalogerakis, E., Li, D.: Makeittalk: speaker-aware talking-head animation. ACM Transactions On Graphics (TOG) 39(6), 1–15 (2020)

[14] Har, F., Javier, D.R.C.: Heygen’s ai video platform for english language teaching. The Asian Journal of Applied Linguistics 10(1), 1313–1313 (2026)

[15] Jin, A., Deng, Q., Deng, Z.: A live speech-driven avatar-mediated three-party telepresence system: design and evaluation. PRESENCE: Virtual and Augmented Reality 29, 113–139 (2020)

[16] Christoff, N., Neshov, N.N., Tonchev, K., Manolova, A.: Application of a 3d talking head as part of telecommunication ar, vr, mr system: Systematic review. Electronics 12(23), 4788 (2023)

[17] Zhu, Y., Zhang, L., Rong, Z., Hu, T., Liang, S., Ge, Z.: Infp: Audio-driven interactive head generation in dyadic conversations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10667–10677 (2025)

[18] Yan, Y., Zhou, Z., Wang, Z., Gao, J., Yang, X.: Dialoguenerf: Towards realistic avatar face-to-face conversation video generation. Visual Intelligence 2(1), 24 (2024)

[19] Bai, X., He, X., Ma, M., Wang, X., Jiang, W., Du, T., Huang, Z.: A survey on audio-driven talking face generation. In: 2025 IEEE 6th International Seminar on Artificial Intelligence, Networking and Information Technology (AINIT), pp. 1–6 (2025). IEEE

[20] Zhen, R., Song, W., He, Q., Cao, J., Shi, L., Luo, J.: Human-computer interaction system: A survey of talking-head generation. Electronics 12(1), 218 (2023)

[21] Gowda, S.N., Pandey, D., Gowda, S.N.: From pixels to portraits: A comprehensive survey of talking head generation techniques and applications. arXiv preprint arXiv:2308.16041 (2023)

[22] Zhang, Z., Li, L., Ding, Y., Fan, C.: Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3661–3670 (2021)

[23] Chung, J.S., Nagrani, A., Zisserman, A.: Voxceleb2: Deep speaker recognition. arXiv preprint arXiv:1806.05622 (2018)

[24] Wang, K., Wu, Q., Song, L., Yang, Z., Wu, W., Qian, C., He, R., Qiao, Y., Loy, C.C.: Mead: A large-scale audio-visual dataset for emotional talking-face generation. In: European Conference on Computer Vision, pp. 700–717 (2020). Springer

[25] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR (2018)

[26] Snell, J., Ridgeway, K., Liao, R., Roads, B.D., Mozer, M.C., Zemel, R.S.: Learning to generate images with perceptual similarity metrics. In: 2017 IEEE International Conference on Image Processing (ICIP), pp. 4277–4281 (2017). IEEE

[27] Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4690–4699 (2019)

[28] Chung, J.S., Zisserman, A.: Lip reading in the wild. In: BMVC (2016)

[29] Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: Vbench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21807–21818 (2024)

[30] Petitjean, F., Ketterlin, A., Gançarski, P.: A global averaging method for dynamic time warping, with applications to clustering. Pattern recognition 44(3), 678–693 (2011)

[31] Cuturi, M., Blondel, M.: Soft-dtw: a differentiable loss function for time-series. In: International Conference on Machine Learning, pp. 894–903 (2017). PMLR

[32] Ma, Y., Zhang, S., Wang, J., Wang, X., Zhang, Y., Deng, Z.: Dreamtalk: When expressive talking head generation meets diffusion probabilistic models. arXiv preprint arXiv:2312.09767 (2023)

[33] Zhang, Y., Minhao, L., Chen, Z., Wu, B., Zhan, C., He, Y., Huang, J., Zhou, W., et al.: Musetalk: Real-time high quality lip synchronization with latent space inpainting (2024)

[34] Wang, J., Qian, X., Zhang, M., Tan, R.T., Li, H.: Seeing what you said: Talking face generation guided by a lip reading expert. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14653–14662 (2023)

[35] Li, T., Zheng, R., Yang, M., Chen, J., Yang, M.: Ditto: Motion-space diffusion for controllable realtime talking head synthesis. In: Proceedings of the 33rd ACM International Conference on Multimedia, pp. 9704–9713 (2025)

[36] Liu, T., Chen, F., Fan, S., Du, C., Chen, Q., Chen, X., Yu, K.: Anitalker: animate vivid and diverse talking faces through identity-decoupled facial motion encoding. In: Proceedings of the 32nd ACM International Conference on Multimedia, pp. 6696–6705 (2024)

[37] Cao, X., Wang, G., Shi, S., Zhao, J., Yao, Y., Fei, J., Gao, M.: Joyvasa: portrait and animal image animation with diffusion-based audio-driven facial dynamics and head motion generation. arXiv preprint arXiv:2411.09209 (2024)

[38] Tan, S., Ji, B., Bi, M., Pan, Y.: Edtalk: Efficient disentanglement for emotional talking head synthesis. In: European Conference on Computer Vision, pp. 398–416 (2024). Springer

[39] Wang, C., Tian, K., Zhang, J., Guan, Y., Luo, F., Shen, F., Jiang, Z., Gu, Q., Han, X., Yang, W.: V-express: Conditional dropout for progressive training of portrait video generation. arXiv preprint arXiv:2406.02511 (2024)

[40] Chen, Z., Cao, J., Chen, Z., Li, Y., Ma, C.: Echomimic: Lifelike audio-driven portrait animations through editable landmark conditions. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, pp. 2403–2410 (2025)

[41] Zhong, W., Fang, C., Cai, Y., Wei, P., Zhao, G., Lin, L., Li, G.: Identity-preserving talking face generation with landmark and appearance priors. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738 (2023)

[42] Ki, T., Min, D., Chae, G.: Float: Generative motion latent flow matching for audio-driven talking portrait. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14699–14710 (2025)

[43] Xu, M., Li, H., Su, Q., Shang, H., Zhang, L., Liu, C., Wang, J., Yao, Y., Zhu, S.: Hallo: Hierarchical audio-driven visual synthesis for portrait image animation. arXiv preprint arXiv:2406.08801 (2024)

[44] Cui, J., Li, H., Yao, Y., Zhu, H., Shang, H., Cheng, K., Zhou, H., Zhu, S., Wang, J.: Hallo2: Long-duration and high-resolution audio-driven portrait image animation. arXiv preprint arXiv:2410.07718 (2024)

[45] Cui, J., Li, H., Zhan, Y., Shang, H., Cheng, K., Ma, Y., Mu, S., Zhou, H., Wang, J., Zhu, S.: Hallo3: Highly dynamic and realistic portrait image animation with video diffusion transformer. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 21086–21095 (2025)

[46] Zheng, L., Zhang, Y., Guo, H., Pan, J., Tan, Z., Lu, J., Tang, C., An, B., Yan, S.: Memo: Memory-guided diffusion for expressive talking video generation. arXiv preprint arXiv:2412.04448 (2024)

[47] Ji, X., Hu, X., Xu, Z., Zhu, J., Lin, C., He, Q., Zhang, J., Luo, D., Chen, Y., Lin, Q., et al.: Sonic: Shifting focus to global audio perception in portrait animation. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 193–203 (2025)

[48] Zhang, Z., Wang, L., Gao, Y., Zhang, Y.: Talking-head generation in practice. In: The Second International Workshop on Transformative Insights in Multifaceted Evaluation at The Web Conference 2026 (2026). https://openreview.net/forum?id=ns3TgZYQTZ

[49] Zhu, H., Wu, W., Zhu, W., Jiang, L., Tang, S., Zhang, L., Liu, Z., Loy, C.C.: Celebv-hq: A large-scale video facial attributes dataset. In: European Conference on Computer Vision, pp. 650–667 (2022). Springer

[50] Livingstone, S.R., Russo, F.A.: The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english. PloS one 13(5), 0196391 (2018)

[51] Hondru, V., Croitoru, F.A., Minaee, S., Ionescu, R.T., Sebe, N.: Masked image modeling: A survey. International Journal of Computer Vision 133(10), 7154–7200 (2025)

[52] Zhou, H., Sun, Y., Wu, W., Loy, C.C., Wang, X., Liu, Z.: Pose-controllable talking face generation by implicitly modularized audio-visual representation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4176–4186 (2021)

[53] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595 (2018)

[54] Guo, J., Zhang, D., Liu, X., Zhong, Z., Zhang, Y., Wan, P., Zhang, D.: Liveportrait: Efficient portrait animation with stitching and retargeting control. arXiv preprint arXiv:2407.03168 (2024)