Temporally-Aligned Evaluation for Audio-Driven Talking Head Generation
Pith reviewed 2026-06-28 16:17 UTC · model grok-4.3
The pith
Evaluation of audio-driven talking heads requires sequence alignment to handle natural timing variations in speech.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Reformulating the evaluation of audio-driven talking-head videos as a sequence-alignment problem, solved with Soft Dynamic Time Warping (Soft DTW), makes the metrics robust to bounded temporal misalignments while leaving the underlying encoders unchanged. Frame-wise comparison becomes the special case of rigid alignment, and the aligned metrics give more stable results that better expose trade-offs between synchronization and realism and between expressiveness and stability.
What carries the argument
Soft Dynamic Time Warping applied to feature trajectories inside established perceptual, identity, and synchronization evaluation pipelines to align sequences while preserving temporal order.
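To make the mechanism concrete, here is a minimal sketch of the standard Soft-DTW recursion applied to two feature trajectories. It is not the paper's implementation: the squared Euclidean frame cost, the function names (`soft_min`, `sq_dist`, `soft_dtw`, `framewise`), and the toy one-dimensional "features" are all assumptions chosen for illustration.

```python
import math

def soft_min(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    finite = [v for v in values if math.isfinite(v)]
    if not finite:
        return math.inf
    lo = min(finite)
    return lo - gamma * math.log(sum(math.exp(-(v - lo) / gamma) for v in finite))

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors (assumed frame cost)."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW value between feature sequences x and y (lists of vectors)."""
    n, m = len(x), len(y)
    r = [[math.inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Only monotone steps are allowed, so temporal order is preserved.
            prev = (r[i - 1][j], r[i][j - 1], r[i - 1][j - 1])
            r[i][j] = sq_dist(x[i - 1], y[j - 1]) + soft_min(prev, gamma)
    return r[n][m]

def framewise(x, y):
    """Rigid comparison: frame t of x is matched only with frame t of y."""
    return sum(sq_dist(a, b) for a, b in zip(x, y))

# Toy trajectories (made-up data): the same motion, delayed by one frame.
ref = [[math.sin(0.3 * t)] for t in range(20)]
gen = [ref[0]] + ref[:-1]
print(framewise(ref, gen), soft_dtw(ref, gen, gamma=0.01))
```

On this toy pair the aligned score should come out well below the frame-wise one, because the warping path can absorb the one-frame delay. Note that Soft-DTW can be slightly negative for near-identical sequences, so it behaves as a discrepancy score rather than a true distance.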
If this is right
- Aligned metrics reduce sensitivity to harmless timing differences that occur in natural speech.
- Scores become more consistent when the same methods are tested on different datasets.
- Trade-offs between synchronization performance and visual realism appear more clearly.
- Differences between expressiveness and temporal stability in the generated motion become easier to observe.
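The cross-dataset consistency in the second bullet could be quantified in several ways. One option, sketched here as an assumption rather than as the paper's protocol, is the mean pairwise Kendall rank correlation between the method rankings obtained on each dataset. The function names and the score values in the usage lines are made up for illustration.

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two score lists covering the same methods in the same order."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / pairs

def mean_pairwise_tau(scores_by_dataset):
    """Average tau over all dataset pairs; values near 1 mean rankings agree across datasets."""
    taus = [kendall_tau(a, b) for a, b in combinations(scores_by_dataset, 2)]
    return sum(taus) / len(taus)

# Hypothetical scores for three methods on three datasets (not from the paper).
toy_scores = [[0.10, 0.25, 0.40], [0.12, 0.20, 0.35], [0.30, 0.22, 0.50]]
print(mean_pairwise_tau(toy_scores))
```

Computing this statistic once for the frame-wise scores and once for the aligned scores would give a single number per metric family to compare.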
Where Pith is reading between the lines
- The same alignment step could be added to evaluation protocols for other video generation tasks that tolerate small timing freedom.
- Training losses for talking-head models might incorporate a similar alignment term to encourage robustness rather than exact frame matching.
- Benchmark organizers could adopt sequence alignment to avoid unfairly penalizing methods that produce stylistically varied but still plausible motion.
Load-bearing premise
Bounded temporal misalignments are the primary mismatch between generated and reference videos, and adding Soft DTW leaves the perceptual, identity, and synchronization encoders unchanged.
What would settle it
Run the aligned and frame-wise metrics on the same generated videos after inserting controlled timing shifts of known size; if the aligned scores and the resulting method rankings change as much as the frame-wise ones do, the claimed robustness gain is not observed.
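A controlled-shift check of this kind can be sketched as a small harness that delays each generated sequence by a known number of frames and records the method ranking a metric produces at every shift. Everything here is hypothetical scaffolding: the helper names, the hold-first-frame way of injecting a delay, and the toy scalar "features" are assumptions, and any sequence metric (frame-wise or aligned) can be passed in as `metric`.

```python
def delay(frames, k):
    """Delay a sequence by k frames by holding its first frame; length is preserved."""
    frames = list(frames)
    return ([frames[0]] * k + frames)[:len(frames)]

def ranking(scores):
    """Method names ordered from lowest (best) to highest discrepancy."""
    return sorted(scores, key=scores.get)

def rankings_under_shifts(metric, reference, generated, shifts):
    """Map each shift size to the ranking that `metric` yields after applying that shift."""
    result = {}
    for k in shifts:
        scores = {name: metric(reference, delay(seq, k)) for name, seq in generated.items()}
        result[k] = ranking(scores)
    return result

def sum_sq(ref, gen):
    """Frame-wise baseline on scalar features (toy stand-in for a real metric)."""
    return sum((a - b) ** 2 for a, b in zip(ref, gen))

# Made-up scalar trajectories for two hypothetical methods.
reference = [0, 1, 2, 3, 4]
generated = {"method_a": [0, 1, 2, 3, 4], "method_b": [0, 0, 0, 0, 0]}
print(rankings_under_shifts(sum_sq, reference, generated, shifts=[0, 1, 2]))
```

Running the harness with both metric families and comparing how often the ranking at shift `k` differs from the ranking at shift 0 gives a direct readout of the robustness claim.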
Original abstract
Audio-driven talking-head generation has advanced rapidly, yet existing evaluation protocols mainly rely on frame-wise metrics that assume strict temporal correspondence between generated and reference videos. This assumption does not match speech-driven facial motion, which naturally includes slight timing shifts, different speaking speeds, and stylistic variations. As a result, conventional metrics may treat harmless timing differences as quality errors, making it harder to fairly compare methods and understand their trade-offs. In this work, we argue that evaluation of dynamic generative models should be formulated as a sequence-alignment problem rather than independent frame comparison. We introduce a unified sequence-level reformulation that integrates Soft Dynamic Time Warping into established evaluation pipelines. By aligning feature trajectories while preserving temporal order, the proposed framework provides robustness to bounded temporal misalignments without altering the underlying perceptual, identity, or synchronization encoders. We show that frame-wise evaluation can be viewed as a special case under rigid alignment, while sequence-level alignment provides improved stability, lower sensitivity to timing differences, and clearer separation between modeling paradigms. Building on this principled formulation, we conduct a large-scale benchmark of 20 methods across seven datasets spanning canonical, in-the-wild, and style-diverse scenarios under standardized protocols. Extensive experiments show that temporally aligned metrics are more robust to timing differences, provide more consistent results across datasets, and better reveal systematic trade-offs between modeling paradigms, such as synchronization versus realism and expressiveness versus stability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that frame-wise metrics for audio-driven talking-head generation assume strict temporal correspondence that does not hold for natural speech-driven motion (timing shifts, speed variations). It reformulates evaluation as a sequence-alignment problem by integrating Soft Dynamic Time Warping (Soft DTW) into existing evaluation pipelines built on perceptual, identity, and synchronization encoders, treating frame-wise comparison as the special case of rigid alignment. The authors argue this yields robustness to bounded temporal differences without changing the underlying feature extractors, and they support the claim with a standardized benchmark of 20 methods across 7 datasets (canonical, in-the-wild, style-diverse), reporting improved stability, cross-dataset consistency, and clearer separation of modeling trade-offs such as synchronization vs. realism.
Significance. If the empirical results hold under the reported protocols, the work supplies a principled, encoder-preserving alternative to rigid alignment that directly addresses a known mismatch between evaluation assumptions and the generative process. The large-scale benchmark (20 methods, 7 datasets) and the explicit reduction of frame-wise metrics to a rigid-alignment special case are concrete strengths that could influence standard practice in talking-head and other temporally dynamic generation tasks.
major comments (2)
- [§4] §4 (Benchmark setup): the claim that temporally aligned metrics are 'more robust to timing differences' and 'provide more consistent results across datasets' requires explicit quantification (e.g., variance reduction, rank stability, or sensitivity curves) with error bars; the abstract asserts these outcomes but the load-bearing evidence must be shown to be statistically distinguishable from frame-wise baselines.
- [§3.2] §3.2 (Soft DTW integration): the statement that the framework 'does not alter the underlying perceptual, identity, or synchronization encoders' is central; the manuscript must demonstrate that the alignment operates strictly on the already-extracted feature trajectories and does not introduce any re-training or re-parameterization of those encoders.
minor comments (2)
- [§3] Notation: the distinction between rigid alignment (frame-wise) and Soft DTW alignment should be formalized with a single equation that shows the former as the limiting case of the latter (e.g., when the warping path is forced to the diagonal).
- [Tables/Figures in §4] Table captions and axis labels in the benchmark figures should explicitly state whether reported values are means over multiple seeds or single runs, and whether the same random seeds were used for all compared methods.
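The single equation requested in the notation comment could take roughly the following form. This uses the general Soft-DTW definition rather than the paper's own notation, which is not shown here: $\Delta(X, Y) = [\delta(x_i, y_j)]_{ij}$ is the pairwise frame-cost matrix, $\mathcal{A}$ is the set of monotone alignment matrices, and $\gamma$ is the smoothing parameter. All symbols are assumptions for illustration.

```latex
% General Soft-DTW form; the paper's symbols may differ.
\[
\operatorname{dtw}_{\gamma}(X, Y)
  = {\min}^{\gamma}\bigl\{\, \langle A, \Delta(X, Y) \rangle : A \in \mathcal{A} \,\bigr\},
\qquad
{\min}^{\gamma}\{a_1, \dots, a_k\}
  = -\gamma \log \sum_{i=1}^{k} e^{-a_i / \gamma} \quad (\gamma > 0).
\]
% Rigid alignment: for sequences of equal length n, allow only the diagonal path.
\[
\mathcal{A} = \{ I_n \}
\;\Longrightarrow\;
\operatorname{dtw}_{\gamma}(X, Y) = \sum_{t=1}^{n} \delta(x_t, y_t)
\quad \text{for every } \gamma > 0 .
\]
```

With a single admissible path the smoothed minimum reduces to that path's cost, so the frame-wise score is recovered exactly, whatever the value of $\gamma$.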
Simulated Author's Rebuttal
We thank the referee for the positive assessment and recommendation of minor revision. We address each major comment below and will strengthen the manuscript with the requested clarifications and analyses.
- Referee: [§4] §4 (Benchmark setup): the claim that temporally aligned metrics are 'more robust to timing differences' and 'provide more consistent results across datasets' requires explicit quantification (e.g., variance reduction, rank stability, or sensitivity curves) with error bars; the abstract asserts these outcomes but the load-bearing evidence must be shown to be statistically distinguishable from frame-wise baselines.
  Authors: We agree that the claims of improved robustness and cross-dataset consistency require explicit statistical support. In the revised manuscript we will augment Section 4 with variance-reduction percentages, rank-stability tables across the seven datasets, and sensitivity curves (with error bars from multiple random seeds) that directly compare temporally-aligned versus frame-wise metrics, thereby demonstrating statistical distinguishability.
  Revision: yes
- Referee: [§3.2] §3.2 (Soft DTW integration): the statement that the framework 'does not alter the underlying perceptual, identity, or synchronization encoders' is central; the manuscript must demonstrate that the alignment operates strictly on the already-extracted feature trajectories and does not introduce any re-training or re-parameterization of those encoders.
  Authors: The alignment step is applied exclusively to feature sequences that have already been produced by the frozen encoders; no gradients flow back to the encoders and no re-parameterization occurs. We will revise §3.2 to include an explicit pipeline diagram and pseudocode that isolate feature extraction from the subsequent Soft DTW computation, thereby making the separation unambiguous.
  Revision: yes
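The separation described in the second response can be illustrated with a two-stage sketch in which the encoder is called once per frame and both the frame-wise and the aligned score are computed afterwards from the stored features. This is not the authors' promised pseudocode: `evaluate_both`, its parameters, and the toy encoder in the usage lines are hypothetical, and the encoder is treated as an opaque callable.

```python
def evaluate_both(encoder, ref_frames, gen_frames, framewise_distance, aligned_distance):
    """Score one video pair two ways without touching the encoder after extraction."""
    # Stage 1: feature extraction. The encoder is only called, never modified.
    ref_feats = [encoder(frame) for frame in ref_frames]
    gen_feats = [encoder(frame) for frame in gen_frames]
    # Stage 2: comparison. Both scores see only the stored feature sequences.
    return {
        "framewise": framewise_distance(ref_feats, gen_feats),
        "aligned": aligned_distance(ref_feats, gen_feats),
    }

# Toy usage with a stand-in encoder and placeholder distances (illustration only).
def toy_encoder(frame):
    return [float(frame)]

def l1_framewise(a, b):
    return sum(abs(p[0] - q[0]) for p, q in zip(a, b))

print(evaluate_both(toy_encoder, [0, 1, 2], [0, 2, 2], l1_framewise, l1_framewise))
```

Because stage 2 receives plain feature lists, any alignment routine can be swapped in as `aligned_distance` without re-running or re-parameterizing the encoder.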
Circularity Check
No significant circularity; methodological reformulation is self-contained
The paper's central contribution is a methodological reframing of evaluation as sequence alignment via integration of the established Soft DTW technique into existing perceptual/identity/synchronization pipelines, with frame-wise metrics positioned as the rigid-alignment special case. No load-bearing equations, parameters, or uniqueness claims reduce by construction to fitted inputs, self-definitions, or self-citation chains. The large-scale benchmark across 20 methods and 7 datasets supplies independent empirical content. The argument does not rely on any of the enumerated circularity patterns and remains externally falsifiable via the reported consistency and trade-off observations.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Speech-driven facial motion naturally includes slight timing shifts, different speaking speeds, and stylistic variations that should not be treated as quality errors.