JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation

Guoxin Wang; Jintao Fei; Jun Zhao; Minyu Gao; Pei Xie; Sheng Shi; Xuyang Cao; Yang Yao

arXiv: 2411.09209 · v5 · submitted 2024-11-14 · 💻 cs.CV

Pith reviewed 2026-05-23 16:52 UTC · model grok-4.3

classification 💻 cs.CV

keywords audio-driven facial animation, diffusion transformer, decoupled representation, head motion generation, portrait animation, animal face animation, multilingual support

The pith

Decoupling static 3D faces from audio-driven motions enables animation of any portrait or animal face.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a two-stage method for audio-driven facial animation. First, it separates static 3D facial structure from dynamic expressions to allow flexible combination. Second, it uses a diffusion transformer to create motion sequences from audio without depending on the character's identity. This setup supports longer videos and extends the animation to animal faces using the same process. A generator then renders the final video from the static representation and motions.

Core claim

JoyVASA separates dynamic facial expressions from static 3D facial representations in the first stage, allowing any static 3D face to pair with generated motions. In the second stage, a diffusion transformer generates motion sequences directly from audio in an identity-independent manner. The generator then renders high-quality animations, extending the method to animal faces seamlessly.
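
The pairing described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: `generate_motion` and `render` are hypothetical stand-ins (a fixed random projection and a simple addition), not the paper's diffusion transformer or generator, and the feature sizes are arbitrary. It only shows the structural point that a motion sequence computed from audio alone can be applied to different static representations.

```python
import numpy as np

def generate_motion(audio_features: np.ndarray, motion_dim: int = 8) -> np.ndarray:
    """Hypothetical stand-in for the stage-2 motion generator.

    It receives only audio features, never an identity, so the result can be
    reused across faces. A fixed random projection replaces the real model.
    """
    proj = np.random.default_rng(42).standard_normal(
        (audio_features.shape[1], motion_dim)
    )
    return np.tanh(audio_features @ proj)  # shape (T, motion_dim)

def render(static_repr: np.ndarray, motion: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the stage-1 generator.

    Each toy 'frame' is the static representation offset by that step's
    motion, via broadcasting (1, D) + (T, D) -> (T, D).
    """
    return static_repr[None, :] + motion

rng = np.random.default_rng(0)
audio = rng.standard_normal((50, 16))   # 50 audio frames, 16-dim features (arbitrary sizes)
human_face = rng.standard_normal(8)     # toy static representation of a portrait
animal_face = rng.standard_normal(8)    # toy static representation of an animal face

motion = generate_motion(audio)         # computed once, from audio only
human_video = render(human_face, motion)
animal_video = render(animal_face, motion)
```

In this toy setting both outputs carry exactly the same motion offsets. Whether a real generator preserves quality under such a swap is the open question raised under the load-bearing premise.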

What carries the argument

The decoupled facial representation framework that separates dynamic facial expressions from static 3D facial representations, combined with an identity-independent diffusion transformer for motion generation from audio.

If this is right

Longer videos become possible because generated motion sequences can be combined with any static 3D facial representation.
Animal faces can be animated using the same audio-to-motion generator without retraining.
Multilingual audio support follows from training on a hybrid dataset of private Chinese and public English data.
Inter-frame continuity improves because motions are generated as sequences rather than frame-by-frame.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar decoupling could apply to animating full human bodies or objects if 3D representations are available.
The identity-independent motion might allow mixing motions from different audio sources for creative editing.

Load-bearing premise

That combining any static 3D facial representation with the generated motion sequences produces high-quality animations without introducing artifacts or breaking consistency between frames.

What would settle it

A test where a generated motion sequence is applied to a new static 3D animal or human face and the resulting video shows visible artifacts, flickering, or mismatched expressions.
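
One way to make "flickering" in such a test measurable is a simple temporal-difference score. The sketch below is an assumption for illustration, not a metric taken from the paper: `flicker_score` is a hypothetical helper that averages the absolute change between consecutive frames, and the synthetic clips stand in for rendered videos.

```python
import numpy as np

def flicker_score(frames: np.ndarray) -> float:
    """Mean absolute change between consecutive frames.

    frames: array of shape (T, H, W) or (T, H, W, C) with T >= 2.
    Higher values mean larger frame-to-frame changes.
    """
    frames = np.asarray(frames, dtype=np.float64)
    if frames.shape[0] < 2:
        raise ValueError("need at least two frames")
    return float(np.mean(np.abs(np.diff(frames, axis=0))))

# Synthetic check: a smooth brightness ramp versus the same ramp with
# independent per-frame noise added (a crude model of flicker).
t = np.linspace(0.0, 1.0, 30)
smooth = np.broadcast_to(t[:, None, None], (30, 4, 4)).copy()
noisy = smooth + np.random.default_rng(0).normal(0.0, 0.2, size=smooth.shape)

smooth_score = flicker_score(smooth)
noisy_score = flicker_score(noisy)
```

A raw difference score also rises with legitimate fast motion, so in practice it would be compared across methods on the same audio and the same static face rather than read as an absolute number.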

Figures

Figures reproduced from arXiv: 2411.09209 by Guoxin Wang, Jintao Fei, Jun Zhao, Minyu Gao, Pei Xie, Sheng Shi, Xuyang Cao, Yang Yao.

**Figure 2.** Training process of the audio-driven motion sequence generation. The audio feature and real motion sequences …

**Figure 3.** Visualization results of different methods on the CelebV-HQ test dataset.

**Figure 4.** Visualization results of different portraits driven by the same audio input on the Openset dataset. Note that …

Original abstract

Audio-driven portrait animation has made significant advances with diffusion-based models, improving video quality and lipsync accuracy. However, the increasing complexity of these models has led to inefficiencies in training and inference, as well as constraints on video length and inter-frame continuity. In this paper, we propose JoyVASA, a diffusion-based method for generating facial dynamics and head motion in audio-driven facial animation. Specifically, in the first stage, we introduce a decoupled facial representation framework that separates dynamic facial expressions from static 3D facial representations. This decoupling allows the system to generate longer videos by combining any static 3D facial representation with dynamic motion sequences. Then, in the second stage, a diffusion transformer is trained to generate motion sequences directly from audio cues, independent of character identity. Finally, a generator trained in the first stage uses the 3D facial representation and the generated motion sequences as inputs to render high-quality animations. With the decoupled facial representation and the identity-independent motion generation process, JoyVASA extends beyond human portraits to animate animal faces seamlessly. The model is trained on a hybrid dataset of private Chinese and public English data, enabling multilingual support. Experimental results validate the effectiveness of our approach. Future work will focus on improving real-time performance and refining expression control, further expanding the applications in portrait animation. The code is available at: https://github.com/jdh-algo/JoyVASA.
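
The abstract says a diffusion transformer is trained on motion sequences but gives no details of the objective. As a rough illustration only, the sketch below shows the standard DDPM forward-noising step applied to a motion sequence. The linear beta schedule, the 1000 steps, the motion dimensions, and the noise-prediction target are all assumptions, not details confirmed by this page.

```python
import numpy as np

def make_alpha_bar(num_steps: int = 1000,
                   beta_start: float = 1e-4,
                   beta_end: float = 0.02) -> np.ndarray:
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def noise_motion(x0: np.ndarray, t: int, alpha_bar: np.ndarray,
                 rng: np.random.Generator):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(a_bar_t) * x_0, (1 - a_bar_t) * I)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.standard_normal((100, 8))        # toy "real" motion sequence: 100 frames, 8 dims
t = int(rng.integers(0, len(alpha_bar)))  # random diffusion step for this training sample
x_t, eps = noise_motion(x0, t, alpha_bar, rng)

# In training, a denoiser conditioned on audio features would receive
# (x_t, t, audio) and be fitted to recover eps (or x0, depending on the
# parameterization, which is not specified on this page).
```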

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The decoupled static 3D face plus identity-free diffusion motion generator lets them claim longer clips and animal animation, but the abstract supplies zero metrics or ablations to check if it works.

The paper's main move is a two-stage split: first a framework that pulls apart static 3D facial structure from dynamic expressions, then a diffusion transformer that turns audio into motion sequences without tying them to any particular identity. That separation is meant to let you plug any static 3D model into the generated motions, which they say supports longer videos and swaps in animal faces without retraining everything.

Referee Report

2 major / 1 minor

Summary. The paper presents JoyVASA, a two-stage diffusion-based method for audio-driven facial animation of portraits and animals. Stage 1 introduces a decoupled facial representation that separates static 3D facial features from dynamic expressions, allowing any static representation to be paired with generated motions for longer videos. Stage 2 trains a diffusion transformer to produce identity-independent motion sequences directly from audio. A generator then renders the final animation. The method is trained on a hybrid private Chinese and public English dataset for multilingual support and claims seamless extension to animal faces. The abstract states that experimental results validate the approach, and code is released.

Significance. If the central claims hold, the decoupling of static 3D representation from identity-independent motion generation could improve training efficiency, support longer sequences, and enable cross-species animation without per-identity retraining. The hybrid dataset for multilingual capability and public code release are concrete strengths that would aid reproducibility and adoption if quantitative validation is supplied.

major comments (2)

[Abstract] The manuscript states that 'Experimental results validate the effectiveness of our approach' but supplies no quantitative metrics, ablation studies, error analysis, comparison tables, or figures. This absence is load-bearing because the central claims concern improved video quality, lipsync accuracy, inter-frame continuity, and seamless animal-face extension.
[Abstract] (decoupled representation and motion generation): The claim that 'combining any static 3D facial representation with dynamic motion sequences' yields high-quality animations without artifacts or loss of continuity rests on an unexamined assumption. No analysis, experiments, or failure-case discussion addresses inter-frame consistency or artifact introduction when swapping static representations, which directly underpins the extension to animal faces and longer videos.

minor comments (1)

[Abstract] The statement that the model 'extends beyond human portraits to animate animal faces seamlessly' is presented without any supporting examples, qualitative results, or discussion of domain-specific challenges (e.g., differing facial topology).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and the central claims. We address each major comment point-by-point below.

Referee: [Abstract] The manuscript states that 'Experimental results validate the effectiveness of our approach' but supplies no quantitative metrics, ablation studies, error analysis, comparison tables, or figures. This absence is load-bearing because the central claims concern improved video quality, lipsync accuracy, inter-frame continuity, and seamless animal-face extension.

Authors: The abstract serves as a concise summary; the full manuscript contains a dedicated Experiments section with quantitative metrics, ablation studies, comparison tables, and figures evaluating video quality, lipsync accuracy, and continuity. To make the validation explicit in the abstract itself, we will revise it to reference key results (e.g., specific metrics on lipsync and quality). revision: yes

Referee: [Abstract] (decoupled representation and motion generation): The claim that 'combining any static 3D facial representation with dynamic motion sequences' yields high-quality animations without artifacts or loss of continuity rests on an unexamined assumption. No analysis, experiments, or failure-case discussion addresses inter-frame consistency or artifact introduction when swapping static representations, which directly underpins the extension to animal faces and longer videos.

Authors: The two-stage design trains the diffusion transformer on identity-independent motions and uses a generator that accepts arbitrary static 3D representations as input, enabling the claimed flexibility. We agree that explicit analysis of swapping is needed to support the animal-face and long-video claims. We will add a new subsection with quantitative inter-frame consistency metrics, artifact analysis, and failure cases for cross-identity and cross-species swaps. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The provided abstract and description present JoyVASA as a two-stage framework: a decoupled facial representation separating static 3D identity from dynamic expressions, followed by an identity-independent diffusion transformer generating motion sequences from audio. No equations, fitted parameters, or self-citations are shown that would reduce any claimed prediction or result to an input quantity by construction. The extension to animal faces is asserted as a direct consequence of the decoupling, without evidence of self-definitional loops or renamed empirical patterns. The derivation chain is self-contained against external benchmarks and does not rely on load-bearing self-citations or ansatzes smuggled via prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5806 in / 1010 out tokens · 39596 ms · 2026-05-23T16:52:29.688069+00:00 · methodology


[1] [1]

Geneface++: Generalized and stable real-time audio-driven 3d talking face generation

Zhenhui Ye, Jinzheng He, Ziyue Jiang, Rongjie Huang, Jiawei Huang, Jinglin Liu, Yi Ren, Xiang Yin, Zejun Ma, and Zhou Zhao. Geneface++: Generalized and stable real-time audio-driven 3d talking face generation. arXiv preprint arXiv:2305.00787, 2023

work page arXiv 2023

[2] [2]

Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation,

Jiahao Cui, Hui Li, Yao Yao, Hao Zhu, Hanlin Shang, Kaihui Cheng, Hang Zhou, Siyu Zhu, and Jingdong Wang. Hallo2: Long-duration and high-resolution audio-driven portrait image animation. arXiv preprint arXiv:2410.07718, 2024

work page arXiv 2024

[3] [3]

AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation,

Huawei Wei, Zejun Yang, and Zhisheng Wang. Aniportrait: Audio-driven synthesis of photorealistic portrait animation. arXiv preprint arXiv:2403.17694, 2024

work page arXiv 2024

[4] [4]

Loopy: Taming audio- driven portrait avatar with long-term motion dependency

Jianwen Jiang, Chao Liang, Jiaqi Yang, Gaojie Lin, Tianyun Zhong, and Yanbo Zheng. Loopy: Taming audio- driven portrait avatar with long-term motion dependency. arXiv preprint arXiv:2409.02634, 2024

work page arXiv 2024

[5] [5]

Vlogger: Multimodal diffusion for embodied avatar synthesis

Enric Corona, Andrei Zanfir, Eduard Gabriel Bazavan, Nikos Kolotouros, Thiemo Alldieck, and Cristian Smin- chisescu. Vlogger: Multimodal diffusion for embodied avatar synthesis. arXiv preprint arXiv:2403.08764 , 2024

work page arXiv 2024

[6] [6]

Emo: Emote portrait alive - generating expressive portrait videos with audio2video diffusion model under weak conditions, 2024

Linrui Tian, Qi Wang, Bang Zhang, and Liefeng Bo. Emo: Emote portrait alive - generating expressive portrait videos with audio2video diffusion model under weak conditions, 2024. 9 JoyV ASA A PREPRINT

work page 2024

[7] [7]

Emotalker: Emotionally editable talking face generation via diffusion model

Bingyuan Zhang, Xulong Zhang, Ning Cheng, Jun Yu, Jing Xiao, and Jianzong Wang. Emotalker: Emotionally editable talking face generation via diffusion model. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8276–8280. IEEE, 2024

work page 2024

[8] [8]

Digital avatars: Promoting independent living for older adults

Manuel F Bertoa, Nathalie Moreno, Alejandro Perez-Vereda, David Bandera, José M Álvarez-Palomo, and Carlos Canal. Digital avatars: Promoting independent living for older adults. Wireless Communications and Mobile Computing, 2020(1):8891002, 2020

work page 2020

[9] [9]

Talking face generation with multilingual tts

Hyoung-Kyu Song, Sang Hoon Woo, Junhyeok Lee, Seungmin Yang, Hyunjae Cho, Youseong Lee, Dongho Choi, and Kang-wook Kim. Talking face generation with multilingual tts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21425–21430, 2022

work page 2022

[10] [10]

Improving user experience of virtual health assistants: scoping review

Rachel G Curtis, Bethany Bartel, Ty Ferguson, Henry T Blake, Celine Northcott, Rosa Virgara, and Carol A Maher. Improving user experience of virtual health assistants: scoping review. Journal of medical Internet research, 23(12):e31737, 2021

work page 2021

[11] [11]

Chatanything: Facetime chat with llm-enhanced personas

Yilin Zhao, Xinbin Yuan, Shanghua Gao, Zhijie Lin, Qibin Hou, Jiashi Feng, and Daquan Zhou. Chatanything: Facetime chat with llm-enhanced personas. arXiv preprint arXiv:2311.06772, 2023

work page arXiv 2023

[12] [12]

Building llm-based ai agents in social virtual reality

Hongyu Wan, Jinda Zhang, Abdulaziz Arif Suria, Bingsheng Yao, Dakuo Wang, Yvonne Coady, and Mirjana Prpa. Building llm-based ai agents in social virtual reality. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, pages 1–7, 2024

work page 2024

[13] [13]

Stylesync: High-fidelity generalized and personalized lip sync in style-based generator, 2023

Jiazhi Guan, Zhanwang Zhang, Hang Zhou, Tianshu Hu, Kaisiyuan Wang, Dongliang He, Haocheng Feng, Jingtuo Liu, Errui Ding, Ziwei Liu, and Jingdong Wang. Stylesync: High-fidelity generalized and personalized lip sync in style-based generator, 2023

work page 2023

[14] [14]

Echomimic: Lifelike audio-driven portrait animations through editable landmark conditioning, 2024

Zhiyuan Chen, Jiajiong Cao, Zhiquan Chen, Yuming Li, and Chenguang Ma. Echomimic: Lifelike audio-driven portrait animations through editable landmark conditioning, 2024

work page 2024

[15] [15]

Hallo: Hierarchical audio-driven visual synthesis for portrait image animation, 2024

Mingwang Xu, Hui Li, Qingkun Su, Hanlin Shang, Liwei Zhang, Ce Liu, Jingdong Wang, Yao Yao, and Siyu zhu. Hallo: Hierarchical audio-driven visual synthesis for portrait image animation, 2024

work page 2024

[16] [16]

Joyhallo: Digital human model for mandarin

Sheng Shi, Xuyang Cao, Jun Zhao, and Guoxin Wang. Joyhallo: Digital human model for mandarin. arXiv preprint arXiv:2409.13268, 2024

work page arXiv 2024

[17] [17]

Liveportrait: Efficient portrait animation with stitching and retargeting control

Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, and Di Zhang. Liveportrait: Efficient portrait animation with stitching and retargeting control. arXiv preprint arXiv:2407.03168, 2024

work page arXiv 2024

[18] [18]

A lip sync expert is all you need for speech to lip generation in the wild

KR Prajwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, and CV Jawahar. A lip sync expert is all you need for speech to lip generation in the wild. In Proceedings of the 28th ACM international conference on multimedia, pages 484–492, 2020

[19]

Dinet: Deformation inpainting network for realistic face visually dubbing on high resolution video

Zhimeng Zhang, Zhipeng Hu, Wenjin Deng, Changjie Fan, Tangjie Lv, and Yu Ding. Dinet: Deformation inpainting network for realistic face visually dubbing on high resolution video. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 3543–3551, 2023

[20]

Synctalkface: Talking face generation with precise lip-syncing via audio-lip memory

Se Jin Park, Minsu Kim, Joanna Hong, Jeongsoo Choi, and Yong Man Ro. Synctalkface: Talking face generation with precise lip-syncing via audio-lip memory. In AAAI, pages 2062–2070. Association for the Advancement of Artificial Intelligence, 2022

[21]

Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation

Wenxuan Zhang, Xiaodong Cun, Xuan Wang, Yong Zhang, Xi Shen, Yu Guo, Ying Shan, and Fei Wang. Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation. In CVPR, pages 8652–8661, 2023

[22]

DreamTalk: When Emotional Talking Head Generation Meets Diffusion Probabilistic Models

Yifeng Ma, Shiwei Zhang, Jiayu Wang, Xiang Wang, Yingya Zhang, and Zhidong Deng. Dreamtalk: When expressive talking head generation meets diffusion probabilistic models. arXiv preprint arXiv:2312.09767, 2023

[23]

VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior

Xusen Sun, Longhao Zhang, Hao Zhu, Peng Zhang, Bang Zhang, Xinya Ji, Kangneng Zhou, Daiheng Gao, Liefeng Bo, and Xun Cao. Vividtalk: One-shot audio-driven talking head generation based on 3d hybrid prior. arXiv preprint arXiv:2312.01841, 2023

[24]

Moda: Mapping-once audio-driven portrait animation with dual attentions

Yunfei Liu, Lijian Lin, Fei Yu, Changyin Zhou, and Yu Li. Moda: Mapping-once audio-driven portrait animation with dual attentions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23020–23029, 2023

[25]

Hierarchical cross-modal talking face generation with dynamic pixel-wise loss

Lele Chen, Ross K Maddox, Zhiyao Duan, and Chenliang Xu. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. In CVPR, pages 7832–7841, 2019

[26]

Neural voice puppetry: Audio-driven facial reenactment

Justus Thies, Mohamed Elgharib, Ayush Tewari, Christian Theobalt, and Matthias Nießner. Neural voice puppetry: Audio-driven facial reenactment. In Computer Vision–ECCV 2020, pages 716–731. Springer, 2020

[27]

Audio-driven talking face video generation with learning-based personalized head pose, 2020

Ran Yi, Zipeng Ye, Juyong Zhang, Hujun Bao, and Yong-Jin Liu. Audio-driven talking face video generation with learning-based personalized head pose, 2020

[28]

One-shot free-view neural talking-head synthesis for video conferencing

Ting-Chun Wang, Arun Mallya, and Ming-Yu Liu. One-shot free-view neural talking-head synthesis for video conferencing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10039–10049, 2021

[29]

Pirenderer: Controllable portrait image generation via semantic neural rendering

Yurui Ren, Ge Li, Yuanqi Chen, Thomas H Li, and Shan Liu. Pirenderer: Controllable portrait image generation via semantic neural rendering. In Proceedings of the IEEE/CVF international conference on computer vision, pages 13759–13768, 2021

[30]

VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time

Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, and Baining Guo. Vasa-1: Lifelike audio-driven talking faces generated in real time. arXiv preprint arXiv:2404.10667, 2024

[31]

Megaportraits: One-shot megapixel neural head avatars

Nikita Drobyshev, Jenya Chelishev, Taras Khakhulin, Aleksei Ivakhnenko, Victor Lempitsky, and Egor Zakharov. Megaportraits: One-shot megapixel neural head avatars. In Proceedings of the 30th ACM International Conference on Multimedia, pages 2663–2671, 2022

[32]

A morphable model for the synthesis of 3d faces

Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3d faces. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 157–164. 2023

[33]

Expression invariant 3d face recognition with a morphable model

Brian Amberg, Reinhard Knothe, and Thomas Vetter. Expression invariant 3d face recognition with a morphable model. In 2008 IEEE International Conference on Automatic Face & Gesture Recognition, pages 1–6. IEEE, 2008

[34]

Learning a model of facial shape and expression from 4d scans

Tianye Li, Timo Bolkart, Michael J Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4d scans. ACM Trans. Graph., 36(6):194:1–194:17, 2017

[35]

Disentangled representation learning for 3d face shape, 2019

Zi-Hang Jiang, Qianyi Wu, Keyu Chen, and Juyong Zhang. Disentangled representation learning for 3d face shape, 2019

[36]

Emoportraits: Emotion-enhanced multimodal one-shot head avatars

Nikita Drobyshev, Antoni Bigata Casademunt, Konstantinos Vougioukas, Zoe Landgraf, Stavros Petridis, and Maja Pantic. Emoportraits: Emotion-enhanced multimodal one-shot head avatars. In CVPR, pages 8498–8507, 2024

[37]

First order motion model for image animation

Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. First order motion model for image animation. Advances in neural information processing systems, 32, 2019

[38]

wav2vec 2.0: A framework for self-supervised learning of speech representations

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33:12449–12460, 2020

[39]

Capture, learning, and synthesis of 3d speaking styles

Daniel Cudeiro, Timo Bolkart, Cassidy Laidlaw, Anurag Ranjan, and Michael J Black. Capture, learning, and synthesis of 3d speaking styles. In CVPR, pages 10101–10111, 2019

[40]

Diffposetalk: Speech-driven stylistic 3d facial animation and head pose generation via diffusion models

Zhiyao Sun, Tian Lv, Sheng Ye, Matthieu Lin, Jenny Sheng, Yu-Hui Wen, Minjing Yu, and Yong-Jin Liu. Diffposetalk: Speech-driven stylistic 3d facial animation and head pose generation via diffusion models. ACM Transactions on Graphics (TOG), 43(4):1–9, 2024

[41]

Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset

Zhimeng Zhang, Lincheng Li, Yu Ding, and Changjie Fan. Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset. In CVPR, Jun 2021

[42]

Celebv-hq: A large-scale video facial attributes dataset

Hao Zhu, Wayne Wu, Wentao Zhu, Liming Jiang, Siwei Tang, Li Zhang, Ziwei Liu, and Chen Change Loy. Celebv-hq: A large-scale video facial attributes dataset. In ECCV, pages 650–667. Springer, 2022

[43]

Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels

Haoning Wu, Zicheng Zhang, Weixia Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Yixuan Gao, Annan Wang, Erli Zhang, Wenxiu Sun, et al. Q-align: Teaching lmms for visual scoring via discrete text-defined levels. arXiv preprint arXiv:2312.17090, 2023

[44]

Out of time: Automated lip sync in the wild

Joon Son Chung and Andrew Zisserman. Out of time: Automated lip sync in the wild. In ACCV Workshops, 2016

[45]

FVD: A new metric for video generation

Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. FVD: A new metric for video generation. In Deep Generative Models for Highly Structured Data, ICLR 2019 Workshop, New Orleans, Louisiana, United States, May 6, 2019. OpenReview.net, 2019
