AudioFace: Language-Assisted Speech-Driven Facial Animation with Multimodal Language Models
Pith reviewed 2026-05-11 01:47 UTC · model grok-4.3
The pith
Adding linguistic and phonetic cues from multimodal language models improves how speech audio maps to accurate facial blendshapes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AudioFace is a language-assisted framework for speech-driven blendshape generation that treats mouth-related facial coefficient prediction as a structured generation problem guided by linguistic and articulatory information. Instead of relying solely on acoustic features, the method leverages the prior knowledge of multimodal large language models and introduces transcript- and phoneme-level cues to bridge speech signals with interpretable facial actions. Extensive experiments show that AudioFace achieves superior performance across multiple evaluation metrics.
What carries the argument
The transcript- and phoneme-guided conditioning of multimodal language model priors that structures the mapping from audio to blendshape coefficients.
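The conditioning idea can be sketched as a minimal fusion model: concatenate per-frame acoustic features with phoneme-level cues before regressing blendshape coefficients. Every dimension and the linear head below are illustrative assumptions, not the paper's architecture; the only grounded constant is that ARKit defines 52 blendshape coefficients.

```python
import numpy as np

# Hypothetical dimensions for illustration (not taken from the paper):
# 128-d acoustic frame features, a 40-phoneme inventory, and the
# 52 ARKit-style blendshape coefficients.
AUDIO_DIM, N_PHONEMES, N_BLENDSHAPES = 128, 40, 52

rng = np.random.default_rng(0)
# Stand-in for a learned regression head.
W = rng.normal(0.0, 0.01, size=(AUDIO_DIM + N_PHONEMES, N_BLENDSHAPES))

def predict_blendshapes(audio_feats, phoneme_ids):
    """Fuse per-frame audio features with phoneme-level cues, then
    regress blendshape coefficients squashed into [0, 1]."""
    one_hot = np.eye(N_PHONEMES)[phoneme_ids]               # (T, 40)
    fused = np.concatenate([audio_feats, one_hot], axis=1)  # (T, 168)
    logits = fused @ W                                      # (T, 52)
    return 1.0 / (1.0 + np.exp(-logits))                    # sigmoid

T = 5
coeffs = predict_blendshapes(rng.normal(size=(T, AUDIO_DIM)),
                             rng.integers(0, N_PHONEMES, size=T))
assert coeffs.shape == (T, N_BLENDSHAPES)
```

The point of the sketch is the input contract, not the model: the phoneme stream enters as an explicit conditioning signal alongside audio, which is what the matched-input ablation question later in this review turns on.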
If this is right
- More accurate prediction of articulation-specific mouth shapes during speech.
- Higher scores on standard quantitative and qualitative metrics for facial animation quality.
- Better alignment between phonetic content and visible facial motion without extra manual input.
- Demonstration that multimodal priors can be applied directly to low-level audio-to-visual conversion tasks.
Where Pith is reading between the lines
- The same cueing strategy could be tested on non-English speech to check whether phoneme guidance transfers across languages.
- If the model runs efficiently, it might support live animation in video calls or games by reducing the need for post-processing.
- This technique suggests a broader pattern where language model knowledge can refine other signal-to-signal mappings such as audio to gesture or text to motion.
Load-bearing premise
That the prior knowledge inside multimodal large language models, when supplied with transcript- and phoneme-level cues, will reliably improve the mapping from acoustic signals to interpretable facial actions.
What would settle it
A controlled comparison on a standard speech-animation benchmark in which the full AudioFace system shows no statistically significant gain over a strong acoustic-only baseline in lip synchronization error or perceptual naturalness scores.
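For concreteness, one lip-synchronization error commonly reported in this literature is Lip Vertex Error: per frame, the maximal L2 distance over lip-region vertices, averaged over frames. The sketch below assumes `(T, V, 3)` vertex sequences and a hypothetical `lip_idx` index set; it is a plausible evaluation harness, not the paper's protocol.

```python
import numpy as np

def lip_vertex_error(pred, gt, lip_idx):
    """Lip Vertex Error as commonly defined in speech-driven animation
    work: per frame, the maximal L2 distance over lip-region vertices,
    averaged over all frames.
    pred, gt: (T, V, 3) vertex sequences; lip_idx: lip vertex indices."""
    diff = pred[:, lip_idx] - gt[:, lip_idx]       # (T, L, 3)
    dists = np.linalg.norm(diff, axis=-1)          # (T, L)
    return float(dists.max(axis=1).mean())

# Synthetic sanity check: a prediction equal to ground truth scores 0,
# and a slightly perturbed one scores a small positive error.
rng = np.random.default_rng(1)
gt = rng.normal(size=(10, 100, 3))
pred = gt + 0.001 * rng.normal(size=gt.shape)
err = lip_vertex_error(pred, gt, lip_idx=np.arange(20))
assert lip_vertex_error(gt, gt, np.arange(20)) == 0.0
assert err > 0.0
```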
Original abstract
Speech-driven facial animation requires accurate correspondence between acoustic signals and facial motion, especially for articulation-related mouth movements. However, directly mapping speech audio to facial coefficients often overlooks the linguistic and phonetic structure underlying speech production. In this paper, we propose AudioFace, a language-assisted framework for speech-driven blendshape generation that treats mouth-related facial coefficient prediction as a structured generation problem guided by linguistic and articulatory information. Instead of relying solely on acoustic features, our method leverages the prior knowledge of multimodal large language models and introduces transcript- and phoneme-level cues to bridge speech signals with interpretable facial actions. Extensive experiments show that AudioFace achieves superior performance across multiple evaluation metrics, validating the effectiveness of language-assisted and multimodal-prior-guided speech-driven facial animation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes AudioFace, a language-assisted framework for speech-driven facial animation. It uses multimodal large language models (MLLMs) to incorporate transcript- and phoneme-level cues to guide the generation of blendshape coefficients from speech audio, treating it as a structured generation problem rather than direct acoustic mapping. The central claim is that this approach achieves superior performance across multiple evaluation metrics, demonstrating the effectiveness of language-assisted and multimodal-prior-guided animation.
Significance. If validated with proper controls, the work could contribute to the field by showing how priors from MLLMs can improve the accuracy and interpretability of speech-to-facial motion mappings. The idea of bridging acoustic signals with linguistic structure via MLLMs is a promising direction for applications in animation and virtual agents.
Major comments (2)
- [Abstract] The abstract asserts superior performance on multiple metrics but provides no experimental details, baselines, error bars, datasets, or quantitative results. This absence makes it impossible to assess whether the evidence supports the superiority claim.
- [Method] The framework explicitly conditions on transcript- and phoneme-level cues in addition to audio. Without matched-input ablations comparing to baselines that receive the same linguistic inputs (or confirming that standard baselines do not), it is unclear whether gains are attributable to the MLLM priors or simply to the additional conditioning signals. This is load-bearing for the claim that the multimodal-prior-guided approach is effective.
Minor comments (1)
- [Abstract] The phrasing 'extensive experiments show' is vague without specifics; consider adding a brief mention of key metrics or datasets if space allows.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will make revisions to improve clarity and strengthen the evidence for our claims.
Point-by-point responses
Referee: [Abstract] The abstract asserts superior performance on multiple metrics but provides no experimental details, baselines, error bars, datasets, or quantitative results. This absence makes it impossible to assess whether the evidence supports the superiority claim.
Authors: We agree that the abstract is high-level and lacks specific details to fully support the superiority claims. In the revised version, we will expand the abstract to include key quantitative results (e.g., improvements on metrics like lip synchronization error and perceptual quality), the primary datasets used, main baselines, and mention of error bars where applicable. The full experimental protocol, including all metrics, statistical details, and implementation specifics, remains in the Experiments section. This revision will make the abstract more self-contained without exceeding typical length constraints.
Revision: yes
Referee: [Method] The framework explicitly conditions on transcript- and phoneme-level cues in addition to audio. Without matched-input ablations comparing to baselines that receive the same linguistic inputs (or confirming that standard baselines do not), it is unclear whether gains are attributable to the MLLM priors or simply to the additional conditioning signals. This is load-bearing for the claim that the multimodal-prior-guided approach is effective.
Authors: This is a valid and important point for isolating the contribution of the MLLM priors. Our baselines follow standard practices in the speech-driven animation literature and operate on audio features alone. The MLLM component provides structured linguistic and articulatory priors that go beyond raw conditioning. To directly address the concern, we will add matched-input ablation studies in the revised manuscript: we will augment the audio-only baselines with the identical transcript- and phoneme-level cues extracted via the same process and report comparative results. This will demonstrate that performance gains arise from the multimodal-prior-guided structured generation rather than from the input signals alone.
Revision: yes
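The promised matched-input ablation can be laid out as a condition grid. The condition names below are illustrative, not taken from the paper; the design point is that the last two rows receive identical conditioning inputs, so any gap between them is attributable to the MLLM prior rather than to the extra signals.

```python
# Hypothetical matched-input ablation grid; names are illustrative.
conditions = [
    {"name": "baseline (audio only)",
     "audio": True, "cues": False, "mllm_prior": False},
    {"name": "baseline + transcript/phoneme cues",
     "audio": True, "cues": True, "mllm_prior": False},
    {"name": "full system (cues + MLLM prior)",
     "audio": True, "cues": True, "mllm_prior": True},
]

# The prior's contribution is isolated by the pair of rows that
# differ only in `mllm_prior`.
matched = [c for c in conditions if c["cues"]]
assert len(matched) == 2
assert matched[0]["mllm_prior"] != matched[1]["mllm_prior"]
for c in conditions:
    print(c["name"])
```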
Circularity Check
No circularity; empirical claims with no derivation chain
Full rationale
The paper presents AudioFace as an empirical framework for speech-driven blendshape generation that incorporates transcript- and phoneme-level cues alongside multimodal LLM priors. All central claims concern measured performance improvements across evaluation metrics rather than any mathematical derivation, equation, or prediction that reduces to its inputs by construction. No self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The method is described as a design choice that supplies additional linguistic conditioning, but this does not constitute a tautological reduction; the superiority claim rests on experimental outcomes, which are independent of any internal derivation loop.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Multimodal large language models contain prior knowledge of linguistic and articulatory structure that can be leveraged to improve the speech-to-face mapping.
Reference graph
Works this paper leans on
- [1] Apple Developer Documentation. ARKit. https://developer.apple.com/documentation/arkit, 2023.
- [2] Linchao Bao, Haoxian Zhang, Yue Qian, Tangli Xue, Changhai Chen, Xuefei Zhe, and Di Kang. Learning audio-driven viseme dynamics for 3D face animation. arXiv preprint arXiv:2301.06059, 2023.
- [3] Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3D faces. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 157–164, 2023.
- [4] Zhiyuan Chen, Jiajiong Cao, Zhiquan Chen, Yuming Li, and Chenguang Ma. EchoMimic: Lifelike audio-driven portrait animations through editable landmark conditions. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 2403–2410, 2025.
- [5] Chaeyeon Chung, Ilya Fedorov, Michael Huang, Aleksey Karmanov, Dmitry Korobchenko, Roger Ribera, Yeongho Seol, et al. Audio2Face-3D: Audio-driven realistic facial animation for digital avatars. arXiv preprint arXiv:2508.16401, 2025.
- [6] Michael M. Cohen and Dominic W. Massaro. Modeling coarticulation in synthetic visual speech. In Models and Techniques in Computer Animation, pages 139–156. Springer, 1993.
- [7] Daniel Cudeiro, Timo Bolkart, Cassidy Laidlaw, Anurag Ranjan, and Michael Black. Capture, learning, and synthesis of 3D speaking styles. Computer Vision and Pattern Recognition (CVPR), pages 10101–10111, 2019.
- [8] Daniel Cudeiro, Timo Bolkart, Cassidy Laidlaw, Anurag Ranjan, and Michael J. Black. Capture, learning, and synthesis of 3D speaking styles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10101–10111, 2019.
- [9] Radek Daněček, Kiran Chhatre, Shashank Tripathi, Yandong Wen, Michael Black, and Timo Bolkart. Emotional speech-driven animation with content-emotion disentanglement. In SIGGRAPH Asia 2023 Conference Papers, pages 1–13, 2023.
- [10] Pif Edwards, Chris Landreth, Eugene Fiume, and Karan Singh. JALI: An animator-centric viseme model for expressive lip synchronization. ACM Transactions on Graphics (TOG), 35(4):1–11, 2016.
- [11] Epic Games. MetaHuman: High-fidelity digital humans made easy. https://www.metahuman.com/en-US, 2025. Accessed: 2025-06-30.
- [12] Xiangyu Fan, Jiaqi Li, Zhiqian Lin, Weiye Xiao, and Lei Yang. UniTalker: Scaling up audio-driven 3D facial animation through a unified model. In European Conference on Computer Vision, pages 204–221. Springer, 2024.
- [13] Yingruo Fan, Zhaojiang Lin, Jun Saito, Wenping Wang, and Taku Komura. FaceFormer: Speech-driven 3D facial animation with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18770–18780, 2022.
- [14] Yingruo Fan, Zhaojiang Lin, Jun Saito, Wenping Wang, and Taku Komura. Joint audio-text model for expressive speech-driven 3D facial animation. Proceedings of the ACM on Computer Graphics and Interactive Techniques, 5(1):1–15, 2022.
- [15] Markus Frohmann, Igor Sterner, Ivan Vulić, Benjamin Minixhofer, and Markus Schedl. Segment any text: A universal approach for robust, efficient and adaptable sentence segmentation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 11908–11941, 2024.
- [16] Yuan Gan, Zongxin Yang, Xihang Yue, Lingyun Sun, and Yi Yang. Efficient emotional adaptation for audio-driven talking-head generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22634–22645, 2023.
- [17] Zhifu Gao, Zerui Li, Jiaming Wang, Haoneng Luo, Xian Shi, Mengzhe Chen, Yabin Li, Lingyun Zuo, Zhihao Du, Zhangyu Xiao, et al. FunASR: A fundamental end-to-end speech recognition toolkit. arXiv preprint arXiv:2305.11013, 2023.
- [18] Yudong Guo, Keyu Chen, Sen Liang, Yong-Jin Liu, Hujun Bao, and Juyong Zhang. AD-NeRF: Audio driven neural radiance fields for talking head synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5784–5794, 2021.
- [19] Yisheng He, Xiaodong Gu, Xiaodan Ye, Chao Xu, Zhengyi Zhao, Yuan Dong, Weihao Yuan, Zilong Dong, and Liefeng Bo. LAM: Large avatar model for one-shot animatable Gaussian head. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Papers, pages 1–13, 2025.
- [20] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- [21] Xinya Ji, Hang Zhou, Kaisiyuan Wang, Wayne Wu, Chen Change Loy, Xun Cao, and Feng Xu. Audio-driven emotional video portraits. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14080–14089, 2021.
- [22] Jianwen Jiang, Chao Liang, Jiaqi Yang, Gaojie Lin, Tianyun Zhong, and Yanbo Zheng. Loopy: Taming audio-driven portrait avatar with long-term motion dependency. arXiv preprint arXiv:2409.02634, 2024.
- [23] Chuhao Jin, Rui Zhang, Qingzhe Gao, Haoyu Shi, Dayu Wu, Yichen Jiang, Yihan Wu, and Ruihua Song. SentiAvatar: Towards expressive and interactive digital humans. arXiv preprint arXiv:2604.02908, 2026.
- [24] Zejian Kang, Kai Zheng, Yuanchen Fei, Wentao Yang, Hongyuan Zou, and Xiangru Huang. SemanticFace: Semantic facial action estimation via semantic distillation in interpretable space. arXiv preprint arXiv:2603.14827, 2026.
- [25] Tianye Li, Timo Bolkart, Michael J. Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics, 36(6):194, 2017.
- [26] Gaojie Lin, Jianwen Jiang, Chao Liang, Tianyun Zhong, Jiaqi Yang, Zerong Zheng, and Yanbo Zheng. CyberHost: A one-stage diffusion framework for audio-driven talking body generation. In The Thirteenth International Conference on Learning Representations, 2025.
- [27] Haiyang Liu, Zihao Zhu, Naoya Iwamoto, Yichen Peng, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. BEAT: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis. In European Conference on Computer Vision, pages 612–630. Springer, 2022.
- [28] Timo Menzel, Mario Botsch, and Marc Erich Latoschik. Automated blendshape personalization for faithful face animations using commodity smartphones. In Proceedings of the 28th ACM Symposium on Virtual Reality Software and Technology, pages 1–9, 2022.
- [29] Inkyu Park and Jaewoong Cho. SAiD: Speech-driven blendshape facial animation with diffusion. arXiv preprint arXiv:2401.08655, 2023.
- [30] Ziqiao Peng, Haoyu Wu, Zhenbo Song, Hao Xu, Xiangyu Zhu, Jun He, Hongyan Liu, and Zhaoxin Fan. EmoTalk: Speech-driven emotional disentanglement for 3D face animation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20687–20697, 2023.
- [31] Alexander Richard, Michael Zollhöfer, Yandong Wen, Fernando De la Torre, and Yaser Sheikh. MeshTalk: 3D face animation from speech using cross-modality disentanglement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1173–1182, 2021.
- [32] Stefan Stan, Kazi Injamamul Haque, and Zerrin Yumak. FaceDiffuser: Speech-driven 3D facial animation synthesis using diffusion. In Proceedings of the 16th ACM SIGGRAPH Conference on Motion, Interaction and Games, pages 1–11, 2023.
- [33] Sarah Taylor, Taehwan Kim, Yisong Yue, Moshe Mahler, James Krahe, Anastasio Garcia Rodriguez, Jessica Hodgins, and Iain Matthews. A deep learning approach for generalized speech animation. ACM Transactions on Graphics (TOG), 36(4):1–11, 2017.
- [34] Sarah L. Taylor, Moshe Mahler, Barry-John Theobald, and Iain Matthews. Dynamic units of visual speech. In Proceedings of the 11th ACM SIGGRAPH/Eurographics Conference on Computer Animation, pages 275–284, 2012.
- [35] Jingchao Wu, Zejian Kang, Haibo Liu, Yuanchen Fei, and Xiangru Huang. KeyframeFace: From text to expressive facial keyframes. arXiv preprint arXiv:2512.11321, 2025.
- [36] Jinbo Xing, Menghan Xia, Yuechen Zhang, Xiaodong Cun, Jue Wang, and Tien-Tsin Wong. CodeTalker: Speech-driven 3D facial animation with discrete motion prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12780–12790, 2023.
- [37] Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. arXiv preprint, 2025.
- [38] Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, and Baining Guo. VASA-1: Lifelike audio-driven talking faces generated in real time. Advances in Neural Information Processing Systems, 37:660–684, 2024.
- [39] Zhihao Xu, Shengjie Gong, Jiapeng Tang, Lingyu Liang, Yining Huang, Haojie Li, and Shuangping Huang. KMTalk: Speech-driven 3D facial animation with key motion embedding. In European Conference on Computer Vision, pages 236–253. Springer, 2024.
- [40] Hui Zhang, Tian Yuan, Junkun Chen, Xintong Li, Renjie Zheng, Yuxin Huang, Xiaojie Chen, Enlei Gong, Zeyu Chen, Xiaoguang Hu, et al. PaddleSpeech: An easy-to-use all-in-one speech toolkit. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: System Demonstrations, 2022.
- [41] Wenxuan Zhang, Xiaodong Cun, Xuan Wang, Yong Zhang, Xi Shen, Yu Guo, Ying Shan, and Fei Wang. SadTalker: Learning realistic 3D motion coefficients for stylized audio-driven single image talking face animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8652–8661, 2023.
- [42] Dingcheng Zhen, Shunshun Yin, Shiyang Qin, Hou Yi, Ziwei Zhang, Siyuan Liu, Gan Qi, and Ming Tao. Teller: Real-time streaming audio-driven portrait animation with autoregressive motion generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21075–21085, 2025.
- [43] Yang Zhou, Zhan Xu, Chris Landreth, Evangelos Kalogerakis, Subhransu Maji, and Karan Singh. VisemeNet: Audio-driven animator-centric speech animation. ACM Transactions on Graphics (TOG), 37(4):1–10, 2018.