pith. machine review for the scientific record.

arxiv: 2605.07478 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: 1 theorem link · Lean Theorem

AudioFace: Language-Assisted Speech-Driven Facial Animation with Multimodal Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:47 UTC · model grok-4.3

classification 💻 cs.CV
keywords speech-driven facial animation · blendshape generation · multimodal language models · phoneme-level cues · linguistic guidance · audio-to-face mapping · articulation modeling

The pith

Adding linguistic and phonetic cues from multimodal language models improves how speech audio maps to accurate facial blendshapes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AudioFace as a way to generate facial animations from speech by treating the task as structured generation rather than pure acoustic mapping. It supplies multimodal language models with transcripts and phoneme sequences so their built-in knowledge of language and articulation can guide mouth and face movements. This addresses the common problem that direct audio-to-face methods produce mismatched or unnatural lip shapes because they ignore the linguistic structure of speech. If the approach works, it would allow more reliable automatic animation for virtual characters, video dubbing, and real-time avatars without needing manual correction of mouth motions.
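
For readers outside graphics, the "blendshape coefficients" in question are the weights of a standard linear face rig, sketched below; the paper's figures point to ARKit-style coefficients, but the exact rig is our assumption rather than something stated in the text here.

```latex
% Standard linear blendshape model (background convention, not AudioFace-specific):
%   V(t)    animated mesh vertices at frame t
%   V_0     neutral-face vertices
%   B_i     offset of the i-th blendshape target from the neutral face
%   w_i(t)  coefficient predicted from speech at frame t, typically in [0, 1]
V(t) = V_0 + \sum_{i=1}^{K} w_i(t)\, B_i , \qquad 0 \le w_i(t) \le 1
```

Under this convention, "speech-driven blendshape generation" means predicting the trajectory of the weights w_1(t), ..., w_K(t) from audio; the rest of the review concerns how those weights are predicted.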

Core claim

AudioFace is a language-assisted framework for speech-driven blendshape generation that treats mouth-related facial coefficient prediction as a structured generation problem guided by linguistic and articulatory information. Instead of relying solely on acoustic features, the method leverages the prior knowledge of multimodal large language models and introduces transcript- and phoneme-level cues to bridge speech signals with interpretable facial actions. Extensive experiments show that AudioFace achieves superior performance across multiple evaluation metrics.

What carries the argument

The transcript- and phoneme-guided conditioning of multimodal language model priors that structures the mapping from audio to blendshape coefficients.
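
As a concrete reading of that sentence, here is a minimal sketch of how transcript- and phoneme-level cues could condition a multimodal model's blendshape prediction. The paper does not specify its interfaces, so `asr`, `g2p`, `audio_encoder`, and `mllm.generate_blendshapes` are hypothetical stand-ins, not AudioFace's actual components.

```python
# Illustrative sketch only; names and interfaces are assumptions, not the paper's API.
from dataclasses import dataclass
from typing import List

@dataclass
class BlendshapeFrame:
    coefficients: List[float]  # e.g. 52 ARKit-style values in [0, 1] per frame

def animate_from_speech(wav_path: str, asr, g2p, audio_encoder, mllm) -> List[BlendshapeFrame]:
    transcript = asr.transcribe(wav_path)          # transcript-level cue
    phonemes = g2p.to_phonemes(transcript)         # phoneme-level cue
    audio_tokens = audio_encoder.encode(wav_path)  # acoustic features

    # Structured generation: the language prior sees all three streams together,
    # instead of mapping raw audio to coefficients directly.
    prompt = {
        "audio": audio_tokens,
        "transcript": transcript,
        "phonemes": phonemes,
        "task": "predict per-frame mouth blendshape coefficients",
    }
    frames = mllm.generate_blendshapes(prompt)     # hypothetical structured-output call
    return [BlendshapeFrame(coefficients=f) for f in frames]
```

The point of the sketch is the conditioning pattern, not the plumbing: dropping the transcript and phoneme entries recovers the acoustic-only setup that the referee report below asks to be compared against.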

If this is right

  • More accurate prediction of articulation-specific mouth shapes during speech.
  • Higher scores on standard quantitative and qualitative metrics for facial animation quality.
  • Better alignment between phonetic content and visible facial motion without extra manual input.
  • Demonstration that multimodal priors can be applied directly to low-level audio-to-visual conversion tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same cueing strategy could be tested on non-English speech to check whether phoneme guidance transfers across languages.
  • If the model runs efficiently, it might support live animation in video calls or games by reducing the need for post-processing.
  • This technique suggests a broader pattern where language model knowledge can refine other signal-to-signal mappings such as audio to gesture or text to motion.

Load-bearing premise

That the prior knowledge inside multimodal large language models, when supplied with transcript- and phoneme-level cues, will reliably improve the mapping from acoustic signals to interpretable facial actions.
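
To make "transcript- and phoneme-level cues" concrete, here is a toy illustration of what such cues could look like if they were time-aligned ARPAbet tokens; the lexicon, timings, and output format are invented for clarity and are not taken from the paper.

```python
# Toy cue construction; everything here is illustrative, not AudioFace's pipeline.
TOY_LEXICON = {
    "face": ["F", "EY1", "S"],
    "moves": ["M", "UW1", "V", "Z"],
}

def build_cues(words_with_times):
    """words_with_times: [(word, start_sec, end_sec), ...], e.g. from a forced aligner."""
    transcript = " ".join(word for word, _, _ in words_with_times)
    phoneme_track = []
    for word, start, end in words_with_times:
        phones = TOY_LEXICON.get(word, ["SPN"])  # SPN = unknown / spoken noise
        step = (end - start) / len(phones)
        for i, phone in enumerate(phones):
            phoneme_track.append((phone, round(start + i * step, 3)))
    return {"transcript": transcript, "phonemes": phoneme_track}

print(build_cues([("face", 0.00, 0.32), ("moves", 0.32, 0.75)]))
# {'transcript': 'face moves', 'phonemes': [('F', 0.0), ('EY1', 0.107), ('S', 0.213), ...]}
```

The premise above is that supplying this kind of structure, rather than leaving the model to infer it from raw audio alone, is what lets the language prior help.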

What would settle it

A controlled comparison on a standard speech-animation benchmark between the full AudioFace system and a strong acoustic-only baseline, measured by lip synchronization error and perceptual naturalness scores: a statistically significant gain would support the core claim, and the absence of one would undercut it.
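
A hedged sketch of how one leg of that comparison could be scored, assuming per-clip lip-sync errors are already computed for both systems on the same clips; the numbers below are placeholders and scipy is assumed to be available.

```python
# Paired comparison on identical clips; error values are placeholders, not results.
import numpy as np
from scipy.stats import wilcoxon

baseline_err  = np.array([3.1, 2.8, 3.5, 2.9, 3.3, 3.0])  # strong acoustic-only baseline
audioface_err = np.array([2.7, 2.6, 3.4, 2.5, 3.1, 2.8])  # full language-assisted system

stat, p_value = wilcoxon(baseline_err, audioface_err)      # paired, non-parametric test
mean_gain = float(np.mean(baseline_err - audioface_err))
print(f"mean lip-sync error reduction: {mean_gain:.3f}, p = {p_value:.4f}")
# A significant reduction (plus better perceptual-naturalness ratings) would support
# the core claim; no significant difference would undercut it.
```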

Figures

Figures reproduced from arXiv: 2605.07478 by Hongyuan Zou, Kai Zheng, Rui Mao, Xiangru Huang, Xuanyang Xu, Yuanchen Fei, Zejian Kang.

Figure 1. Overview of the proposed AudioFace framework. Given a speech sequence, we first […]
Figure 2. Qualitative comparison of audio-driven facial animation results. The leftmost column […]
Figure 3. Additional qualitative results of AudioFace. We show more rendered frames generated from predicted ARKit coefficients. The results include diverse phoneme-related mouth configurations and smooth expression variations, demonstrating the articulation-aware generation ability of our language-assisted speech-driven facial animation framework.
read the original abstract

Speech-driven facial animation requires accurate correspondence between acoustic signals and facial motion, especially for articulation-related mouth movements. However, directly mapping speech audio to facial coefficients often overlooks the linguistic and phonetic structure underlying speech production. In this paper, we propose AudioFace, a language-assisted framework for speech-driven blendshape generation that treats mouth-related facial coefficient prediction as a structured generation problem guided by linguistic and articulatory information. Instead of relying solely on acoustic features, our method leverages the prior knowledge of multimodal large language models and introduces transcript- and phoneme-level cues to bridge speech signals with interpretable facial actions. Extensive experiments show that AudioFace achieves superior performance across multiple evaluation metrics, validating the effectiveness of language-assisted and multimodal-prior-guided speech-driven facial animation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes AudioFace, a language-assisted framework for speech-driven facial animation. It uses multimodal large language models (MLLMs) to incorporate transcript- and phoneme-level cues to guide the generation of blendshape coefficients from speech audio, treating it as a structured generation problem rather than direct acoustic mapping. The central claim is that this approach achieves superior performance across multiple evaluation metrics, demonstrating the effectiveness of language-assisted and multimodal-prior-guided animation.

Significance. If validated with proper controls, the work could contribute to the field by showing how priors from MLLMs can improve the accuracy and interpretability of speech-to-facial motion mappings. The idea of bridging acoustic signals with linguistic structure via MLLMs is a promising direction for applications in animation and virtual agents.

major comments (2)
  1. [Abstract] The abstract asserts superior performance on multiple metrics but provides no experimental details, baselines, error bars, datasets, or quantitative results. This absence makes it impossible to assess whether the evidence supports the superiority claim.
  2. [Method] The framework explicitly conditions on transcript- and phoneme-level cues in addition to audio. Without matched-input ablations comparing to baselines that receive the same linguistic inputs (or confirming that standard baselines do not), it is unclear whether gains are attributable to the MLLM priors or simply to the additional conditioning signals. This is load-bearing for the claim that the multimodal-prior-guided approach is effective.
minor comments (1)
  1. [Abstract] The phrasing 'extensive experiments show' is vague without specifics; consider adding a brief mention of key metrics or datasets if space allows.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will make revisions to improve clarity and strengthen the evidence for our claims.

read point-by-point responses
  1. Referee: [Abstract] The abstract asserts superior performance on multiple metrics but provides no experimental details, baselines, error bars, datasets, or quantitative results. This absence makes it impossible to assess whether the evidence supports the superiority claim.

    Authors: We agree that the abstract is high-level and lacks specific details to fully support the superiority claims. In the revised version, we will expand the abstract to include key quantitative results (e.g., improvements on metrics like lip synchronization error and perceptual quality), the primary datasets used, main baselines, and mention of error bars where applicable. The full experimental protocol, including all metrics, statistical details, and implementation specifics, remains in the Experiments section. This revision will make the abstract more self-contained without exceeding typical length constraints. revision: yes

  2. Referee: [Method] The framework explicitly conditions on transcript- and phoneme-level cues in addition to audio. Without matched-input ablations comparing to baselines that receive the same linguistic inputs (or confirming that standard baselines do not), it is unclear whether gains are attributable to the MLLM priors or simply to the additional conditioning signals. This is load-bearing for the claim that the multimodal-prior-guided approach is effective.

    Authors: This is a valid and important point for isolating the contribution of the MLLM priors. Our baselines follow standard practices in speech-driven animation literature and operate on audio features alone. The MLLM component provides structured linguistic and articulatory priors that go beyond raw conditioning. To directly address the concern, we will add matched-input ablation studies in the revised manuscript: we will augment the audio-only baselines with the identical transcript- and phoneme-level cues extracted via the same process and report comparative results. This will demonstrate that performance gains arise from the multimodal-prior-guided structured generation rather than input signals alone. revision: yes
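
Read literally, the matched-input ablation the authors promise amounts to a small grid like the one below; the constructors and evaluation call are hypothetical placeholders, and only the conditioning contrasts matter.

```python
# Matched-input ablation grid (illustrative). The contrast isolating the MLLM prior
# is B vs. D: both receive identical transcript and phoneme cues, but only D routes
# them through the multimodal language model backbone.
ABLATIONS = {
    "A_audio_only_baseline": {"inputs": ["audio"],                           "backbone": "acoustic"},
    "B_baseline_plus_cues":  {"inputs": ["audio", "transcript", "phonemes"], "backbone": "acoustic"},
    "C_mllm_audio_only":     {"inputs": ["audio"],                           "backbone": "mllm"},
    "D_audioface_full":      {"inputs": ["audio", "transcript", "phonemes"], "backbone": "mllm"},
}

def run_ablations(build_system, benchmark):
    """build_system(spec) -> model with an .evaluate(benchmark) method (hypothetical API)."""
    return {name: build_system(spec).evaluate(benchmark) for name, spec in ABLATIONS.items()}
```

If B closes most of the gap to D, the gains come from the extra conditioning signals; if D still clearly leads B, the multimodal prior is doing real work.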

Circularity Check

0 steps flagged

No circularity; empirical claims with no derivation chain

full rationale

The paper presents AudioFace as an empirical framework for speech-driven blendshape generation that incorporates transcript- and phoneme-level cues alongside multimodal LLM priors. All central claims concern measured performance improvements across evaluation metrics rather than any mathematical derivation, equation, or prediction that reduces to its inputs by construction. No self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The method is described as a design choice that supplies additional linguistic conditioning, but this does not constitute a tautological reduction; the superiority claim rests on experimental outcomes, which are independent of any internal derivation loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that multimodal LLMs already encode useful articulatory and linguistic priors that can be transferred to facial animation without additional heavy supervision.

axioms (1)
  • domain assumption Multimodal large language models contain prior knowledge of linguistic and articulatory structure that can be leveraged to improve speech-to-face mapping.
    Directly stated in the abstract when the method is described as leveraging the prior knowledge of multimodal LLMs.

pith-pipeline@v0.9.0 · 5438 in / 1138 out tokens · 51048 ms · 2026-05-11T01:47:36.777180+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 4 internal anchors

  1. [1]

    https://developer.apple.com/documentation/arkit, 2023

    Arkit. https://developer.apple.com/documentation/arkit, 2023. Apple Developer Documentation

  2. [2]

    Learning audio-driven viseme dynamics for 3d face animation. arXiv preprint arXiv:2301.06059, 2023

    Linchao Bao, Haoxian Zhang, Yue Qian, Tangli Xue, Changhai Chen, Xuefei Zhe, and Di Kang. Learning audio-driven viseme dynamics for 3d face animation. arXiv preprint arXiv:2301.06059, 2023

  3. [3]

    A morphable model for the synthesis of 3d faces

    Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3d faces. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 157–164. 2023

  4. [4]

    Echomimic: Lifelike audio-driven portrait animations through editable landmark conditions

    Zhiyuan Chen, Jiajiong Cao, Zhiquan Chen, Yuming Li, and Chenguang Ma. Echomimic: Lifelike audio-driven portrait animations through editable landmark conditions. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 2403–2410, 2025

  5. [5]

    Audio2face-3d: Audio-driven realistic facial animation for digital avatars. arXiv preprint arXiv:2508.16401, 2025

    Chaeyeon Chung, Ilya Fedorov, Michael Huang, Aleksey Karmanov, Dmitry Korobchenko, Roger Ribera, Yeongho Seol, et al. Audio2face-3d: Audio-driven realistic facial animation for digital avatars. arXiv preprint arXiv:2508.16401, 2025

  6. [6]

    Modeling coarticulation in synthetic visual speech

    Michael M Cohen and Dominic W Massaro. Modeling coarticulation in synthetic visual speech. In Models and techniques in computer animation, pages 139–156. Springer, 1993

  7. [7]

    Capture, learning, and synthesis of 3D speaking styles. Computer Vision and Pattern Recognition (CVPR), pages 10101–10111, 2019

    Daniel Cudeiro, Timo Bolkart, Cassidy Laidlaw, Anurag Ranjan, and Michael Black. Capture, learning, and synthesis of 3D speaking styles. Computer Vision and Pattern Recognition (CVPR), pages 10101–10111, 2019

  8. [8]

    Capture, learning, and synthesis of 3d speaking styles

    Daniel Cudeiro, Timo Bolkart, Cassidy Laidlaw, Anurag Ranjan, and Michael J Black. Capture, learning, and synthesis of 3d speaking styles. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10101–10111, 2019

  9. [9]

    Emotional speech-driven animation with content-emotion disentanglement

    Radek Daněček, Kiran Chhatre, Shashank Tripathi, Yandong Wen, Michael Black, and Timo Bolkart. Emotional speech-driven animation with content-emotion disentanglement. In SIGGRAPH Asia 2023 Conference Papers, pages 1–13, 2023

  10. [10]

    Jali: an animator-centric viseme model for expressive lip synchronization. ACM Transactions on graphics (TOG), 35(4):1–11, 2016

    Pif Edwards, Chris Landreth, Eugene Fiume, and Karan Singh. Jali: an animator-centric viseme model for expressive lip synchronization. ACM Transactions on graphics (TOG), 35(4):1–11, 2016

  11. [11]

    Metahuman: High-fidelity digital humans made easy

    Epic Games. Metahuman: High-fidelity digital humans made easy. https://www.metahuman.com/en-US, 2025. Accessed: 2025-06-30

  12. [12]

    Unitalker: Scaling up audio-driven 3d facial animation through a unified model

    Xiangyu Fan, Jiaqi Li, Zhiqian Lin, Weiye Xiao, and Lei Yang. Unitalker: Scaling up audio-driven 3d facial animation through a unified model. In European Conference on Computer Vision, pages 204–221. Springer, 2024

  13. [13]

    Faceformer: Speech-driven 3d facial animation with transformers

    Yingruo Fan, Zhaojiang Lin, Jun Saito, Wenping Wang, and Taku Komura. Faceformer: Speech-driven 3d facial animation with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18770–18780, 2022

  14. [14]

    Joint audio-text model for expressive speech-driven 3d facial animation. Proceedings of the ACM on Computer Graphics and Interactive Techniques, 5(1):1–15, 2022

    Yingruo Fan, Zhaojiang Lin, Jun Saito, Wenping Wang, and Taku Komura. Joint audio-text model for expressive speech-driven 3d facial animation. Proceedings of the ACM on Computer Graphics and Interactive Techniques, 5(1):1–15, 2022

  15. [15]

    Segment any text: A universal approach for robust, efficient and adaptable sentence segmentation

    Markus Frohmann, Igor Sterner, Ivan Vulić, Benjamin Minixhofer, and Markus Schedl. Segment any text: A universal approach for robust, efficient and adaptable sentence segmentation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 11908–11941, 2024

  16. [16]

    Efficient emotional adaptation for audio-driven talking-head generation

    Yuan Gan, Zongxin Yang, Xihang Yue, Lingyun Sun, and Yi Yang. Efficient emotional adaptation for audio-driven talking-head generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22634–22645, 2023

  17. [17]

    Funasr: A fundamental end-to-end speech recognition toolkit

    Zhifu Gao, Zerui Li, Jiaming Wang, Haoneng Luo, Xian Shi, Mengzhe Chen, Yabin Li, Lingyun Zuo, Zhihao Du, Zhangyu Xiao, et al. Funasr: A fundamental end-to-end speech recognition toolkit. arXiv preprint arXiv:2305.11013, 2023

  18. [18]

    Ad-nerf: Audio driven neural radiance fields for talking head synthesis

    Yudong Guo, Keyu Chen, Sen Liang, Yong-Jin Liu, Hujun Bao, and Juyong Zhang. Ad-nerf: Audio driven neural radiance fields for talking head synthesis. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5784–5794, 2021

  19. [19]

    Lam: large avatar model for one-shot animatable gaussian head

    Yisheng He, Xiaodong Gu, Xiaodan Ye, Chao Xu, Zhengyi Zhao, Yuan Dong, Weihao Yuan, Zilong Dong, and Liefeng Bo. Lam: large avatar model for one-shot animatable gaussian head. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–13, 2025

  20. [20]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021

  21. [21]

    Audio-driven emotional video portraits

    Xinya Ji, Hang Zhou, Kaisiyuan Wang, Wayne Wu, Chen Change Loy, Xun Cao, and Feng Xu. Audio-driven emotional video portraits. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14080–14089, 2021

  22. [22]

    Loopy: Taming audio-driven portrait avatar with long-term motion dependency. arXiv preprint arXiv:2409.02634, 2024

    Jianwen Jiang, Chao Liang, Jiaqi Yang, Gaojie Lin, Tianyun Zhong, and Yanbo Zheng. Loopy: Taming audio-driven portrait avatar with long-term motion dependency. arXiv preprint arXiv:2409.02634, 2024

  23. [23]

    SentiAvatar: Towards Expressive and Interactive Digital Humans

    Chuhao Jin, Rui Zhang, Qingzhe Gao, Haoyu Shi, Dayu Wu, Yichen Jiang, Yihan Wu, and Ruihua Song. Sentiavatar: Towards expressive and interactive digital humans. arXiv preprint arXiv:2604.02908, 2026

  24. [24]

    Semanticface: Semantic facial action estimation via semantic distillation in interpretable space

    Zejian Kang, Kai Zheng, Yuanchen Fei, Wentao Yang, Hongyuan Zou, and Xiangru Huang. Semanticface: Semantic facial action estimation via semantic distillation in interpretable space. arXiv preprint arXiv:2603.14827, 2026

  25. [25]

    Learning a model of facial shape and expression from 4d scans. ACM Trans

    Tianye Li, Timo Bolkart, Michael J Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4d scans. ACM Trans. Graph., 36(6):194–1, 2017

  26. [26]

    Cyberhost: A one-stage diffusion framework for audio-driven talking body generation

    Gaojie Lin, Jianwen Jiang, Chao Liang, Tianyun Zhong, Jiaqi Yang, Zerong Zheng, and Yanbo Zheng. Cyberhost: A one-stage diffusion framework for audio-driven talking body generation. In The Thirteenth International Conference on Learning Representations, 2025

  27. [27]

    Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis

    Haiyang Liu, Zihao Zhu, Naoya Iwamoto, Yichen Peng, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis. In European conference on computer vision, pages 612–630. Springer, 2022

  28. [28]

    Automated blendshape personalization for faithful face animations using commodity smartphones

    Timo Menzel, Mario Botsch, and Marc Erich Latoschik. Automated blendshape personalization for faithful face animations using commodity smartphones. In Proceedings of the 28th ACM Symposium on virtual reality software and technology, pages 1–9, 2022

  29. [29]

    Said: Speech-driven blendshape facial animation with diffusion

    Inkyu Park and Jaewoong Cho. Said: Speech-driven blendshape facial animation with diffusion. arXiv preprint arXiv:2401.08655, 2023

  30. [30]

    Emotalk: Speech-driven emotional disentanglement for 3d face animation

    Ziqiao Peng, Haoyu Wu, Zhenbo Song, Hao Xu, Xiangyu Zhu, Jun He, Hongyan Liu, and Zhaoxin Fan. Emotalk: Speech-driven emotional disentanglement for 3d face animation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 20687–20697, 2023

  31. [31]

    Meshtalk: 3d face animation from speech using cross-modality disentanglement

    Alexander Richard, Michael Zollhöfer, Yandong Wen, Fernando De la Torre, and Yaser Sheikh. Meshtalk: 3d face animation from speech using cross-modality disentanglement. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1173–1182, 2021

  32. [32]

    Facediffuser: Speech-driven 3d facial animation synthesis using diffusion

    Stefan Stan, Kazi Injamamul Haque, and Zerrin Yumak. Facediffuser: Speech-driven 3d facial animation synthesis using diffusion. In Proceedings of the 16th ACM SIGGRAPH Conference on Motion, Interaction and Games, pages 1–11, 2023

  33. [33]

    A deep learning approach for generalized speech animation. ACM Transactions On Graphics (TOG), 36(4):1–11, 2017

    Sarah Taylor, Taehwan Kim, Yisong Yue, Moshe Mahler, James Krahe, Anastasio Garcia Rodriguez, Jessica Hodgins, and Iain Matthews. A deep learning approach for generalized speech animation. ACM Transactions On Graphics (TOG), 36(4):1–11, 2017

  34. [34]

    Dynamic units of visual speech

    Sarah L Taylor, Moshe Mahler, Barry-John Theobald, and Iain Matthews. Dynamic units of visual speech. In Proceedings of the 11th ACM SIGGRAPH/Eurographics conference on Computer Animation, pages 275–284, 2012

  35. [35]

    Keyframeface: From text to expressive facial keyframes. arXiv preprint arXiv:2512.11321, 2025

    Jingchao Wu, Zejian Kang, Haibo Liu, Yuanchen Fei, and Xiangru Huang. Keyframeface: From text to expressive facial keyframes. arXiv preprint arXiv:2512.11321, 2025

  36. [36]

    Codetalker: Speech-driven 3d facial animation with discrete motion prior

    Jinbo Xing, Menghan Xia, Yuechen Zhang, Xiaodong Cun, Jue Wang, and Tien-Tsin Wong. Codetalker: Speech-driven 3d facial animation with discrete motion prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12780–12790, 2023

  37. [37]

    Qwen3-Omni Technical Report

    Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo ...

  38. [38]

    Vasa-1: Lifelike audio-driven talking faces generated in real time. Advances in Neural Information Processing Systems, 37:660–684, 2024

    Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, and Baining Guo. Vasa-1: Lifelike audio-driven talking faces generated in real time. Advances in Neural Information Processing Systems, 37:660–684, 2024

  39. [39]

    Kmtalk: Speech-driven 3d facial animation with key motion embedding

    Zhihao Xu, Shengjie Gong, Jiapeng Tang, Lingyu Liang, Yining Huang, Haojie Li, and Shuangping Huang. Kmtalk: Speech-driven 3d facial animation with key motion embedding. In European Conference on Computer Vision, pages 236–253. Springer, 2024

  40. [40]

    Paddlespeech: An easy-to-use all-in-one speech toolkit

    Hui Zhang, Tian Yuan, Junkun Chen, Xintong Li, Renjie Zheng, Yuxin Huang, Xiaojie Chen, Enlei Gong, Zeyu Chen, Xiaoguang Hu, et al. Paddlespeech: An easy-to-use all-in-one speech toolkit. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: System Demonstrations, ...

  41. [41]

    Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation

    Wenxuan Zhang, Xiaodong Cun, Xuan Wang, Yong Zhang, Xi Shen, Yu Guo, Ying Shan, and Fei Wang. Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8652–8661, 2023

  42. [42]

    Teller: Real-time streaming audio-driven portrait animation with autoregressive motion generation

    Dingcheng Zhen, Shunshun Yin, Shiyang Qin, Hou Yi, Ziwei Zhang, Siyuan Liu, Gan Qi, and Ming Tao. Teller: Real-time streaming audio-driven portrait animation with autoregressive motion generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21075–21085, 2025

  43. [43]

    Visemenet: Audio-driven animator-centric speech animation. ACM Transactions on Graphics (ToG), 37(4):1–10, 2018

    Yang Zhou, Zhan Xu, Chris Landreth, Evangelos Kalogerakis, Subhransu Maji, and Karan Singh. Visemenet: Audio-driven animator-centric speech animation. ACM Transactions on Graphics (ToG), 37(4):1–10, 2018