Polyglot: Multilingual Style Preserving Speech-Driven Facial Animation

Akin Caliskan; Claudio Ferrari; David Ferman; Federico Nocentini; Hyeongwoo Kim; Kwanggyoon Seo; Pablo Garrido; Qingju Liu; Stefano Berretti

arxiv: 2604.16108 · v1 · submitted 2026-04-17 · 💻 cs.CV

Pith reviewed 2026-05-10 08:24 UTC · model grok-4.3

classification 💻 cs.CV

keywords language · facial · multilingual · style · sdfa · polyglot · animation · conditioning

The pith

Polyglot introduces a unified diffusion model for multilingual speech-driven facial animation that jointly conditions on language via transcript embeddings and personal style via reference sequences without requiring explicit labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Speech-driven facial animation creates moving digital faces that match spoken audio. Current systems usually work for only one language because different languages have unique sounds, rhythms, and mouth movements. They also often ignore that each person has their own habits like how much they raise eyebrows or tilt their head when talking. Polyglot tries to fix both problems at once. It takes the written transcript of the speech to understand the language and pulls style information from short example videos of the speaker. These two pieces of information are fed into a diffusion model, which is a type of AI good at creating smooth, realistic sequences over time. The system learns everything without being told in advance which language or which person it is seeing. This self-supervised approach lets it handle new languages and new speakers. The result is supposed to be animations that match the speech timing and the individual's natural expressions, staying consistent from frame to frame. The paper claims this works better than previous methods that handled either language or style but not both together.

Core claim

By jointly conditioning on language and style, Polyglot captures expressive traits such as rhythm, articulation, and habitual facial movements, producing temporally coherent and realistic animations. Experiments show improved performance in both monolingual and multilingual settings, providing a unified framework for modeling language and personal style in SDFA.

Load-bearing premise

That transcript embeddings sufficiently encode language-specific phonetic and rhythmic information and that style embeddings extracted from reference facial sequences can capture individual speaking characteristics, with their combination in the diffusion model generalizing across unseen languages and speakers via self-supervised learning alone.
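
As a rough illustration of the premise above, the sketch below pools per-frame features from a reference sequence into a single style vector and joins it with a transcript embedding. The mean pooling, the concatenation, the function names `pool_style` and `build_condition`, and the toy numbers are all assumptions made for illustration; the source does not say how Polyglot's style encoder aggregates frames or how the two embeddings are fused.

```python
from statistics import fmean


def pool_style(per_frame_features):
    """Average per-frame feature vectors into one style vector.

    Mean pooling is an assumption for this sketch, not the paper's method.
    """
    if not per_frame_features:
        raise ValueError("need at least one reference frame")
    dims = len(per_frame_features[0])
    return [fmean(frame[d] for frame in per_frame_features) for d in range(dims)]


def build_condition(transcript_embedding, style_embedding):
    """Concatenate language and style embeddings (hypothetical fusion)."""
    return list(transcript_embedding) + list(style_embedding)


# Toy inputs: three reference frames with two features each,
# and a three-dimensional transcript embedding.
reference_frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
style = pool_style(reference_frames)
condition = build_condition([0.5, -0.5, 0.25], style)
```

Neither input carries a language or speaker label in this sketch, which mirrors the label-free setup the summary describes; whether such embeddings generalize to unseen languages and speakers is exactly what the premise leaves open.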

Figures

Figures reproduced from arXiv: 2604.16108 by Akin Caliskan, Claudio Ferrari, David Ferman, Federico Nocentini, Hyeongwoo Kim, Kwanggyoon Seo, Pablo Garrido, Qingju Liu, Stefano Berretti.

**Figure 1.** Polyglot, a deep learning architecture for speech-driven facial animation that preserves language and personal speaking styles during animation. II. RELATED WORKS In recent years, a wide range of models and methods have been developed to tackle the challenge of synchronizing facial animation with speech. Early approaches focused on procedural techniques [10], [24], [6], [43], [47], relying heavily on visem…

**Figure 2.** Left: Polyglot architecture. Audio $A_{0:T}$ is processed by mHuBERT $E_A$, Whisper $E_R$, and CLIP $E_T$ to extract features, transcripts, and language embeddings. A style embedding $S$ is computed from input motion $M^0_{0:T}$ via style encoder $E_S$. Conditioned on identity $\beta$, language $\hat{t}$, style $S$, and timestep $n$, the diffusion decoder $T_D$ denoises noisy parameters $M^n$ into motion $M^0$. Right: The style encoder $E_S$ extracts per-f…

**Figure 3.** Japanese

**Figure 6.** t-SNE visualization of text embeddings produced by concatenating…

**Figure 7.** t-SNE plot of personal style embeddings S for 20 identities across Ukrainian and Thai. The clusters show that the personal speaking style encoder ES consistently captures identity-specific personal speaking styles.

**Figure 8.** t-SNE visualization of personal style embeddings
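
The Figure 2 caption describes a decoder that denoises noisy motion parameters back into clean motion at a given timestep. A minimal sketch of how such noisy inputs are commonly produced in DDPM-style training follows; the linear beta schedule, the step count, the toy motion values, and the helper names `alpha_bar` and `add_noise` are assumptions, not details taken from the paper.

```python
import math
import random


def alpha_bar(n, num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_i) over the first n steps.

    Uses a linear beta schedule, which is an assumption for this sketch.
    """
    product = 1.0
    for i in range(n):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        product *= 1.0 - beta
    return product


def add_noise(clean_motion, n, num_steps, rng):
    """Sample a noisy sequence from clean motion in closed form.

    Each value becomes sqrt(a_bar) * value + sqrt(1 - a_bar) * gaussian.
    """
    a_bar = alpha_bar(n, num_steps)
    signal, noise = math.sqrt(a_bar), math.sqrt(1.0 - a_bar)
    return [[signal * v + noise * rng.gauss(0.0, 1.0) for v in frame]
            for frame in clean_motion]


rng = random.Random(0)
# Toy clean motion: 2 frames with 3 parameters each (hypothetical values).
clean_motion = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.4]]
noisy_motion = add_noise(clean_motion, n=500, num_steps=1000, rng=rng)
```

At step 0 the sample equals the clean motion, and larger steps mix in progressively more Gaussian noise. A conditional denoiser would receive this noisy sequence together with the identity, language, and style inputs named in the caption.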

Original abstract

Speech-Driven Facial Animation (SDFA) has gained significant attention due to its applications in movies, video games, and virtual reality. However, most existing models are trained on single-language data, limiting their effectiveness in real-world multilingual scenarios. In this work, we address multilingual SDFA, which is essential for realistic generation since language influences phonetics, rhythm, intonation, and facial expressions. Speaking style is also shaped by individual differences, not only by language. Existing methods typically rely on either language-specific or speaker-specific conditioning, but not both, limiting their ability to model their interaction. We introduce Polyglot, a unified diffusion-based architecture for personalized multilingual SDFA. Our method uses transcript embeddings to encode language information and style embeddings extracted from reference facial sequences to capture individual speaking characteristics. Polyglot does not require predefined language or speaker labels, enabling generalization across languages and speakers through self-supervised learning. By jointly conditioning on language and style, it captures expressive traits such as rhythm, articulation, and habitual facial movements, producing temporally coherent and realistic animations. Experiments show improved performance in both monolingual and multilingual settings, providing a unified framework for modeling language and personal style in SDFA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on standard machine learning assumptions about the representational power of embeddings and diffusion models rather than any explicitly stated axioms or invented entities in the abstract.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages

[1] T. Ao, Z. Zhang, and L. Liu. Gesturediffuclip: Gesture diffusion model with clip latents, 2023.

[2] V. Blanz and T. Vetter. A morphable model for the synthesis of 3d faces. In 26th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH 1999), pages 187–194. ACM Press, 1999.

[3] M. Z. Boito, V. Iyer, N. Lagos, L. Besacier, and I. Calapodescu. mHuBERT-147: A Compact Multilingual HuBERT Model. In Interspeech 2024, 2024.

[4] G. Bouritsas, S. Bokhnyak, S. Ploumpis, M. Bronstein, and S. Zafeiriou. Neural 3d morphable models: Spiral convolutional networks for 3d shape representation learning and generation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7213–7222, 2019.

[5] T. Brooks, A. Holynski, and A. A. Efros. Instructpix2pix: Learning to follow image editing instructions, 2023.

[6] P. Cosi, E. Caldognetto, G. Perin, and C. Zmarich. Labial coarticulation modeling for realistic facial animation. In Proceedings. Fourth IEEE International Conference on Multimodal Interfaces, pages 505–510, 2002.

[7] D. Cudeiro, T. Bolkart, C. Laidlaw, A. Ranjan, and M. Black. Capture, learning, and synthesis of 3D speaking styles. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 10101–10111, 2019.

[8] R. Daněček, K. Chhatre, S. Tripathi, Y. Wen, M. J. Black, and T. Bolkart. Emotional speech-driven animation with content-emotion disentanglement. arXiv preprint arXiv:2306.08990, 2023.

[9] X. Dong and D. S. Williamson. An attention enhanced multi-task model for objective speech assessment in real-world environments. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 911–915. IEEE, 2020.

[10] P. Edwards, C. Landreth, E. Fiume, and K. Singh. Jali: an animator-centric viseme model for expressive lip synchronization. ACM Trans. Graph., 35(4), Jul 2016.

[11] X. Fan, J. Li, Z. Lin, W. Xiao, and L. Yang. Unitalker: Scaling up audio-driven 3d facial animation through a unified model, 2024.

[12] Y. Fan, Z. Lin, J. Saito, W. Wang, and T. Komura. Faceformer: Speech-driven 3d facial animation with transformers. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18749–18758, New Orleans, LA, USA, Jun 2022. IEEE.

[13] C. Ferrari, S. Berretti, P. Pala, and A. Del Bimbo. A sparse and locally coherent morphable face model for dense semantic correspondence across heterogeneous 3d faces. IEEE transactions on pattern analysis and machine intelligence, 44(10):6667–6682, 2021.

[14] C. Ferrari, G. Lisanti, S. Berretti, and A. Del Bimbo. Dictionary learning based 3d morphable model construction for face recognition with varying expression and pose. In 2015 International Conference on 3D Vision, pages 509–517. IEEE, 2015.

[15] J. Guan, Z. Xu, H. Zhou, K. Wang, S. He, Z. Zhang, B. Liang, H. Feng, E. Ding, J. Liu, J. Wang, Y. Zhao, and Z. Liu. Resyncer: Rewiring style-based generator for unified audio-visually synced facial performer, 2024.

[16] K. I. Haque and Z. Yumak. Facexhubert: Text-less speech-driven e(x)pressive 3d facial animation synthesis using self-supervised speech representation learning. In INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION (ICMI ’23), New York, NY, USA,

[17] J. Ho and T. Salimans. Classifier-free diffusion guidance, 2022.

[18] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio, Speech and Lang. Proc., 29:3451–3460, Oct 2021.

[19] A. Kumar, K. Tan, Z. Ni, P. Manocha, X. Zhang, E. Henderson, and B. Xu. Torchaudio-squim: Reference-less speech quality and intelligibility measures in torchaudio. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.

[20] P. Ladefoged and S. F. Disner. Vowels and consonants. Manchu Grammar, 2000.

[21] R. Li, K. Bladin, Y. Zhao, C. Chinara, O. Ingraham, P. Xiang, X. Ren, P. Prasad, B. Kishore, J. Xing, and H. Li. Learning formation of physically-based face attributes, 2020.

[22] T. Li, T. Bolkart, M. J. Black, H. Li, and J. Romero. Learning a model of facial shape and expression from 4d scans. ACM Trans. Graph., 36(6):194–1, 2017.

[23] M. Lüthi, T. Gerig, C. Jud, and T. Vetter. Gaussian process morphable models. IEEE transactions on pattern analysis and machine intelligence, 40(8):1860–1873, 2017.

[24] D. Massaro, M. Cohen, M. Tabain, J. Beskow, and R. Clark. Animated speech: Research progress and applications. Audiovisual Speech Processing, 01 2001.

[25] T. Neumann, K. Varanasi, S. Wenger, M. Wacker, M. Magnor, and C. Theobalt. Sparse localized deformation components. ACM Transactions on Graphics (TOG), 32(6):1–10, 2013.

[26] F. Nocentini, T. Besnier, C. Ferrari, S. Arguillere, S. Berretti, and M. Daoudi. Beyond fixed topologies: Unregistered training and comprehensive evaluation metrics for 3d talking heads, 2024.

[27] F. Nocentini, T. Besnier, C. Ferrari, S. Arguillere, S. Berretti, and M. Daoudi. Scantalk: 3d talking heads from unregistered scans. In European Conference on Computer Vision (ECCV), pages 19–36, Cham, 2024. Springer Nature Switzerland.

[28] F. Nocentini, T. Besnier, C. Ferrari, S. Berretti, and M. Daoudi. Freetalk: Emotional topology-free 3d talking heads, 2026.

[29] F. Nocentini, C. Ferrari, and S. Berretti. Learning landmarks motion from speech for speaker-agnostic 3D talking heads generation. In G. L. Foresti, A. Fusiello, and E. Hancock, editors, International Conference on Image Analysis and Processing (ICIAP), pages 340–351, Cham, Springer Nature Switzerland.

[31] F. Nocentini, C. Ferrari, and S. Berretti. Emovoca: Speech-driven emotional 3D talking heads. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2025.

[32] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter. A 3d face model for pose and illumination invariant face recognition. In 2009 sixth IEEE international conference on advanced video and signal based surveillance, pages 296–301. IEEE, 2009.

[33] Z. Peng, Y. Luo, Y. Shi, H. Xu, X. Zhu, H. Liu, J. He, and Z. Fan. Selftalk: A self-supervised commutative training diagram to comprehend 3d talking faces. In Proceedings of the 31st ACM International Conference on Multimedia, pages 5292–5301, 2023.

[34] Z. Peng, H. Wu, Z. Song, H. Xu, X. Zhu, J. He, H. Liu, and Z. Fan. Emotalk: Speech-driven emotional disentanglement for 3d face animation, 2023.

[36] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision, 2021.

[37] A. Richard, M. Zollhöfer, Y. Wen, F. de la Torre, and Y. Sheikh. Meshtalk: 3d face animation from speech using cross-modality disentanglement. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 1173–1182, October 2021.

[38] S. Schneider, A. Baevski, R. Collobert, and M. Auli. wav2vec: Unsupervised pre-training for speech recognition. In Interspeech 2019, pages 3465–3469. ISCA, Sept. 2019.

[39] S. Stan, K. I. Haque, and Z. Yumak. Facediffuser: Speech-driven 3d facial animation synthesis using diffusion. In ACM SIGGRAPH Conference on Motion, Interaction and Games (MIG ’23), November 15–17, 2023, Rennes, France, New York, NY, USA, 2023. ACM.

[40] Z. Sun, T. Lv, S. Ye, M. Lin, J. Sheng, Y.-H. Wen, M. Yu, and Y.-j. Liu. Diffposetalk: Speech-driven stylistic 3d facial animation and head pose generation via diffusion models. ACM Transactions on Graphics (TOG), 43(4):1–9, 2024.

[41] K. Sung-Bin, L. Chae-Yeon, G. Son, O. Hyun-Bin, J. Ju, S. Nam, and T.-H. Oh. Multitalk: Enhancing 3d talking head generation across languages with multilingual video dataset, 2024.

[42] B. Thambiraja, S. Aliakbarian, D. Cosker, and J. Thies. 3diface: Diffusion-based speech-driven 3d facial animation and editing, 2023.

[43] B. Thambiraja, I. Habibie, S. Aliakbarian, D. Cosker, C. Theobalt, and J. Thies. Imitator: Personalized speech-driven 3d facial animation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 20621–20631, October 2023.

[44] A. Wang, M. Emmi, and P. Faloutsos. Assembling an expressive facial animation system. In Proceedings of the 2007 ACM SIGGRAPH Symposium on Video Games, Sandbox ’07, pages 21–26, New York, NY, USA, 2007. Association for Computing Machinery.

[45] S. Wu, K. I. Haque, and Z. Yumak. Probtalk3d: Non-deterministic emotion controllable speech-driven 3d facial animation synthesis using vq-vae. In The 17th ACM SIGGRAPH Conference on Motion, Interaction, and Games, MIG ’24, pages 1–12. ACM, Nov. 2024.

[46] C.-h. Wuu, N. Zheng, S. Ardisson, R. Bali, D. Belko, E. Brockmeyer, L. Evans, T. Godisart, H. Ha, X. Huang, A. Hypes, T. Koska, S. Krenn, S. Lombardi, X. Luo, K. McPhail, L. Millerschoen, M. Perdoch, M. Pitts, A. Richard, J. Saragih, J. Saragih, T. Shiratori, T. Simon, M. Stewart, A. Trimble, X. Weng, D. Whitewolf, C. Wu, S.-I. Yu, and Y. Sheikh. Multifa...

[47] J. Xing, M. Xia, Y. Zhang, X. Cun, J. Wang, and T.-T. Wong. Codetalker: Speech-driven 3d facial animation with discrete motion prior. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12780–12790, 2023.

[48] Y. Xu, A. W. Feng, S. Marsella, and A. Shapiro. A practical and configurable lip sync method for games. In Proceedings of Motion on Games, MIG ’13, pages 131–140, New York, NY, USA, 2013. Association for Computing Machinery.

[49] Z. Xu, S. Gong, J. Tang, L. Liang, Y. Huang, H. Li, and S. Huang. Kmtalk: Speech-driven 3d facial animation with key motion embedding, 2024.

[50] L. Zhang, S. Liang, Z. Ge, and T. Hu. Personatalk: Bring attention to your persona in visual dubbing, 2024.

[51] W. Zhang, X. Cun, X. Wang, Y. Zhang, X. Shen, Y. Guo, Y. Shan, and F. Wang. Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8652–8661, 2023.
