arxiv: 2512.11321 · v4 · submitted 2025-12-12 · 💻 cs.CV

KeyframeFace: Language-Driven Facial Animation via Semantic Keyframes

Pith reviewed 2026-05-16 22:43 UTC · model grok-4.3

classification 💻 cs.CV
keywords facial animation · language-driven animation · semantic keyframes · ARKit control space · LLM priors · expression fidelity · multimodal dataset

The pith

KeyframeFace generates facial animations from language by predicting sequences of semantic keyframes in ARKit space rather than dense motion trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a shift from direct text-to-continuous-frame regression to a keyframe-based paradigm for facial animation. This lets the model produce interpretable sequences of semantically meaningful control points that align with natural language descriptions and emotion cues. A new multimodal dataset pairs 2,100 expression scripts with videos, ARKit coefficients, and manually annotated keyframes to train the language-driven model. Experiments demonstrate that semantic keyframe supervision and language priors yield higher expression fidelity and better semantic alignment than methods without explicit facial action semantics. The approach aims to bring the sparse, controllable structure of traditional animation production into language-driven generation.

Core claim

Instead of regressing dense facial motion trajectories, KeyframeFace represents animation as a sequence of semantically meaningful keyframes in an interpretable ARKit-based facial control space. A language-driven model uses large language model priors to generate these keyframes so they align with contextual text descriptions and emotion cues, supported by a dataset of 2,100 expression scripts paired with monocular videos, per-frame ARKit coefficients, and manually annotated semantic keyframes.
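
As a rough illustration of the keyframe representation described above, the Python sketch below stores each keyframe as a timestamp, a semantic description, and a sparse set of ARKit blendshape weights, then expands the sequence into dense per-frame coefficients. The `Keyframe` class, the `interpolate` function, the 30 fps default, and the use of linear interpolation are all assumptions made for illustration; this page does not say how the paper turns keyframes into continuous motion. `jawOpen` and `browInnerUp` are standard ARKit blendshape names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Keyframe:
    time_s: float             # when the keyframe occurs, in seconds
    description: str          # semantic label for the expression
    coeffs: Dict[str, float]  # sparse ARKit weights; unlisted names count as 0.0


def interpolate(keyframes: List[Keyframe], fps: int = 30) -> List[Dict[str, float]]:
    """Linearly blend sparse keyframes into one coefficient dict per frame.

    Linear blending is an illustrative assumption, not the paper's method.
    """
    if not keyframes:
        return []
    kfs = sorted(keyframes, key=lambda k: k.time_s)
    names = sorted({name for k in kfs for name in k.coeffs})
    n_frames = int(round(kfs[-1].time_s * fps)) + 1
    frames: List[Dict[str, float]] = []
    for i in range(n_frames):
        t = i / fps
        # Default to the full range so times before the first keyframe hold its pose.
        prev, nxt = kfs[0], kfs[-1]
        for a, b in zip(kfs, kfs[1:]):
            if a.time_s <= t <= b.time_s:
                prev, nxt = a, b
                break
        span = nxt.time_s - prev.time_s
        w = 0.0 if span <= 0 else min(max((t - prev.time_s) / span, 0.0), 1.0)
        frames.append({
            n: (1.0 - w) * prev.coeffs.get(n, 0.0) + w * nxt.coeffs.get(n, 0.0)
            for n in names
        })
    return frames


# Hypothetical two-keyframe script: neutral face opening into surprise.
script = [
    Keyframe(0.0, "neutral face", {"jawOpen": 0.0}),
    Keyframe(1.0, "surprised, brows raised", {"jawOpen": 0.8, "browInnerUp": 1.0}),
]
dense = interpolate(script, fps=2)  # three frames at t = 0.0, 0.5, 1.0
```

Because each keyframe carries its own description and named weights, editing one expression means changing a single entry in `script` rather than retouching every dense frame, which is the editability argument the core claim relies on.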

What carries the argument

Semantic keyframe sequences in the ARKit facial control space, which supply explicit semantic structure and enable precise alignment with language intent.

If this is right

  • Enables precise editing and higher interpretability because each keyframe is a discrete, semantically labeled control point.
  • Improves expression fidelity and semantic alignment over direct dense regression baselines.
  • Supports more efficient content creation by letting users drive animation from natural language scripts.
  • Provides a structured way to inject LLM priors without entangling high-level intent with low-level motion parameters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The keyframe formulation could integrate directly with existing professional animation pipelines that already rely on sparse keyframes for artist control.
  • Expanding the dataset to include more varied cultural or contextual expressions might reduce current limits on what language can reliably trigger.
  • If inference speed is optimized, the method could support interactive applications such as real-time character response in games or virtual environments.
  • The explicit semantic structure may help diagnose and correct mismatches between text input and output motion more easily than opaque dense models.

Load-bearing premise

The ARKit facial control space together with manually annotated semantic keyframes is assumed to fully capture and align with the semantic intent expressed in natural language descriptions without significant loss of expressiveness or ambiguity.

What would settle it

A controlled test set of language descriptions containing subtle emotional nuances or expressions outside the ARKit parameter vocabulary, where the generated keyframes produce visibly mismatched animations, would show the alignment claim does not hold.

Figures

Figures reproduced from arXiv: 2512.11321 by Haibo Liu, Jingchao Wu, Xiangru Huang, Yuanchen Fei, Zejian Kang.

Figure 1: Our text-to-animation model enabled by the KeyframeFace dataset. Given descriptions of contextual background and desired emotions, our text-to-animation framework produces keyframe-level facial descriptions and generates the corresponding ARKit coefficients that can be directly converted into expressive facial animations and videos via tools like MetaHuman.
Figure 2: Overview of our data pipeline. (a) script generation with contextual keyframe descriptions, (b) ARKit-based motion capture, (c) synchronized video and coefficient recording, (d) manual keyframe selection, and (e) multi-perspective augmentation using LLM and MLLM.
Figure 3: Data visualization examples from the KeyframeFace dataset. Each row represents a distinct expressive scenario with three keyframes. Text boxes above and below depict hierarchical annotations, combining script-level, ARKit-based, and image-based annotations. This structure enables consistent semantic-to-visual alignment and supports controllable text-to-expression modeling.
Figure 4: Overview of our LLM-based Text-driven Facial Animation Framework. Our approach involves: (a) Input Standardization stage that transforms diverse user inputs into unified keyframe descriptions through LLM-based analysis, and (b) Text-To-Animation that generates and renders facial animations via fine-tuned LLM.
Figure 5: Text-To-Animation Model Architecture. The model converts standardized keyframe descriptions into ARKit parameters through prompt engineering and a recursive generation strategy to produce 61-dimensional coefficient vectors for facial animation.
Algorithm 1: Text2ARKit Generation Procedure. Input: script S; system prompt P_sys. Output: ARKit parameter sequence A_seq = {A_1, A_2, ..., A_n}. Extract keyframes from S…
Figure 6: Generating Ground-Truth-Like Facial Expressions from Text: Our Method vs. Diffusion Baseline. Our method faithfully captures the background context, emotional cues, and keyframe descriptions in the text prompt, enabling it to generate facial expressions that closely resemble the ground truth. In contrast, the diffusion baseline fails to capture the intended semantics in certain cases.
Figure 7 illustrates several challenging cases where our model fails to fully reproduce fine-grained facial details. In particular, we observe occasional inaccuracies in mouth articulation and nuanced gaze control when the textual description involves subtle emotional cues. Despite these limitations, our method still maintains stronger semantic alignment compared …
Figure 8: Distribution of Keyframes per Video Clip. The dataset exhibits a reasonable keyframe distribution pattern with most clips containing 1-3 keyframes, reflecting typical emotional expression structures.
Figure 10: Video Clip Duration Distribution. Histogram showing the distribution of clip durations across the dataset, with median duration of 8.3 seconds.
Figure 9: Recording Duration per Actor. Distribution of total recording minutes across all 21 anonymized actors (A01–A21), showing variation in individual contribution while maintaining balanced clip counts.
Figure 11: Emotion Distribution and Visualization. (a) t-SNE projection of keyframes colored by emotion intensities. (b) Distribution of dominant emotions across all keyframes.
Original abstract

Facial animation is a core component for creating digital characters in the computer graphics (CG) industry. A typical production workflow relies on sparse, semantically meaningful keyframes to precisely control facial expressions. Enabling such animation directly from natural-language descriptions could significantly improve content creation efficiency and accessibility. However, most existing methods adopt a text-to-continuous-frames paradigm, directly regressing dense facial motion trajectories from language. This formulation entangles high-level semantic intent with low-level motion, lacks explicit semantic control structure, and limits precise editing and interpretability. Inspired by the keyframe paradigm in animation production, we propose KeyframeFace, a framework for semantic facial animation from language via interpretable keyframes. Instead of predicting dense motion trajectories, our method represents animation as a sequence of semantically meaningful keyframes in an interpretable ARKit-based facial control space. A language-driven model leverages large language model (LLM) priors to generate keyframes that align with contextual text descriptions and emotion cues. To support this formulation, we construct a multimodal dataset comprising 2,100 expression scripts paired with monocular videos, per-frame ARKit coefficients, and manually annotated semantic keyframes. Experiments show that incorporating semantic keyframe supervision and language priors significantly improves expression fidelity and semantic alignment compared to methods that do not use facial action semantics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes KeyframeFace, a framework for language-driven facial animation that predicts sequences of interpretable semantic keyframes in an ARKit blendshape control space instead of regressing dense motion trajectories. It constructs a new multimodal dataset of 2,100 expression scripts paired with videos, per-frame ARKit coefficients, and manually annotated semantic keyframes, and leverages LLM priors to align keyframes with contextual text and emotion descriptions. The central claim is that semantic keyframe supervision plus language priors yields significantly better expression fidelity and semantic alignment than non-semantic baselines.

Significance. If the experimental claims are substantiated with quantitative evidence, the work could meaningfully advance language-to-animation pipelines by aligning with professional keyframe workflows, improving interpretability, editability, and semantic control. The introduction of a dedicated multimodal dataset with ARKit annotations represents a concrete resource contribution for the community.

major comments (3)
  1. [Abstract] Abstract: The headline claim that 'experiments show that incorporating semantic keyframe supervision and language priors significantly improves expression fidelity and semantic alignment' supplies no quantitative metrics, baseline methods, error analysis, or dataset statistics. This absence is load-bearing for the central claim and prevents verification of the reported gains.
  2. [§3] §3 (Dataset and Keyframe Annotation): The approach assumes that the fixed ARKit blendshape basis (~52 units) together with 2,100 manually annotated semantic keyframes faithfully encodes the semantic intent of natural-language prompts without material loss of expressiveness for nuanced or compound expressions. No validation (inter-annotator agreement, expressiveness ablation, or comparison to higher-dimensional spaces) is described, which directly affects whether the observed improvements reflect true semantic alignment or artifacts of the restricted control space.
  3. [§4] §4 (Experiments): The results must specify the exact baselines (e.g., direct text-to-continuous-frame regression methods), evaluation metrics (e.g., coefficient MSE, semantic alignment scores, user studies), and statistical significance tests. Without these details the improvement claim cannot be assessed and the comparison to 'methods that do not use facial action semantics' remains undefined.
minor comments (1)
  1. [Abstract] The abstract would benefit from a concise statement of the dataset size and the primary quantitative metrics used to support the improvement claim.
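
Two of the quantities requested in major comment 3, coefficient MSE and a paired significance test, can be sketched in a few lines of Python. The helper names `coefficient_mse` and `paired_t_statistic` are hypothetical, and the paper's exact metric definitions are not given on this page, so treat this as one plausible reading rather than the authors' evaluation code. The second function returns only the t statistic; judging significance still requires comparing it against Student's t distribution with n - 1 degrees of freedom.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def coefficient_mse(pred: Sequence[Sequence[float]],
                    target: Sequence[Sequence[float]]) -> float:
    """Mean squared error over every frame and every coefficient."""
    if len(pred) != len(target) or any(len(p) != len(t) for p, t in zip(pred, target)):
        raise ValueError("pred and target must have the same shape")
    squared = [(p - t) ** 2 for pf, tf in zip(pred, target) for p, t in zip(pf, tf)]
    if not squared:
        raise ValueError("inputs are empty")
    return sum(squared) / len(squared)


def paired_t_statistic(errors_a: Sequence[float], errors_b: Sequence[float]) -> float:
    """t statistic for per-clip errors of two methods evaluated on the same clips."""
    if len(errors_a) != len(errors_b) or len(errors_a) < 2:
        raise ValueError("need two equally long lists with at least two clips")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("paired differences have zero variance")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))
```

Pairing matters here because both methods are scored on the same clips: the test looks at per-clip differences, so clips that are hard for every method do not inflate the variance.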

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We agree that the abstract and experimental sections require additional quantitative details and clarifications to fully substantiate the claims. We address each major comment below and will incorporate revisions where appropriate to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The headline claim that 'experiments show that incorporating semantic keyframe supervision and language priors significantly improves expression fidelity and semantic alignment' supplies no quantitative metrics, baseline methods, error analysis, or dataset statistics. This absence is load-bearing for the central claim and prevents verification of the reported gains.

    Authors: We acknowledge that the abstract does not include specific quantitative metrics or baseline details. The full manuscript (Section 4) reports concrete improvements using coefficient MSE for expression fidelity, semantic alignment scores based on LLM priors, and user study results, with the dataset comprising 2,100 expression scripts. We will revise the abstract to explicitly state key quantitative gains (e.g., relative reductions in MSE and alignment improvements) and include dataset statistics to make the central claim verifiable. revision: yes

  2. Referee: [§3] §3 (Dataset and Keyframe Annotation): The approach assumes that the fixed ARKit blendshape basis (~52 units) together with 2,100 manually annotated semantic keyframes faithfully encodes the semantic intent of natural-language prompts without material loss of expressiveness for nuanced or compound expressions. No validation (inter-annotator agreement, expressiveness ablation, or comparison to higher-dimensional spaces) is described, which directly affects whether the observed improvements reflect true semantic alignment or artifacts of the restricted control space.

    Authors: The ARKit blendshape basis (~52 units) is an industry-standard interpretable space for facial control, and our dataset provides 2,100 manually annotated semantic keyframes aligned with text and emotion descriptions. We agree that explicit validation is needed; we will add inter-annotator agreement metrics and an expressiveness ablation study in the revised Section 3. A comparison to higher-dimensional control spaces lies outside the paper's scope, which focuses on semantic keyframing within the standard ARKit representation used in production pipelines. revision: partial

  3. Referee: [§4] §4 (Experiments): The results must specify the exact baselines (e.g., direct text-to-continuous-frame regression methods), evaluation metrics (e.g., coefficient MSE, semantic alignment scores, user studies), and statistical significance tests. Without these details the improvement claim cannot be assessed and the comparison to 'methods that do not use facial action semantics' remains undefined.

    Authors: Section 4 already compares against direct text-to-continuous regression baselines without semantic supervision. We will revise the section to explicitly enumerate all baselines, detail the metrics (coefficient MSE, semantic alignment scores, and perceptual user studies), and include statistical significance tests such as paired t-tests. These elements support the claim of improved fidelity and alignment over non-semantic methods and will be clarified for unambiguous assessment. revision: yes
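
The inter-annotator agreement promised in response 2 could take many forms, and neither the report nor the rebuttal says which one the authors intend. One simple possibility, sketched below in Python, treats two annotators' keyframe timestamps for the same clip as matching when they fall within a tolerance and reports an F1-style score. The function name `keyframe_agreement`, the 0.2 s tolerance, and the greedy in-order matching are all assumptions for illustration.

```python
from typing import Sequence


def keyframe_agreement(times_a: Sequence[float],
                       times_b: Sequence[float],
                       tol_s: float = 0.2) -> float:
    """F1-style agreement between two annotators' keyframe timestamps (seconds).

    Timestamps are matched greedily in temporal order; each keyframe can be
    matched at most once. The tolerance value is an illustrative assumption.
    """
    a, b = sorted(times_a), sorted(times_b)
    if not a and not b:
        return 1.0  # both annotators agree that there are no keyframes
    i = j = matched = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tol_s:
            matched += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] is too early to match any remaining b
        else:
            j += 1  # b[j] is too early to match any remaining a
    return 2 * matched / (len(a) + len(b))
```

A score of this kind only checks when annotators place keyframes; agreement on the semantic labels attached to those keyframes would need a separate measure.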

Circularity Check

0 steps flagged

No circularity detected in KeyframeFace derivation chain

full rationale

The paper introduces a new multimodal dataset of 2,100 scripts with videos, ARKit coefficients, and manually annotated semantic keyframes, then trains a language-driven model using external LLM priors to output interpretable keyframes instead of dense trajectories. No equation or claim reduces a reported prediction to a fitted parameter defined by the target result, nor does any load-bearing step rely on self-citation for uniqueness or ansatz smuggling. The improvement over non-semantic baselines is measured on the newly constructed data and is therefore independent of the final claim.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the sufficiency of the ARKit coefficient space for semantic expression and the alignment of LLM priors with manually annotated keyframes; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption ARKit-based facial control space is sufficient to represent semantically meaningful expressions from language descriptions
    Method represents all animation via ARKit coefficients for keyframes.

pith-pipeline@v0.9.0 · 5535 in / 1080 out tokens · 29511 ms · 2026-05-16T22:43:46.136004+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AudioFace: Language-Assisted Speech-Driven Facial Animation with Multimodal Language Models

    cs.CV 2026-05 unverdicted novelty 5.0

    AudioFace improves speech-driven facial animation by guiding blendshape prediction with linguistic and articulatory information extracted via multimodal language models.

  2. SuperFace: Preference-Aligned Facial Expression Estimation Beyond Pseudo Supervision

    cs.CV 2026-05 unverdicted novelty 5.0

    SuperFace refines ARKit facial expression estimation by using human preference feedback on rendered faces to optimize beyond noisy pseudo-label supervision from capture software.
