pith. machine review for the scientific record.

arxiv: 2602.09534 · v2 · submitted 2026-02-10 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

AUHead: Realistic Emotional Talking Head Generation via Action Units Control

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 05:45 UTC · model grok-4.3

classification: 💻 cs.CV
keywords: talking head generation · action units · emotional control · diffusion models · audio-language models · controllable video synthesis · lip synchronization

The pith

Disentangling Action Units from audio via audio-language models enables controllable generation of emotionally realistic talking-head videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AUHead, a two-stage method that first extracts fine-grained Action Units from raw speech using large audio-language models and then drives a diffusion model to synthesize videos conditioned on those units. The first stage applies spatial-temporal AU tokenization together with an emotion-then-AU chain-of-thought process to capture subtle emotional cues without visual input. The second stage maps the resulting AU sequences into structured 2D facial representations and models their interaction inside cross-attention layers, adding an AU disentanglement guidance step at inference to balance expressiveness and identity consistency. A sympathetic reader would care because current talking-head systems lack precise emotional control; if the approach works, it supplies an explicit, disentangled handle on expressions that improves realism, lip synchronization, and coherence on benchmark data.
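To make the two-stage split concrete, the sketch below shows the interface the pipeline implies: raw speech goes in, a per-frame 24-dimensional AU sequence comes out of Stage 1 (the dimensionality follows Figure 2), and Stage 2 turns that sequence plus a portrait into frames. The function names and placeholder bodies are illustrative assumptions, not the authors' code.

```python
# Minimal, hypothetical sketch of the AUHead two-stage interface.
# extract_au_sequence / synthesize_video are illustrative names, not the
# authors' API; only the shapes (T frames x 24 AUs) follow the paper.
import numpy as np

def extract_au_sequence(audio: np.ndarray, sample_rate: int, fps: int = 25) -> np.ndarray:
    """Stage 1 stand-in: map raw speech to a (T, 24) AU intensity sequence.

    In the paper this is done by an audio-language model with spatial-temporal
    AU tokenization and emotion-then-AU chain-of-thought prompting; here we
    only return a placeholder of the right shape.
    """
    n_frames = int(len(audio) / sample_rate * fps)
    return np.zeros((n_frames, 24), dtype=np.float32)  # AU intensities in [0, 1]

def synthesize_video(portrait: np.ndarray, au_seq: np.ndarray) -> np.ndarray:
    """Stage 2 stand-in: the AU-conditioned diffusion model would map each
    24-d AU vector to a structured 2D facial representation and attend to it
    in cross-attention; here we simply tile the portrait per frame."""
    return np.repeat(portrait[None], au_seq.shape[0], axis=0)

# Example: 2 seconds of 16 kHz audio driving a 256x256 RGB portrait.
audio = np.zeros(2 * 16000, dtype=np.float32)
portrait = np.zeros((256, 256, 3), dtype=np.uint8)
frames = synthesize_video(portrait, extract_au_sequence(audio, 16000))
print(frames.shape)  # (50, 256, 256, 3)
```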

Core claim

By tokenizing Action Units spatially and temporally and prompting audio-language models with an emotion-then-AU chain-of-thought sequence, AU sequences can be disentangled from raw speech. These sequences are then mapped to 2D facial representations and supplied to a diffusion model whose cross-attention layers learn AU-vision interactions, with an inference-time disentanglement guidance mechanism that trades off AU fidelity against visual quality. The result is talking-head videos with competitive emotional realism, accurate lip synchronization, and visual coherence on standard benchmarks.

What carries the argument

The AU extraction pipeline inside audio-language models that uses spatial-temporal tokenization and emotion-then-AU chain-of-thought prompting, followed by the AU-driven diffusion model that conditions synthesis on AU sequences after mapping them to structured 2D facial representations.
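The paper does not reproduce its prompt template, so the snippet below is only a guess at what an emotion-then-AU chain-of-thought prompt and its parsing could look like; the wording, the JSON output format, and parse_au_response are assumptions for illustration. Only the two-step "emotion first, then AU intensities" ordering comes from the paper.

```python
# Illustrative emotion-then-AU prompt construction and response parsing.
# The prompt wording and JSON format are assumptions; only the two-step
# ordering (emotion, then AU intensities) follows the paper.
import json

def build_emotion_then_au_prompt(n_aus: int = 24) -> str:
    return (
        "Listen to the attached speech segment.\n"
        "Step 1: name the speaker's emotional state in one word.\n"
        "Step 2: given that emotion, output the facial Action Unit intensities "
        f"for this time window as a JSON list of {n_aus} numbers in [0, 1]."
    )

def parse_au_response(text: str) -> list[float]:
    # Expect the model to end its answer with the JSON list from Step 2.
    start = text.rfind("[")
    return [float(v) for v in json.loads(text[start:])]

print(build_emotion_then_au_prompt())
print(parse_au_response("The emotion is happy. [0.1, 0.8, 0.0]"))
```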

If this is right

  • Emotional realism, lip synchronization accuracy, and visual coherence improve over prior methods on benchmark datasets.
  • AU disentanglement guidance during inference supplies explicit control over the expressiveness-identity trade-off (see the guidance sketch after this list).
  • Mapping AU sequences to 2D facial representations preserves spatial fidelity in the generated frames.
  • The two-stage separation allows independent improvement of the audio-to-AU extractor or the AU-to-video synthesizer.
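The abstract does not spell out the guidance formula, so the sketch below assumes it follows the usual classifier-free-guidance recipe: two denoiser passes, one ignoring the AU condition, combined with a scale that Figure 3 sweeps against FID and emotion accuracy. Treat the exact combination as an assumption, not the authors' equation.

```python
# Hypothetical AU disentanglement guidance step, written in the standard
# classifier-free-guidance form; the paper's exact formula is not given in
# the abstract, so this is an assumption about its shape.
import numpy as np

def guided_noise_prediction(eps_uncond: np.ndarray,
                            eps_au_cond: np.ndarray,
                            guidance_scale: float) -> np.ndarray:
    """Push the denoiser output toward the AU-conditioned prediction.

    A scale of 0 ignores the AU condition; larger scales strengthen emotion
    expression at some cost in visual quality, the trade-off Figure 3 sweeps.
    """
    return eps_uncond + guidance_scale * (eps_au_cond - eps_uncond)

# Toy usage with random stand-ins for the two denoiser outputs.
rng = np.random.default_rng(0)
eps_u, eps_c = rng.normal(size=(2, 4, 64, 64))
print(guided_noise_prediction(eps_u, eps_c, guidance_scale=2.0).shape)  # (4, 64, 64)
```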

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same AU extraction step could be reused as a plug-in module for other facial animation pipelines that already accept AU input.
  • If the audio-language model component generalizes across languages and accents, the method could reduce reliance on paired audio-visual training data for new domains.
  • Extending the guidance strategy to continuous AU intensity values might allow finer real-time expression editing in interactive avatars.
  • A natural next measurement would compare the predicted AU sequences against those extracted directly from the generated video frames to quantify consistency.
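That consistency check is easy to state concretely: re-detect AUs from the generated frames with any off-the-shelf AU detector and score them against the Stage-1 predictions. The sketch below assumes both are (T, 24) intensity sequences in [0, 1]; the detector choice and the metric pair (per-AU MAE and Pearson correlation) are editorial choices, not the paper's protocol.

```python
# Sketch of the proposed consistency measurement: per-AU MAE and Pearson
# correlation between Stage-1 predicted AUs and AUs detected from the
# generated video. Shapes assume (T, 24) intensity sequences in [0, 1].
import numpy as np

def au_consistency(pred_aus: np.ndarray, detected_aus: np.ndarray) -> dict:
    mae = np.abs(pred_aus - detected_aus).mean(axis=0)         # per-AU MAE
    p = pred_aus - pred_aus.mean(axis=0)                        # center over time
    d = detected_aus - detected_aus.mean(axis=0)
    denom = np.linalg.norm(p, axis=0) * np.linalg.norm(d, axis=0) + 1e-8
    corr = (p * d).sum(axis=0) / denom                          # per-AU Pearson r
    return {"mae_per_au": mae, "pearson_per_au": corr}

# Toy example: 10 seconds at 25 fps, detector output = prediction + noise.
rng = np.random.default_rng(1)
pred = rng.uniform(size=(250, 24))
detected = np.clip(pred + rng.normal(scale=0.1, size=pred.shape), 0.0, 1.0)
stats = au_consistency(pred, detected)
print(stats["mae_per_au"].mean(), stats["pearson_per_au"].mean())
```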

Load-bearing premise

Action Units can be reliably disentangled from raw audio alone via chain-of-thought prompting in audio-language models without visual supervision or domain-specific fine-tuning.

What would settle it

A test set of audio clips whose emotional content is independently labeled by human raters; if the AU intensities predicted by the first stage and the resulting video expressions do not match the labels at rates significantly above chance, the disentanglement claim fails.
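One minimal way to operationalize "significantly above chance" is an exact one-sided binomial test on the rater-label matches; the chance rate (for example 1/8 under eight emotion categories) and the clip counts below are assumptions of the protocol, not numbers from the paper.

```python
# Exact one-sided binomial test for "matches the human emotion label at a
# rate above chance". The chance level and sample sizes are illustrative.
from math import comb

def binomial_pvalue(n_correct: int, n_trials: int, chance: float) -> float:
    """P(X >= n_correct) under X ~ Binomial(n_trials, chance)."""
    return sum(comb(n_trials, k) * chance**k * (1 - chance)**(n_trials - k)
               for k in range(n_correct, n_trials + 1))

# Example: 8 emotion classes (chance = 1/8), 200 rated clips, 60 matches.
p = binomial_pvalue(60, 200, chance=1 / 8)
print(f"p-value vs. chance: {p:.2e}")  # far below 0.05, i.e. above chance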

Figures

Figures reproduced from arXiv: 2602.09534 by Hanyu Jiang, Jian Xue, Jiayi Lyu, Kai Liu, Leigang Qu, Tat-Seng Chua, Wenjing Zhang, Xiaobo Xia, Zhenglin Zhou.

Figure 1: Framework comparison between existing talking head generation and our AUHead. (a) Direct generation from audio and portrait. (b) Our method: audio understanding via ALM, and then generation. Existing methods typically feed input audio and a target portrait into a generative model directly …
Figure 2: Overview of the two-stage AU-guided talking head generation framework. Stage 1 stimulates the AU generation abilities of audio-language models to get 24-dimensional AU sequences from input audio, capturing facial motion dynamics. Stage 2 models the interaction between AU and visual facial representations in a diffusion model to synthesize identity-preserving, emotionally expressive, and lip-synchronized …
Figure 3: Impact of different AU guidance scales on visual quality (FID) and emotion expression (Emotion ACC and MAE). ⋆: the best quality-emotion trade-off.
Figure 6: Visualization of generated frames and their …
Figure 7: Visualization of AUHead's generalization across 10-second sequences. Each row corresponds to a unique combination of unseen audio and a new target identity. For each sequence, one frame is randomly sampled from each second of the generated video. The examples cover three visual styles, line-art sketches, oil-painting portraits, and realistic face images (Zhang et al., 2021), to illustrate the model's behavior …
Figure 8: Visual examples of AU activations at different intensity levels in the FEAFA dataset. Each …
Figure 9: Frame-level AU verification interface. Each frame displays the annotated AU values, the …
Figure 10: Facial animation comparison under neutral audio using AU from Qwen-Audio-Chat.
Figure 11: Qualitative comparison with SOTA methods on MEAD and CREMA.
Figure 12: Qualitative comparison with SOTA methods on MEAD: re…
Figure 13: Qualitative comparison with SOTA methods on MEAD: re…
Figure 14: Qualitative comparison with SOTA methods on MEAD: re…
Figure 15: Qualitative comparison with SOTA methods on CREMA: …
Figure 16: Qualitative comparison with SOTA methods on CREMA: …
Figure 17: Qualitative comparison with SOTA methods on CREMA: …
Original abstract

Realistic talking-head video generation is critical for virtual avatars, film production, and interactive systems. Current methods struggle with nuanced emotional expressions due to the lack of fine-grained emotion control. To address this issue, we introduce a novel two-stage method (AUHead) to disentangle fine-grained emotion control, i.e., Action Units (AUs), from audio and achieve controllable generation. In the first stage, we explore the AU generation abilities of large audio-language models (ALMs), by spatial-temporal AU tokenization and an "emotion-then-AU" chain-of-thought mechanism. It aims to disentangle AUs from raw speech, effectively capturing subtle emotional cues. In the second stage, we propose an AU-driven controllable diffusion model that synthesizes realistic talking-head videos conditioned on AU sequences. Specifically, we first map the AU sequences into the structured 2D facial representation to enhance spatial fidelity, and then model the AU-vision interaction within cross-attention modules. To achieve flexible AU-quality trade-off control, we introduce an AU disentanglement guidance strategy during inference, further refining the emotional expressiveness and identity consistency of the generated videos. Results on benchmark datasets demonstrate that our approach achieves competitive performance in emotional realism, accurate lip synchronization, and visual coherence, significantly surpassing existing techniques. Our implementation is available at https://github.com/laura990501/AUHead_ICLR

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces AUHead, a two-stage framework for emotional talking-head video generation. Stage 1 uses large audio-language models with spatial-temporal AU tokenization and an 'emotion-then-AU' chain-of-thought mechanism to produce Action Unit sequences directly from raw audio. Stage 2 conditions a diffusion model on these AU sequences after mapping them to structured 2D facial representations, employing cross-attention for AU-vision interaction and an AU disentanglement guidance strategy at inference for controllable trade-offs. The authors report that the method achieves competitive performance in emotional realism, lip synchronization, and visual coherence while significantly surpassing prior techniques on benchmark datasets.

Significance. If the first-stage AU sequences prove accurate and disentangled, the approach could enable finer-grained, audio-driven emotion control in talking heads without requiring paired visual supervision during AU extraction, which would be a meaningful advance over existing audio-to-video methods that rely on coarser emotion labels or direct visual conditioning. The open-source code release supports reproducibility.

major comments (3)
  1. §3.1 (AU Generation via ALM): The central claim that the 'emotion-then-AU' CoT produces reliable, disentangled AU sequences from audio alone lacks any quantitative validation against ground-truth AUs extracted from real video (e.g., AU detection F1 or correlation metrics); only downstream video quality is evaluated, so misalignment between generated AUs and actual facial dynamics cannot be ruled out, which directly undermines the 'significantly surpassing' comparison.
  2. §4 (Experiments): No ablation studies isolate the contribution of the ALM stage versus the diffusion stage, nor are error analyses or failure cases for AU prediction provided; without these, it is impossible to determine whether the reported gains in emotional realism stem from accurate AU control or from the diffusion model's general capacity.
  3. §3.2 (AU-driven Diffusion): The mapping of AU sequences to 2D facial representations and the cross-attention interaction are described at a high level, but the paper does not specify how AU intensity values are normalized or injected, leaving open whether the claimed spatial fidelity is achieved by construction or by learned components.
minor comments (2)
  1. [Abstract] The abstract states 'competitive performance' and 'significantly surpassing' without citing any numerical metrics or table references; this should be revised to point to specific results.
  2. [§3.2] Notation for the AU disentanglement guidance (e.g., the guidance scale and its interaction with the diffusion scheduler) is introduced without an equation or pseudocode, reducing clarity for reproduction.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We appreciate the opportunity to clarify and strengthen our work. Below, we provide point-by-point responses to the major comments.

Point-by-point responses
  1. Referee: §3.1 (AU Generation via ALM): The central claim that the 'emotion-then-AU' CoT produces reliable, disentangled AU sequences from audio alone lacks any quantitative validation against ground-truth AUs extracted from real video (e.g., AU detection F1 or correlation metrics); only downstream video quality is evaluated, so misalignment between generated AUs and actual facial dynamics cannot be ruled out, which directly undermines the 'significantly surpassing' comparison.

    Authors: We agree that direct quantitative validation of the generated AU sequences against ground-truth AUs would strengthen the claims. Our original evaluation emphasized end-to-end video quality metrics because they reflect the practical outcome for talking-head generation. We will add AU-level metrics (F1 scores and correlations with video-detected AUs) in the revised manuscript to address this directly. revision: yes

  2. Referee: §4 (Experiments): No ablation studies isolate the contribution of the ALM stage versus the diffusion stage, nor are error analyses or failure cases for AU prediction provided; without these, it is impossible to determine whether the reported gains in emotional realism stem from accurate AU control or from the diffusion model's general capacity.

    Authors: We acknowledge that explicit ablations isolating the ALM stage and error/failure analyses for AU prediction would help clarify the source of improvements. While baseline comparisons in the original work provide indirect evidence, we will add targeted ablations (e.g., random or ground-truth AU inputs) along with error analysis and representative failure cases in the revised experiments section. revision: yes

  3. Referee: §3.2 (AU-driven Diffusion): The mapping of AU sequences to 2D facial representations and the cross-attention interaction are described at a high level, but the paper does not specify how AU intensity values are normalized or injected, leaving open whether the claimed spatial fidelity is achieved by construction or by learned components.

    Authors: We appreciate the request for greater implementation detail. We will revise §3.2 to explicitly describe the normalization of AU intensity values and the precise mechanism of their injection through cross-attention, including any supporting equations or pseudocode. revision: yes
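For readers who want the missing detail pinned down, here is one plausible shape such an injection could take, sketched editorially rather than taken from the paper: each normalized AU intensity scales a learned AU embedding, and the resulting 24 tokens serve as cross-attention context for the visual latents. The token construction and the single-head attention below are assumptions, not the authors' implementation.

```python
# Editorial illustration (not the authors' implementation) of AU injection:
# intensity-scaled AU embeddings act as key/value tokens that visual latents
# attend over in a single cross-attention layer.
import numpy as np

def au_cross_attention(visual_tokens, au_intensities, au_embed, w_q, w_k, w_v):
    """visual_tokens: (N, d); au_intensities: (24,) in [0, 1]; au_embed: (24, d)."""
    au_tokens = au_intensities[:, None] * au_embed            # intensity-scaled AU tokens
    q, k, v = visual_tokens @ w_q, au_tokens @ w_k, au_tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)              # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)                  # softmax over 24 AU tokens
    return visual_tokens + attn @ v                           # residual update

d = 64
rng = np.random.default_rng(2)
out = au_cross_attention(rng.normal(size=(16, d)), rng.uniform(size=24),
                         rng.normal(size=(24, d)), *rng.normal(size=(3, d, d)))
print(out.shape)  # (16, 64)
```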

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external pre-trained models

full rationale

The paper describes a two-stage pipeline: an audio-language model (ALM) with spatial-temporal tokenization and emotion-then-AU chain-of-thought prompting to generate AU sequences from raw audio, followed by an AU-conditioned diffusion model that maps AUs to 2D facial representations and uses cross-attention plus disentanglement guidance at inference. No equations, fitted parameters, or self-citations are presented that reduce any claimed prediction or result to the method's own inputs by construction. The approach depends on external pre-trained ALMs and standard diffusion training objectives, with performance claims evaluated on benchmark datasets against prior methods. This keeps the central derivation self-contained and independent of self-referential fitting or renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method rests on standard assumptions about pre-trained ALMs and diffusion models being able to generalize to AU prediction and video synthesis; no new free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)
  • domain assumption: Pre-trained audio-language models can accurately predict facial Action Units from speech via chain-of-thought prompting.
    Invoked in the first-stage description without additional training details provided.
  • domain assumption: Mapping AU sequences to 2D facial representations preserves spatial fidelity for diffusion conditioning.
    Stated as part of the second-stage design.

pith-pipeline@v0.9.0 · 5573 in / 1385 out tokens · 36483 ms · 2026-05-16T05:45:49.995161+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 3 internal anchors

  1. [1]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS, pp. 1877–1901.

  2. [3]

    Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models

    Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv preprint arXiv:2311.07919.

  3. [4]

    SpeechVerse: A large-scale generalizable audio language model

    Jiahao Cui, Hui Li, Yao Yao, Hao Zhu, Hanlin Shang, Kaihui Cheng, Hang Zhou, Siyu Zhu, and Jingdong Wang. Hallo2: Long-duration and high-resolution audio-driven portrait image animation, 2024a. Jiahao Cui, Hui Li, Yun Zhan, Hanlin Shang, Kaihui Cheng, Yuqi Ma, Shan Mu, Hang Zhou, Jingdong Wang, and Siyu Zh...

  4. [5]

    Emoportraits: Emotion-enhanced multimodal one-shot head avatars

    Nikita Drobyshev, Antoni Bigata Casademunt, Konstantinos Vougioukas, Zoe Landgraf, Stavros Petridis, and Maja Pantic. Emoportraits: Emotion-enhanced multimodal one-shot head avatars. In CVPR, pp. 8498–8507.

  5. [6]

    Wan-s2v: Audio-driven cinematic video generation

    Xin Gao, Li Hu, Siqi Hu, Mingyang Huang, Chaonan Ji, Dechao Meng, Jinwei Qi, Penchong Qiao, Zhen Shen, Yafei Song, et al. Wan-s2v: Audio-driven cinematic video generation. arXiv preprint arXiv:2508.18621, 2025.

  6. [7]

    Omnihuman-1.5: Instilling an active mind in avatars via cognitive simulation

    Jianwen Jiang, Weihong Zeng, Zerong Zheng, Jiaqi Yang, Chao Liang, Wang Liao, Han Liang, Yuan Zhang, and Mingyuan Gao. Omnihuman-1.5: Instilling an active mind in avatars via cognitive simulation. arXiv preprint arXiv:2508.19209.

  7. [8]

    Let them talk: Audio-driven multi-person conversational video generation

    Zhe Kong, Feng Gao, Yong Zhang, Zhuoliang Kang, Xiaoming Wei, Xunliang Cai, Guanying Chen, and Wenhan Luo. Let them talk: Audio-driven multi-person conversational video generation. arXiv preprint arXiv:2505.22647,

  8. [9]

    Latentsync: Taming audio-conditioned latent diffusion models for lip sync with syncnet supervision

    Chunyu Li, Chao Zhang, Weikai Xu, Jingyu Lin, Jinghui Xie, Weiguo Feng, Bingyue Peng, Cunjian Chen, and Weiwei Xing. Latentsync: Taming audio-conditioned latent diffusion models for lip sync with syncnet supervision. arXiv preprint arXiv:2412.09262, 2024.

  9. [10]

    Takin-ada: Emotion controllable audio-driven animation with canonical and landmark loss optimization

    Bin Lin, Yanzhen Yu, Jianhao Ye, Ruitao Lv, Yuguang Yang, Ruoye Xie, Pan Yu, and Hongbin Zhou. Takin-ada: Emotion controllable audio-driven animation with canonical and landmark loss optimization. arXiv preprint arXiv:2410.14283.

  10. [11]

    Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis

    Haiyang Liu, Zihao Zhu, Naoya Iwamoto, Yichen Peng, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis. arXiv preprint arXiv:2203.05297.

  11. [12]

    Javisgpt: A unified multi-modal llm for sounding-video comprehension and generation

    Kai Liu, Jungang Li, Yuchong Sun, Shengqiong Wu, Jianzhang Gao, Daoan Zhang, Wei Zhang, Sheng Jin, Sicheng Yu, Geng Zhan, et al. Javisgpt: A unified multi-modal llm for sounding-video comprehension and generation. arXiv preprint arXiv:2512.22905, 2025a. Kai Liu, Wei Li, Lai Chen, Shengqiong Wu, Yanhao Zheng, Jiayi Ji, Fan Zhou, Rongxin Jiang, Jiebo Luo, Ha...

  12. [13]

    Multimodal emotional talking face generation based on action units

    Jiayi Lyu, Xing Lan, Guohong Hu, Hanyu Jiang, Wei Gan, Jinbao Wang, and Jian Xue. Multimodal emotional talking face generation based on action units. IEEE Transactions on Circuits and Systems for Video Technology, 35(5):4026–4038.

  13. [14]

    Playmate: Flexible control of portrait animation via 3d-implicit space guided diffusion

    Xingpei Ma, Jiaran Cai, Yuansheng Guan, Shenneng Huang, Qiang Zhang, and Shunsi Zhang. Playmate: Flexible control of portrait animation via 3d-implicit space guided diffusion. arXiv preprint arXiv:2502.07203.

  14. [15]

    Portraittalk: Towards customizable one-shot audio-to-talking face generation

    Fatemeh Nazarieh, Zhenhua Feng, Diptesh Kanojia, Muhammad Awais, and Josef Kittler. Portraittalk: Towards customizable one-shot audio-to-talking face generation. arXiv preprint arXiv:2412.07754.

  15. [16]

    A lip sync expert is all you need for speech to lip generation in the wild

    K R Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, and C.V. Jawahar. A lip sync expert is all you need for speech to lip generation in the wild. In ACM MM.

  16. [17]

    Vincie: Unlocking in-context image editing from video

    Leigang Qu, Feng Cheng, Ziyan Yang, Qi Zhao, Shanchuan Lin, Yichun Shi, Yicong Li, Wenjie Wang, Tat-Seng Chua, and Lu Jiang. Vincie: Unlocking in-context image editing from video. arXiv preprint arXiv:2506.10941, 2025a.

  17. [18]

    Imagdressing-v1: Customizable virtual dressing

    Fei Shen, Xin Jiang, Xin He, Hu Ye, Cong Wang, Xiaoyu Du, Zechao Li, and Jinhui Tang. Imagdressing-v1: Customizable virtual dressing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 6795–6804, 2025a.

  18. [19]

    Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

    Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, et al. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111.

  19. [20]

    Audio2head: Audio-driven one-shot talking-head generation with natural head motion

    Suzhen Wang, Lincheng Li, Yu Ding, Changjie Fan, and Xin Yu. Audio2head: Audio-driven one-shot talking-head generation with natural head motion. arXiv preprint arXiv:2107.09293.

  20. [21]

    Latent image animator: Learning to animate images via latent space navigation

    Yaohui Wang, Di Yang, Francois Bremond, and Antitza Dantcheva. Latent image animator: Learning to animate images via latent space navigation. arXiv preprint arXiv:2203.09043.

  21. [22]

    Animatable 3d-aware face image generation for video avatars

    Y Wu, Y Deng, J Yang, F Wei, Q Chen, and X Tong. Animatable 3d-aware face image generation for video avatars. arXiv preprint arXiv:2210.06465.

  22. [23]

    If-mdm: Implicit face motion diffusion model for high-fidelity realtime talking head generation

    Sejong Yang, Seoung Wug Oh, Yang Zhou, and Seon Joo Kim. If-mdm: Implicit face motion diffusion model for high-fidelity realtime talking head generation. arXiv preprint arXiv:2412.04000.

  23. [24]

    Magicinfinite: Generating infinite talking videos with your words and voice

    Hongwei Yi, Tian Ye, Shitong Shao, Xuancheng Yang, Jiantong Zhao, Hanzhong Guo, Terrance Wang, Qingyu Yin, Zeke Xie, Lei Zhu, et al. Magicinfinite: Generating infinite talking videos with your words and voice. arXiv preprint arXiv:2503.05978.

  24. [25]

    Multimodal Diffusion Transformer with Memory Bank for Scalable Long-Duration Talking Video Generation

    Haojie Zhang, Zhihao Liang, Ruibo Fu, Bingyan Liu, Zhengqi Wen, Xuefei Liu, Jianhua Tao, and Yaling Liang. Efficient long-duration talking video synthesis with linear diffusion transformer under multimodal guidance. arXiv preprint arXiv:2411.16748.

  25. [26]

    Memo: Memory-guided diffusion for expressive talking video generation

    Longtao Zheng, Yifan Zhang, Hanzhong Guo, Jiachun Pan, Zhenxiong Tan, Jiahao Lu, Chuanxin Tang, Bo An, and Shuicheng Yan. Memo: Memory-guided diffusion for expressive talking video generation.

  26. [27]

    Pose-controllable talking face generation by implicitly modularized audio-visual representation

    Hang Zhou, Yasheng Sun, Wayne Wu, Chen Change Loy, Xiaogang Wang, and Ziwei Liu. Pose-controllable talking face generation by implicitly modularized audio-visual representation. In CVPR, pp. 4176–4186.

  27. [28]

    FACS provides a standardized framework for describing facial muscle movements

    Appendix A, A Detailed Introduction to Facial Action Unit: Facial Action Units (AUs) are defined in the Facial Action Coding System (FACS), which was originally developed by Ekman & Friesen (1978). FACS provides a standardized framework for describing facial muscle movements. It decomposes facial expressions into 44 individu...

  28. [29]

    messages

    In FEAFA (Yan et al., 2019; Gan et al., 2022), each AU is annotated with a continuous value ranging from 0 to 1, where 0 indicates no activation and 1 indicates maximum activation. As shown in Fig. 8, each AU is visually illustrated across different intensity levels, providing clear examples of activation changes across frames. Figure 8: Visual examples...