FluentAvatar: Flicker-Free Talking-Head Animation via Phoneme-Guided Autoregressive Modeling

arxiv: 2509.12052 · v3 · submitted 2025-09-15 · 💻 cs.CV


Yuchen Deng, Xiuyang Wu, Hai-Tao Zheng, Suiyang Zhang, Yi He, Yuxing Han

Pith reviewed 2026-05-18 16:38 UTC · model grok-4.3

classification 💻 cs.CV

keywords talking-head animation · autoregressive generation · phoneme guidance · temporal consistency · flicker reduction · diffusion models · video generation · state space models


The pith

FluentAvatar uses phoneme-guided autoregressive modeling to generate flicker-free talking-head videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current diffusion-based methods for talking-head animation suffer from inter-frame flicker: stochastic initialization sends each sample down a different denoising trajectory, and the residual inconsistencies show up as abrupt visual jumps between adjacent frames. The paper supports this with a controlled study that fixes the input and varies only the random seed, finding a mean inter-seed Pearson correlation of only r = 0.15 between flicker patterns. To address it, the authors propose an autoregressive system that generates each frame conditioned on the previous ones, guided by phoneme sequences for natural mouth movements and timing. Sequential generation gives a more direct prior for temporal continuity than parallel diffusion sampling. A new metric, BG-Flicker, isolates the background so that inter-frame flicker can be measured more reliably.

Core claim

FluentAvatar is a two-stage autoregressive framework built on phoneme representations. First, Facial Keyframe Generation produces phoneme-aligned keyframes under a Phoneme-Frame Causal Attention Mask. Then, Inter-frame Interpolation synthesizes transition frames via a timestamp-aware adaptive strategy built upon selective state space modeling. Experiments show it attains the best FVD on both CMLR and HDTF datasets with BG-Flicker results close to ground truth while maintaining strong visual fidelity, lip synchronization, and temporal stability.

What carries the argument

Phoneme-guided autoregressive modeling with a Phoneme-Frame Causal Attention Mask for keyframe generation and selective state space modeling for inter-frame interpolation.
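
The page does not give the exact formulation of the Phoneme-Frame Causal Attention Mask, so the following is only a minimal sketch of one plausible reading. It assumes a hypothetical token layout (all phoneme tokens first, then the tokens of one keyframe per phoneme) and a hypothetical rule: the tokens of keyframe k may attend to phonemes 0..k and to every earlier frame token. The function name and both parameters are illustrative, not taken from the paper.

```python
import numpy as np

def phoneme_frame_causal_mask(num_phonemes: int, tokens_per_frame: int) -> np.ndarray:
    """Boolean attention mask; True means the query (row) may attend to the key (column).

    Assumed layout (hypothetical, not from the paper):
        [p_0 ... p_{P-1} | keyframe_0 tokens | ... | keyframe_{P-1} tokens]
    with exactly one keyframe per phoneme.
    """
    P, K = num_phonemes, tokens_per_frame
    n = P + P * K
    mask = np.zeros((n, n), dtype=bool)

    # Phoneme tokens attend causally to phoneme tokens only.
    mask[:P, :P] = np.tril(np.ones((P, P), dtype=bool))

    # Frame tokens attend causally to all earlier frame tokens (and themselves).
    mask[P:, P:] = np.tril(np.ones((P * K, P * K), dtype=bool))

    # Frame tokens of keyframe k attend to phonemes 0..k, never to later phonemes.
    keyframe_of_token = np.arange(P * K) // K
    mask[P:, :P] = keyframe_of_token[:, None] >= np.arange(P)[None, :]
    return mask
```

With `num_phonemes=3` and `tokens_per_frame=2` the mask is 9 × 9; the first token of keyframe 0 sees phoneme 0 but not phoneme 1, and no phoneme token sees any frame token.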

If this is right

Attains the best Fréchet Video Distance on CMLR and HDTF datasets.
Produces BG-Flicker scores close to those of real ground-truth videos.
Delivers strong visual fidelity, accurate lip synchronization, and improved temporal stability.
Introduces BG-Flicker as a more reliable metric for evaluating inter-frame flicker in talking-head videos.
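
The page describes BG-Flicker only as a background-isolated flicker metric, without a formula. As a rough illustration of the idea, the sketch below measures the mean absolute change between adjacent frames over background pixels only. The function name, the fixed boolean background mask, and the choice of mean absolute difference are all assumptions made for illustration, not the paper's definition.

```python
import numpy as np

def background_flicker(frames: np.ndarray, background: np.ndarray) -> float:
    """Mean absolute adjacent-frame change restricted to background pixels.

    frames:     (T, H, W) or (T, H, W, C) array of pixel intensities.
    background: (H, W) boolean array, True where the pixel is background
                (assumed to be supplied by some external segmentation step).
    """
    if frames.shape[0] < 2:
        raise ValueError("need at least two frames")
    if not background.any():
        raise ValueError("background mask is empty")
    frames = frames.astype(np.float64)
    diffs = np.abs(np.diff(frames, axis=0))  # (T-1, H, W[, C])
    bg_diffs = diffs[:, background]          # keep background pixels only
    return float(bg_diffs.mean())
```

Under this definition a video whose background is perfectly static scores 0 regardless of how much the face region moves, which is the property a background-isolated metric is meant to have.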

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This sequential approach might allow for incremental generation suitable for live streaming applications.
The phoneme guidance could be adapted to other conditional video tasks like gesture or expression control.
Adopting similar autoregressive priors might reduce artifacts in broader diffusion-based video synthesis beyond talking heads.

Load-bearing premise

That the primary cause of inter-frame flicker is the variation in denoising trajectories from stochastic initialization in diffusion models.

What would settle it

Generating multiple samples with the autoregressive model on the same fixed input sequence and finding markedly different flicker patterns across runs with low Pearson correlation, as observed in the diffusion baseline.
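
A test of this kind can be sketched as follows. The code assumes a simple flicker signal (mean absolute pixel change between adjacent frames) and averages the Pearson correlation over all pairs of runs; the paper's exact flicker measure is not given on this page, so both function names and the flicker definition are illustrative assumptions.

```python
import itertools
from typing import Sequence

import numpy as np

def flicker_curve(frames: np.ndarray) -> np.ndarray:
    """Per-transition flicker: mean absolute change between adjacent frames.

    frames: (T, H, W) or (T, H, W, C); returns an array of length T - 1.
    """
    frames = frames.astype(np.float64)
    diffs = np.abs(np.diff(frames, axis=0))
    return diffs.reshape(diffs.shape[0], -1).mean(axis=1)

def mean_inter_run_pearson(runs: Sequence[np.ndarray]) -> float:
    """Average Pearson r between the flicker curves of every pair of runs."""
    curves = [flicker_curve(video) for video in runs]
    rs = [np.corrcoef(a, b)[0, 1] for a, b in itertools.combinations(curves, 2)]
    return float(np.mean(rs))
```

Each element of `runs` would be one generated video for the same fixed input. A value near 1 means the runs flicker at the same moments; a value near the reported 0.15 means the flicker pattern is largely run-specific.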

Figures

Figures reproduced from arXiv: 2509.12052 by Hai-Tao Zheng, Suiyang Zhang, Xiuyang Wu, Yi He, Yuchen Deng, Yuxing Han.

**Figure 2.** Inter-frame Flicker Visualization. Left: reference frame; subsequent panels show pixel-wise differ…

**Figure 3.** The overall framework of AvatarSync. The pipeline first normalizes text/audio into a compact…

**Figure 4.** Generation Time Comparison. AvatarSync scales nearly linearly with phoneme count, while others…

**Figure 5.** Qualitative comparison on the CMLR and HDTF dataset. (a) Top: ground-truth frames. Middle:…

**Figure 6.** Loss comparison with and without PRL.

**Figure 7.** Original Video Frames from the Dataset.

**Figure 8.** Enhanced Video Frames after Super-Resolution.

**Figure 9.** Visual comparison of face preprocessing methods.

| Subset | Face-Centric Cropping: ArcFace | FaceNet | FaceNet512 | VGG-Face | Landmark-Based Cropping: ArcFace | FaceNet | FaceNet512 | VGG-Face |
|---|---|---|---|---|---|---|---|---|
| s1 | 0.2958 | 0.1608 | 0.1931 | 0.2899 | 0.3250 | 0.2175 | 0.2360 | 0.3399 |
| s2 | 0.2189 | 0.1672 | 0.1278 | 0.2885 | 0.2077 | 0.1886 | 0.1011 | 0.2236 |
| s3 | 0.2576 | 0.1715 | 0.1079 | 0.2899 | 0.2873 | 0.1784 | 0.0752 | 0.2012 |
| s4 | 0.3698 | 0.3415 | 0.2198 | 0.3643 | 0.3628 | 0.2822 | 0.1922 | … |

**Figure 10.** Training loss curves on the mixed dataset (CMLR + HDTF). The plots illustrate the convergence…

read the original abstract

Current talking-head generation has gradually shifted from GAN-based methods to diffusion-based paradigms, achieving remarkable progress in visual fidelity and temporal consistency. However, inter-frame flicker remains prevalent in existing diffusion-based methods. An important reason is that denoising trajectory variation induced by stochastic initialization leaves residual inter-frame inconsistencies, which manifest as short-term, abrupt visual fluctuations between adjacent frames. To further verify this, we conduct a controlled study by fixing the input while varying only the random seed. The results show markedly different flicker patterns across samplings, with a mean inter-seed Pearson correlation of only r = 0.15. This motivates us to explore autoregressive generation, which models frames sequentially and provides a more direct prior for temporal continuity. Based on this, we propose FluentAvatar, a two-stage autoregressive framework built on phoneme representations. First, Facial Keyframe Generation produces phoneme-aligned keyframes under a Phoneme-Frame Causal Attention Mask, and Inter-frame Interpolation synthesizes transition frames via a timestamp-aware adaptive strategy built upon selective state space modeling. Moreover, we introduce BG-Flicker, a background-isolated metric for talking-head videos that enables more reliable evaluation of inter-frame flicker. Experiments on CMLR and HDTF demonstrate that FluentAvatar achieves strong performance in visual fidelity, lip synchronization, and temporal stability, attaining the best FVD on both datasets and BG-Flicker results close to ground truth. The code, the model, and the interface will be released to facilitate further research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

FluentAvatar's autoregressive phoneme-guided pipeline delivers competitive FVD and near-GT flicker scores on talking-head data, but the seed-variation study leaves the causal link to diffusion flicker unproven.

read the letter

The main takeaway is that this two-stage autoregressive model with phoneme-frame causal masking and selective state-space interpolation produces talking-head videos with strong temporal stability and the best reported FVD on CMLR and HDTF. The BG-Flicker metric is a straightforward addition that isolates background changes for more reliable flicker measurement. Experiments show results close to ground truth on that metric while maintaining lip sync and visual quality, which is useful for practical avatar work.

Referee Report

1 major / 3 minor

Summary. The paper claims that inter-frame flicker in diffusion-based talking-head generation arises primarily from denoising trajectory variations due to stochastic initialization, supported by a controlled study showing low mean inter-seed Pearson correlation (r=0.15). It proposes FluentAvatar, a two-stage phoneme-guided autoregressive framework: (1) Facial Keyframe Generation using a Phoneme-Frame Causal Attention Mask to produce aligned keyframes, and (2) Inter-frame Interpolation via timestamp-aware selective state space modeling. On CMLR and HDTF datasets, it reports the best FVD scores, BG-Flicker values close to ground truth, and strong results in visual fidelity, lip synchronization, and temporal stability, with code and model to be released.

Significance. If the results hold, this represents a useful contribution by providing empirical evidence that autoregressive modeling with phoneme guidance can improve temporal stability over diffusion baselines in talking-head animation. The BG-Flicker metric is a practical addition for isolating background flicker evaluation. Explicit credit is due for the planned release of code, model, and interface, which supports reproducibility, and for grounding comparisons against diffusion baselines.

major comments (1)

[Motivation and controlled study] Controlled study (motivation section): The seed-variation experiment with r=0.15 demonstrates flicker inconsistency but does not isolate stochastic initialization from other diffusion factors such as noise schedule or U-Net biases; without such controls, the direct link to preferring autoregressive modeling over diffusion remains partially unproven, though the final quantitative results provide some external grounding.

minor comments (3)

[Method] The exact formulation of the Phoneme-Frame Causal Attention Mask and the timestamp-aware adaptive strategy in the selective state space module should include explicit equations or pseudocode for clarity and to allow verification of the claimed temporal continuity prior.
[Experiments] Hyperparameter choices, exact data splits for CMLR and HDTF, and training details are referenced but would benefit from a dedicated table or appendix subsection to facilitate reproduction of the reported FVD and BG-Flicker numbers.
[Figures] Figure captions for qualitative results should explicitly note the datasets and baselines shown to avoid ambiguity when comparing flicker patterns.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and recommendation for minor revision. We address the single major comment below.

read point-by-point responses

Referee: Controlled study (motivation section): The seed-variation experiment with r=0.15 demonstrates flicker inconsistency but does not isolate stochastic initialization from other diffusion factors such as noise schedule or U-Net biases; without such controls, the direct link to preferring autoregressive modeling over diffusion remains partially unproven, though the final quantitative results provide some external grounding.

Authors: We thank the referee for this observation. In the controlled study, we fix the input condition and vary only the random seed while keeping the noise schedule, U-Net weights, and all other diffusion hyperparameters constant. This isolates the contribution of stochastic initialization to the observed inter-seed variation in flicker patterns (mean Pearson r = 0.15). We agree that a broader set of ablations could further strengthen the motivation; accordingly, we will add a clarifying paragraph in the revised motivation section that explicitly states the controlled variables and notes that the empirical superiority of FluentAvatar over diffusion baselines on FVD and BG-Flicker provides complementary support for the autoregressive design choice. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical motivation and new architecture are externally validated

full rationale

The paper's chain begins with an empirical observation (low inter-seed correlation r=0.15 in a controlled diffusion study) used to motivate a modeling shift to autoregressive generation with phoneme guidance. It then defines a two-stage architecture (Phoneme-Frame Causal Attention Mask for keyframes + selective state-space interpolation) and evaluates it via standard metrics (FVD) plus a new BG-Flicker metric on CMLR/HDTF datasets against diffusion baselines. No step reduces a claimed prediction or first-principles result to a fitted parameter, self-definition, or self-citation chain; the central claims rest on experimental comparisons that are independent of the model's internal construction. This is the common case of a self-contained empirical proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the approach rests on standard components (causal attention, selective state space models) whose assumptions are inherited from prior literature rather than newly postulated here.

pith-pipeline@v0.9.0 · 5819 in / 1155 out tokens · 54599 ms · 2026-05-18T16:38:08.093620+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: Autoregressive models generate video frames as a single and unified token sequence... $P\left(x^{(t)}_{j} \mid x^{(1)}_{1}, \ldots, x^{(t-1)}_{K}, x^{(t)}_{1}, \ldots, x^{(t)}_{j-1}, c\right)$

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat induction and embed_strictMono · unclear

Paper passage: Phoneme-Frame Causal Attention Mask... timestamp-aware adaptive strategy built upon selective state space modeling

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
cs.LG 2026-05 unverdicted novelty 7.0

AsymTalker maintains identity consistency in long-term diffusion talking-head videos by encoding temporal references from a static image and training a student model under inference-like conditions via asymmetric dist...

AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
cs.LG 2026-05 unverdicted novelty 6.0

AsymK-Talker introduces kernel-conditioned loop generation, temporal reference encoding, and asymmetric kernel distillation to achieve real-time, drift-resistant talking head synthesis from audio using diffusion models.

AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
cs.LG 2026-05 unverdicted novelty 6.0

AsymTalker uses temporal reference encoding and asymmetric knowledge distillation to produce identity-consistent talking head videos up to 600 seconds long at 66 FPS.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · cited by 1 Pith paper · 9 internal anchors

[1] GPT-4 Technical Report
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.

[2] Videoretalking: Audio-based lip synchronization for talking head video editing in the wild
Kun Cheng, Xiaodong Cun, Yong Zhang, Menghan Xia, Fei Yin, Mingrui Zhu, Xuan Wang, Jue Wang, and Nannan Wang. Videoretalking: Audio-based lip synchronization for talking head video editing in the wild. In SIGGRAPH Asia 2022 Conference Papers, pp. 1–9.

[3] Artalk: Speech-driven 3d head animation via autoregressive model
Xuangeng Chu, Nabarun Goswami, Ziteng Cui, Hanqin Wang, and Tatsuya Harada. Artalk: Speech-driven 3d head animation via autoregressive model. arXiv preprint arXiv:2502.20323.

[4] Hallo2: Long-duration and high-resolution audio-driven portrait image animation
Jiahao Cui, Hui Li, Yao Yao, Hao Zhu, Hanlin Shang, Kaihui Cheng, Hang Zhou, Siyu Zhu, and Jingdong Wang. Hallo2: Long-duration and high-resolution audio-driven portrait image animation. arXiv preprint arXiv:2410.07718.

[5] Speech-driven facial animation using cascaded gans for learning of motion and texture
Dipanjan Das, Sandika Biswas, Sanjana Sinha, and Brojeshwar Bhowmick. Speech-driven facial animation using cascaded gans for learning of motion and texture. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, pp. 408–424. Springer.

[6] Vimi: Grounding video generation through multi-modal instruction
Yuwei Fang, Willi Menapace, Aliaksandr Siarohin, Tsai-Shien Chen, Kuan-Chien Wang, Ivan Skorokhodov, Graham Neubig, and Sergey Tulyakov. Vimi: Grounding video generation through multi-modal instruction. arXiv preprint arXiv:2407.06304.

[7] Break the sequential dependency of llm inference using lookahead decoding
Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. Break the sequential dependency of llm inference using lookahead decoding. arXiv preprint arXiv:2402.02057, 2024.

[8] Liveportrait: Efficient portrait animation with stitching and retargeting control
Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, and Di Zhang. Liveportrait: Efficient portrait animation with stitching and retargeting control. arXiv preprint arXiv:2407.03168.

[9] Zipar: Accelerating autoregressive image generation through spatial locality
Yefei He, Feng Chen, Yuanyu He, Shaoxuan He, Hong Zhou, Kaipeng Zhang, and Bohan Zhuang. Zipar: Accelerating autoregressive image generation through spatial locality. arXiv preprint arXiv:2412.04062.

[10] CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868.
[11] Sonic: Shifting focus to global audio perception in portrait animation
Xiaozhong Ji, Xiaobin Hu, Zhihong Xu, Junwei Zhu, Chuming Lin, Qingdong He, Jiangning Zhang, Donghao Luo, Yi Chen, Qin Lin, et al. Sonic: Shifting focus to global audio perception in portrait animation. arXiv preprint arXiv:2411.16331.

[12] Loopy: Taming audio-driven portrait avatar with long-term motion dependency
Jianwen Jiang, Chao Liang, Jiaqi Yang, Gaojie Lin, Tianyun Zhong, and Yanbo Zheng. Loopy: Taming audio-driven portrait avatar with long-term motion dependency. arXiv preprint arXiv:2409.02634.

[13] An image speaks a thousand words, but can everyone listen? on image transcreation for cultural relevance
Simran Khanuja, Sathyanarayanan Ramamoorthy, Yueqi Song, and Graham Neubig. An image speaks a thousand words, but can everyone listen? on image transcreation for cultural relevance. arXiv preprint arXiv:2404.01247.

[14] VideoPoet: A Large Language Model for Zero-Shot Video Generation
Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125.

[15] Latentsync: Audio conditioned latent diffusion models for lip sync
Chunyu Li, Chao Zhang, Weikai Xu, Jinghui Xie, Weiguo Feng, Bingyue Peng, and Weiwei Xing. Latentsync: Audio conditioned latent diffusion models for lip sync. arXiv preprint arXiv:2412.09262.

[16] Omnihuman-1: Rethinking the scaling-up of one-stage conditioned human animation models
Gaojie Lin, Jianwen Jiang, Jiaqi Yang, Zerong Zheng, and Chao Liang. Omnihuman-1: Rethinking the scaling-up of one-stage conditioned human animation models. arXiv preprint arXiv:2502.01061.

[17] Open-magvit2: An open-source project toward democratizing auto-regressive visual generation
Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, and Ying Shan. Open-magvit2: An open-source project toward democratizing auto-regressive visual generation. arXiv preprint arXiv:2409.04410.

[18] Echomimicv2: Towards striking, simplified, and semi-body human animation
Rang Meng, Xingyu Zhang, Yuming Li, and Chenguang Ma. Echomimicv2: Towards striking, simplified, and semi-body human animation. arXiv preprint arXiv:2411.10061.

[19] Vipe: Visualise pretty-much everything
Hassan Shahmohammadi, Adhiraj Ghosh, and Hendrik Lensch. Vipe: Visualise pretty-much everything. arXiv preprint arXiv:2310.10543.

[20] Denoising Diffusion Implicit Models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502.
[21] Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525.

[22] LLaMA: Open and Efficient Foundation Language Models
K Tian, Y Jiang, Z Yuan, et al. Visual autoregressive modeling: Scalable image generation via next-scale prediction. Advances in neural information processing systems, 37:84839–84865, 2024a. Linrui Tian, Qi Wang, Bang Zhang, and Liefeng Bo. Emo: Emote portrait alive generating expressive portrait videos with audio2video diffusion model under weak condit...

[23] V-express: Conditional dropout for progressive training of portrait video generation
Cong Wang, Kuan Tian, Jun Zhang, Yonghang Guan, Feng Luo, Fei Shen, Zhiwei Jiang, Qing Gu, Xiao Han, and Wei Yang. V-express: Conditional dropout for progressive training of portrait video generation. arXiv preprint arXiv:2406.02511, 2024a. Jiadong Wang, Xinyuan Qian, Malu Zhang, Robby T Tan, and Haizhou Li. Seeing what you said: Talking face generation gu...

[24] Emu3: Next-Token Prediction is All You Need
Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024b. Yuchi Wang, Junliang Guo, Jianhong Bai, Runyi Yu, Tianyu He, Xu Tan, Xu Sun, and Jiang Bian. Instructavatar: Text-guided emotion and motion...

[25] Aniportrait: Audio-driven synthesis of photorealistic portrait animation
Huawei Wei, Zejun Yang, and Zhisheng Wang. Aniportrait: Audio-driven synthesis of photorealistic portrait animation. arXiv preprint arXiv:2403.17694.

[26] Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528.

[27] Hallo: Hierarchical audio-driven visual synthesis for portrait image animation
Mingwang Xu, Hui Li, Qingkun Su, Hanlin Shang, Liwei Zhang, Ce Liu, Jingdong Wang, Yao Yao, and Siyu Zhu. Hallo: Hierarchical audio-driven visual synthesis for portrait image animation. arXiv preprint arXiv:2406.08801, 2024a. Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, and Baining Guo. Vasa-1: Lifel...

[28] Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, et al. Language model beats diffusion–tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737.

[29] Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation
Wenxuan Zhang, Xiaodong Cun, Xuan Wang, Yong Zhang, Xi Shen, Yu Guo, Ying Shan, and Fei Wang. Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8652–8661, 2023a. Yue Zhang, Minhao Liu, Zhaokang Chen, Bi...

[30] Figure 7: Original Video Frames from the Dataset
Furthermore, to foster future research and benefit the community, we will open-source this enhanced, high-resolution version of the CMLR dataset. Figure 7: Original Video Frames from the Dataset. Figure 8: Enhanced Video Frames after Super-Resolution. A.2 TRAINING DETAILS We trained the model on a mixed dataset that combines the super-resolved CMLR dataset...

[31] We then incrementally incorporate our other proposed loss terms: the pixel-level LPIPS perceptual loss, identity consistency loss, and facial similarity loss
In this study, we establish a baseline model trained exclusively with a token-level cross-entropy (CE) loss. We then incrementally incorporate our other proposed loss terms: the pixel-level LPIPS perceptual loss, identity consistency loss, and facial similarity loss. The experimental results clearly demonstrate that while each loss component individuall...

[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Ale- man, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Videoretalking: Audio-based lip synchronization for talking head video editing in the wild

Kun Cheng, Xiaodong Cun, Yong Zhang, Menghan Xia, Fei Yin, Mingrui Zhu, Xuan Wang, Jue Wang, and Nannan Wang. Videoretalking: Audio-based lip synchronization for talking head video editing in the wild. InSIGGRAPH Asia 2022 Conference Papers, pp. 1–9,

work page 2022

[3] [3]

Artalk: Speech- driven 3d head animation via autoregressive model.arXiv preprint arXiv:2502.20323,

Xuangeng Chu, Nabarun Goswami, Ziteng Cui, Hanqin Wang, and Tatsuya Harada. Artalk: Speech- driven 3d head animation via autoregressive model.arXiv preprint arXiv:2502.20323,

work page arXiv

[4] [4]

Hallo2: Long-duration and high-resolution audio-driven portrait image anima- tion.arXiv preprint arXiv:2410.07718,

Jiahao Cui, Hui Li, Yao Yao, Hao Zhu, Hanlin Shang, Kaihui Cheng, Hang Zhou, Siyu Zhu, and Jingdong Wang. Hallo2: Long-duration and high-resolution audio-driven portrait image anima- tion.arXiv preprint arXiv:2410.07718,

work page arXiv

[5] [5]

Speech-driven facial animation using cascaded gans for learning of motion and texture

Dipanjan Das, Sandika Biswas, Sanjana Sinha, and Brojeshwar Bhowmick. Speech-driven facial animation using cascaded gans for learning of motion and texture. InComputer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, pp. 408–424. Springer,

work page 2020

[6] [6]

Vimi: Grounding video generation through multi-modal instruction.arXiv preprint arXiv:2407.06304,

Yuwei Fang, Willi Menapace, Aliaksandr Siarohin, Tsai-Shien Chen, Kuan-Chien Wang, Ivan Sko- rokhodov, Graham Neubig, and Sergey Tulyakov. Vimi: Grounding video generation through multi-modal instruction.arXiv preprint arXiv:2407.06304,

work page arXiv

[7] [7]

Break the sequential dependency of llm inference using lookahead decoding.arXiv preprint arXiv:2402.02057, 2024

Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. Break the sequential dependency of llm infer- ence using lookahead decoding.arXiv preprint arXiv:2402.02057,

work page arXiv

[8] [8]

Liveportrait: Efficient portrait animation with stitching and retargeting control.arXiv preprint arXiv:2407.03168,

Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, and Di Zhang. Liveportrait: Efficient portrait animation with stitching and retargeting control.arXiv preprint arXiv:2407.03168,

work page arXiv

[9] [9]

Zipar: Accelerating autoregressive image generation through spatial locality.arXiv preprint arXiv:2412.04062,

Yefei He, Feng Chen, Yuanyu He, Shaoxuan He, Hong Zhou, Kaipeng Zhang, and Bohan Zhuang. Zipar: Accelerating autoregressive image generation through spatial locality.arXiv preprint arXiv:2412.04062,

work page arXiv

[10] [10]

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pre- training for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Sonic: Shifting focus to global audio perception in portrait animation.arXiv preprint arXiv:2411.16331,

Xiaozhong Ji, Xiaobin Hu, Zhihong Xu, Junwei Zhu, Chuming Lin, Qingdong He, Jiangning Zhang, Donghao Luo, Yi Chen, Qin Lin, et al. Sonic: Shifting focus to global audio perception in portrait animation.arXiv preprint arXiv:2411.16331,

work page arXiv

[12] [12]

Limitations

Jianwen Jiang, Chao Liang, Jiaqi Yang, Gaojie Lin, Tianyun Zhong, and Yanbo Zheng. Loopy: Taming audio-driven portrait avatar with long-term motion dependency.arXiv preprint arXiv:2409.02634,

work page arXiv

[13] [13]

An image speaks a thousand words, but can everyone listen? on image transcreation for cultural relevance

Simran Khanuja, Sathyanarayanan Ramamoorthy, Yueqi Song, and Graham Neubig. An image speaks a thousand words, but can everyone listen? on image transcreation for cultural relevance. arXiv preprint arXiv:2404.01247,

[14] Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125,

[15] Chunyu Li, Chao Zhang, Weikai Xu, Jinghui Xie, Weiguo Feng, Bingyue Peng, and Weiwei Xing. Latentsync: Audio conditioned latent diffusion models for lip sync. arXiv preprint arXiv:2412.09262,

[16] Gaojie Lin, Jianwen Jiang, Jiaqi Yang, Zerong Zheng, and Chao Liang. Omnihuman-1: Rethinking the scaling-up of one-stage conditioned human animation models. arXiv preprint arXiv:2502.01061,

[17] Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, and Ying Shan. Open-magvit2: An open-source project toward democratizing auto-regressive visual generation. arXiv preprint arXiv:2409.04410,

[18] Rang Meng, Xingyu Zhang, Yuming Li, and Chenguang Ma. Echomimicv2: Towards striking, simplified, and semi-body human animation. arXiv preprint arXiv:2411.10061,

[19] Hassan Shahmohammadi, Adhiraj Ghosh, and Hendrik Lensch. Vipe: Visualise pretty-much everything. arXiv preprint arXiv:2310.10543,

[20] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502,

[21] Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525,

[22] LLaMA: Open and Efficient Foundation Language Models

K Tian, Y Jiang, Z Yuan, et al. Visual autoregressive modeling: Scalable image generation via next-scale prediction. Advances in neural information processing systems, 37:84839–84865, 2024a.

Linrui Tian, Qi Wang, Bang Zhang, and Liefeng Bo. Emo: Emote portrait alive generating expressive portrait videos with audio2video diffusion model under weak condit...

[23] Cong Wang, Kuan Tian, Jun Zhang, Yonghang Guan, Feng Luo, Fei Shen, Zhiwei Jiang, Qing Gu, Xiao Han, and Wei Yang. V-express: Conditional dropout for progressive training of portrait video generation. arXiv preprint arXiv:2406.02511, 2024a.

Jiadong Wang, Xinyuan Qian, Malu Zhang, Robby T Tan, and Haizhou Li. Seeing what you said: Talking face generation gu...

[24] Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024b.

Yuchi Wang, Junliang Guo, Jianhong Bai, Runyi Yu, Tianyu He, Xu Tan, Xu Sun, and Jiang Bian. Instructavatar: Text-guided emotion and motion...

[25] Huawei Wei, Zejun Yang, and Zhisheng Wang. Aniportrait: Audio-driven synthesis of photorealistic portrait animation. arXiv preprint arXiv:2403.17694,

[26] Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528,

[27] Mingwang Xu, Hui Li, Qingkun Su, Hanlin Shang, Liwei Zhang, Ce Liu, Jingdong Wang, Yao Yao, and Siyu Zhu. Hallo: Hierarchical audio-driven visual synthesis for portrait image animation. arXiv preprint arXiv:2406.08801, 2024a.

Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, and Baining Guo. Vasa-1: Lifel...

[28] Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, et al. Language model beats diffusion -- tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737,

[29] Wenxuan Zhang, Xiaodong Cun, Xuan Wang, Yong Zhang, Xi Shen, Yu Guo, Ying Shan, and Fei Wang. Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8652–8661, 2023a.

Yue Zhang, Minhao Liu, Zhaokang Chen, Bi...

Furthermore, to foster future research and benefit the community, we will open-source this enhanced, high-resolution version of the CMLR dataset.

Figure 7: Original Video Frames from the Dataset.

Figure 8: Enhanced Video Frames after Super-Resolution.

A.2 TRAINING DETAILS

We trained the model on a mixed dataset that combines the super-resolved CMLR dataset...

In this study, we establish a baseline model trained exclusively with a token-level cross-entropy (CE) loss. We then incrementally incorporate our other proposed loss terms: the pixel-level LPIPS perceptual loss, identity consistency loss, and facial similarity loss. The experimental results clearly demonstrate that while each loss component individuall...
