
arxiv: 2512.04677 · v5 · submitted 2025-12-04 · 💻 cs.CV

Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

Pith reviewed 2026-05-17 01:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords audio-driven avatars · real-time streaming · diffusion models · infinite length generation · pipeline parallelism · video synthesis · causal distillation

The pith

Live Avatar enables real-time streaming of infinite-length audio-driven avatars using a 14-billion-parameter diffusion model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve the conflict between real-time streaming needs and the slow, drifting nature of diffusion models for avatar generation. It proposes distilling a large bidirectional model into a causal streaming version with few steps, plus strategies to maintain stability over very long sequences. On the hardware side, it uses a parallel pipeline across GPUs to boost speed and consistency. If successful, this would allow interactive, never-ending avatar videos that respond instantly to audio without quality loss or manual resets. A new benchmark is also provided to measure such long-form performance.

Core claim

Live Avatar introduces an algorithm-system co-designed framework for a 14-billion-parameter diffusion model. On the algorithm side, a two-stage pipeline distills a pretrained bidirectional model into a causal, few-step streaming one, while complementary long-horizon strategies eliminate identity drift and visual artifacts for stable autoregressive generation exceeding 10000 seconds. On the system side, Timestep-forcing Pipeline Parallelism assigns each GPU a fixed denoising timestep, turning the sequential diffusion chain into an asynchronous spatial pipeline that boosts throughput and improves temporal consistency.
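To make the bidirectional-to-causal step concrete, the sketch below contrasts a full attention mask with a block-causal one, in which a frame attends only to its own block and to earlier blocks. It is a minimal illustration rather than the paper's implementation: the block size, the function names, and the choice of a block-causal layout are assumptions made here.

```python
from typing import List

Mask = List[List[bool]]


def bidirectional_mask(num_frames: int) -> Mask:
    """Every frame may attend to every other frame (teacher-style attention)."""
    return [[True] * num_frames for _ in range(num_frames)]


def block_causal_mask(num_frames: int, block_size: int) -> Mask:
    """mask[q][k] is True when query frame q may attend to key frame k.

    Frames are grouped into consecutive blocks of `block_size` (a hypothetical
    parameter). A frame sees its whole block plus all earlier blocks, so no
    information flows backward from future blocks.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        [(k // block_size) <= (q // block_size) for k in range(num_frames)]
        for q in range(num_frames)
    ]


def render(mask: Mask) -> str:
    """Draw the mask with '#' for allowed and '.' for blocked attention."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in mask)


if __name__ == "__main__":
    print(render(block_causal_mask(num_frames=6, block_size=2)))
```

Under a causal mask, the keys and values of finished blocks never change, which is what makes caching them during streaming possible.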

What carries the argument

Timestep-forcing Pipeline Parallelism (TPP), which assigns each GPU a fixed denoising timestep and so converts sequential diffusion into an asynchronous pipeline, combined with the two-stage distillation and the long-horizon strategies.
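The scheduling idea behind TPP can be illustrated with a toy simulation in which each stage stands in for one GPU pinned to one denoising step, and video blocks flow through the stages one tick apart. The assumptions are loud ones: four stages, a uniform cost of one tick per stage, and no communication overhead. The paper's actual stage count and timings are not given in the text above, and all names here are hypothetical.

```python
from typing import List, Optional


def tpp_schedule(num_blocks: int, num_stages: int) -> List[List[Optional[int]]]:
    """Return schedule[tick][stage] = block index being processed, or None if idle.

    Assumption: stage s always runs denoising step s (its fixed timestep) and
    every stage takes exactly one tick.
    """
    total_ticks = num_blocks + num_stages - 1
    schedule = []
    for tick in range(total_ticks):
        row = []
        for stage in range(num_stages):
            block = tick - stage  # block 0 enters stage 0 at tick 0
            row.append(block if 0 <= block < num_blocks else None)
        schedule.append(row)
    return schedule


def sequential_ticks(num_blocks: int, num_stages: int) -> int:
    """Ticks needed when one device runs every denoising step of every block."""
    return num_blocks * num_stages


def pipelined_ticks(num_blocks: int, num_stages: int) -> int:
    """Ticks needed when each stage is a separate device working in parallel."""
    return num_blocks + num_stages - 1


if __name__ == "__main__":
    for tick, row in enumerate(tpp_schedule(num_blocks=6, num_stages=4)):
        print(tick, row)
    print(sequential_ticks(6, 4), pipelined_ticks(6, 4))
```

In this toy setting, six blocks through four stages take 9 ticks instead of 24. Each block still needs four ticks from entry to exit, so the gain is in throughput rather than per-block latency.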

Load-bearing premise

The long-horizon strategies and distillation process preserve visual quality and identity without introducing new artifacts or requiring per-sequence retraining for stable autoregressive generation exceeding 10000 seconds.

What would settle it

Generate a continuous 10000-second audio-driven avatar video, without any retraining, and check whether identity stays consistent and no new visual artifacts appear over time.
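One way such a check could be quantified is to compare per-frame face embeddings against the reference identity and look for a downward trend. The sketch below assumes the embeddings come from some external face-recognition model (not shown), and the windowing and slope test are illustrative choices made here, not the paper's evaluation protocol.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def windowed_identity_scores(
    reference: Sequence[float],
    frame_embeddings: Sequence[Sequence[float]],
    window: int,
) -> List[float]:
    """Mean similarity to the reference over consecutive windows of frames."""
    if window <= 0:
        raise ValueError("window must be positive")
    scores = []
    for start in range(0, len(frame_embeddings), window):
        chunk = frame_embeddings[start:start + window]
        sims = [cosine_similarity(reference, emb) for emb in chunk]
        scores.append(sum(sims) / len(sims))
    return scores


def drift_slope(scores: Sequence[float]) -> float:
    """Least-squares slope of score versus window index; negative means drift."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two windows")
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    num = sum((i - mean_x) * (s - mean_y) for i, s in enumerate(scores))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


if __name__ == "__main__":
    reference = [1.0, 0.0]
    # Synthetic stand-in for face embeddings that slowly rotate away.
    frames = [[math.cos(0.01 * t), math.sin(0.01 * t)] for t in range(100)]
    scores = windowed_identity_scores(reference, frames, window=25)
    print(scores, drift_slope(scores))
```

A slope close to zero across windows spanning the full duration would be consistent with stable identity. A clearly negative slope would indicate compounding drift.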

Figures

Figures reproduced from arXiv: 2512.04677 by Enhong Chen, Fangtai Wu, Hailong Guo, Jiaming Liu, Lin Liu, Qijun Gan, Shifeng Zhang, Shijie Huang, Sirui Zhao, Steven Hoi, Weiqiang Wang, Yubo Huang.

Figure 1: We propose Live Avatar, a powerful real-time streaming model capable of infinitely long audio-driven avatar generation, produc…
Figure 2: The Live Avatar Training Framework. (a) Stage 1 Diffusion Forcing Pretraining, showing the block-wise noise setup and the…
Figure 3: A visual illustration of Timestep-forcing Pipeline Parallelism (…
Figure 4: Qualitative comparisons with state-of-the-art methods.
Figure 5: Qualitative Comparison of Our Model, OmniAvatar, and…
Figure 6: Illustration of different inference settings. Horizontally, each row follows the spatial denoising order from low to high SNR;…
Figure 7: Visualization of the proposed Rolling-RoPE mechanism. Horizontally, each row follows the spatial denoising order from low to…
Figure 8: Multi-GPU Parallel Inference Timeline. This chart visualizes the computation and waiting periods for each GPU. The two distinct…
Figure 9: Visualization of the generated video at 10 s, 100 s, 1000 s, and 10000 s, demonstrating the model’s strong capability in long…
Original abstract

Audio-driven avatar interaction demands real-time, streaming, and infinite-length generation -- capabilities fundamentally at odds with the sequential denoising and long-horizon drift of current diffusion models. We present Live Avatar, an algorithm-system co-designed framework that addresses both challenges for a 14-billion-parameter diffusion model. On the algorithm side, a two-stage pipeline distills a pretrained bidirectional model into a causal, few-step streaming one, while a set of complementary long-horizon strategies eliminate identity drift and visual artifacts, enabling stable autoregressive generation exceeding 10000 seconds. On the system side, Timestep-forcing Pipeline Parallelism (TPP) assigns each GPU a fixed denoising timestep, converting the sequential diffusion chain into an asynchronous spatial pipeline that simultaneously boosts throughput and improves temporal consistency. Live Avatar achieves 45 FPS with a TTFF of 1.21 s on 5 H800 GPUs, and to our knowledge is the first to enable practical real-time streaming of a 14B diffusion model for infinite-length avatar generation. We further introduce GenBench, a standardized long-form benchmark, to facilitate reproducible evaluation. Our project page is at https://liveavatar.github.io/.
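The reported throughput can be turned into a per-frame time budget and a real-time margin with simple arithmetic. In the sketch below, the 45 FPS figure is from the abstract, while the 25 fps playback rate is a hypothetical value chosen only for illustration, since the playback rate is not stated in the text above.

```python
def frame_budget_ms(fps: float) -> float:
    """Average time available per generated frame, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def realtime_headroom(generation_fps: float, playback_fps: float) -> float:
    """Ratio above 1.0 means frames are produced faster than they are shown."""
    if playback_fps <= 0:
        raise ValueError("playback_fps must be positive")
    return generation_fps / playback_fps


if __name__ == "__main__":
    reported_fps = 45.0            # from the abstract
    assumed_playback_fps = 25.0    # hypothetical, not from the paper
    print(round(frame_budget_ms(reported_fps), 1))
    print(realtime_headroom(reported_fps, assumed_playback_fps))
```

At 45 FPS the average budget is about 22.2 ms per frame across the whole pipeline. Any playback rate below 45 fps leaves a margin for audio processing and transport.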

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Live Avatar, a framework for real-time streaming audio-driven avatar generation of infinite length. It distills a 14B bidirectional diffusion model into a causal few-step model using a two-stage pipeline and introduces long-horizon strategies to prevent identity drift and artifacts for autoregressive generation beyond 10000 seconds. System-wise, Timestep-forcing Pipeline Parallelism (TPP) enables parallel processing on multiple GPUs. Reported performance is 45 FPS with 1.21 s TTFF on 5 H800 GPUs, and a new benchmark GenBench is introduced.

Significance. Should the long-horizon stability and performance claims be validated through detailed experiments, this would be a significant contribution to the field of real-time avatar animation and diffusion model deployment. It tackles the challenges of sequential denoising and drift in diffusion models through co-design, potentially opening avenues for interactive applications. The benchmark introduction is a positive step for the community.

major comments (2)
  1. [§4 (Experiments and Long-horizon Evaluation)] The central claim of stable autoregressive generation exceeding 10000 seconds relies on the long-horizon strategies eliminating identity drift. However, the provided details do not include quantitative long-horizon metrics (e.g., identity preservation scores or visual quality assessments over extended durations), which are necessary to confirm that per-step inconsistencies do not compound. This is load-bearing for the infinite-length assertion.
  2. [§3.2 (Distillation Process)] The two-stage distillation into a causal few-step streaming model is key to enabling real-time performance. Clarify how the distillation preserves the audio-driven conditioning and visual fidelity without introducing artifacts that could affect the subsequent long-horizon rollout.
minor comments (2)
  1. [Abstract] The abstract states 'to our knowledge is the first', which is a strong claim; ensure the related work section provides a thorough comparison to prior streaming avatar methods to support this.
  2. [Throughout] Ensure that all figures include clear captions and that any ablation studies on the long-horizon strategies are presented with specific quantitative improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for highlighting the potential impact of our work. We address each major comment below and will strengthen the manuscript accordingly in the revision.

Point-by-point responses
  1. Referee: [§4 (Experiments and Long-horizon Evaluation)] The central claim of stable autoregressive generation exceeding 10000 seconds relies on the long-horizon strategies eliminating identity drift. However, the provided details do not include quantitative long-horizon metrics (e.g., identity preservation scores or visual quality assessments over extended durations), which are necessary to confirm that per-step inconsistencies do not compound. This is load-bearing for the infinite-length assertion.

    Authors: We agree that quantitative long-horizon metrics are necessary to rigorously support the stability claims. The current manuscript emphasizes qualitative results and short-sequence metrics to illustrate the effectiveness of our strategies, but we acknowledge the need for extended evaluation. In the revised version, we will add Section 4.4 with quantitative metrics including face embedding similarity (e.g., ArcFace cosine similarity) and perceptual scores (LPIPS, FID) computed on sequences of increasing duration up to 10000 seconds. New plots will demonstrate that these metrics remain stable and do not show compounding degradation, directly addressing the concern about per-step inconsistencies. revision: yes

  2. Referee: [§3.2 (Distillation Process)] The two-stage distillation into a causal few-step streaming model is key to enabling real-time performance. Clarify how the distillation preserves the audio-driven conditioning and visual fidelity without introducing artifacts that could affect the subsequent long-horizon rollout.

    Authors: We will expand Section 3.2 with additional details on the distillation pipeline. The first stage adapts the bidirectional teacher to a causal model while preserving audio conditioning via consistent cross-attention between audio features and visual latents, trained with a combination of denoising and audio-visual alignment losses. The second stage applies few-step consistency distillation augmented with perceptual and synchronization objectives to maintain fidelity. We will include ablation results showing that short-sequence performance matches the teacher model and that no artifacts are introduced that propagate in long-horizon rollouts, as confirmed by our existing long-sequence qualitative evaluations. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected; claims rest on empirical engineering contributions

Full rationale

The paper describes a two-stage distillation pipeline, complementary long-horizon strategies, and Timestep-forcing Pipeline Parallelism (TPP) as independent algorithmic and system-level contributions. Performance results (45 FPS, 1.21 s TTFF) and the claim of stable autoregressive generation exceeding 10000 seconds are presented as outcomes of direct measurement on hardware, supported by the new GenBench benchmark. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations that reduce the central claims to their own inputs appear in the provided text. The derivation chain is therefore self-contained against external benchmarks and measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The framework assumes standard diffusion model properties and introduces no new physical entities; a small number of hyperparameters for the distillation and drift-correction strategies are implicit but not enumerated in the abstract.

pith-pipeline@v0.9.0 · 5539 in / 1067 out tokens · 32631 ms · 2026-05-17T01:53:09.269962+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    Self-Forcing Distribution Matching Distillation to facilitate causal, streamable adaptation... Timestep-forcing Pipeline Parallelism (TPP) assigns each GPU a fixed denoising timestep... Rolling Sink Frame Mechanism (RSFM) dynamically recalibrates appearance using a cached reference image.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives

    cs.CV 2026-05 unverdicted novelty 7.0

    CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.

  2. Efficient Video Diffusion Models: Advancements and Challenges

    cs.CV 2026-04 unverdicted novelty 7.0

    A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.

  3. Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Forcing-KV applies head-specific static and dynamic pruning to KV caches in AR video diffusion models, achieving over 29 fps, 30% memory reduction, and up to 2.82x speedup at maintained quality.

  4. LPM 1.0: Video-based Character Performance Model

    cs.CV 2026-04 unverdicted novelty 6.0

    LPM 1.0 generates infinite-length, identity-stable, real-time audio-visual conversational performances for single characters using a distilled causal diffusion transformer and a new benchmark.

  5. Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms

    eess.IV 2026-03 unverdicted novelty 6.0

    Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.

  6. Do Protective Perturbations Really Protect Portrait Privacy under Real-world Image Transformations?

    cs.CV 2026-04 conditional novelty 5.0

    Pixel-level protective perturbations for portrait privacy are ineffective against common image transformations, and a low-cost purification framework can strip them out.

  7. OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

    cs.CV 2026-04 unverdicted novelty 4.0

    OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.

  8. EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation

    cs.CV 2026-02 unverdicted novelty 4.0

    EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consi...
