Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

arxiv: 2512.04677 · v5 · submitted 2025-12-04 · 💻 cs.CV

Yubo Huang, Hailong Guo, Fangtai Wu, Weiqiang Wang, Shifeng Zhang, Shijie Huang, Qijun Gan, Lin Liu, Sirui Zhao, Enhong Chen, Jiaming Liu, Steven Hoi

Pith reviewed 2026-05-17 01:53 UTC · model grok-4.3

classification 💻 cs.CV

keywords audio-driven avatars · real-time streaming · diffusion models · infinite length generation · pipeline parallelism · video synthesis · causal distillation

The pith

Live Avatar enables real-time streaming of infinite-length audio-driven avatars using a 14-billion-parameter diffusion model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve the conflict between real-time streaming needs and the slow, drifting nature of diffusion models for avatar generation. It proposes distilling a large bidirectional model into a causal streaming version with few steps, plus strategies to maintain stability over very long sequences. On the hardware side, it uses a parallel pipeline across GPUs to boost speed and consistency. If successful, this would allow interactive, never-ending avatar videos that respond to audio in real time without quality loss or manual resets. A new benchmark is also provided to measure such long-form performance.

Core claim

Live Avatar introduces an algorithm-system co-designed framework for a 14-billion-parameter diffusion model. On the algorithm side, a two-stage pipeline distills a pretrained bidirectional model into a causal, few-step streaming one, while complementary long-horizon strategies eliminate identity drift and visual artifacts for stable autoregressive generation exceeding 10000 seconds. On the system side, Timestep-forcing Pipeline Parallelism assigns each GPU a fixed denoising timestep, turning the sequential diffusion chain into an asynchronous spatial pipeline that boosts throughput and improves temporal consistency.

What carries the argument

The argument rests on Timestep-forcing Pipeline Parallelism (TPP), which assigns each GPU a fixed denoising timestep and thereby converts the sequential diffusion chain into an asynchronous pipeline, together with the two-stage distillation and the long-horizon strategies.
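The pipelining idea behind TPP can be illustrated with a toy schedule. The sketch below is not the paper's implementation: it assumes every denoising step takes one equal "tick", ignores transfer costs between GPUs, and uses a lock-step tick model in place of the asynchronous execution the paper describes. The function names (`tpp_schedule`, `sequential_ticks`, `pipelined_ticks`) and the stage and block counts in the usage note are hypothetical.

```python
from typing import List, Optional


def tpp_schedule(num_stages: int, num_blocks: int) -> List[List[Optional[int]]]:
    """Toy pipeline schedule (assumption: one stage = one GPU = one fixed timestep).

    Returns one row per tick. Entry `row[stage]` is the index of the video block
    that stage is denoising at that tick, or None if the stage is idle.
    Stage 0 holds the noisiest timestep; the last stage holds the cleanest one.
    """
    if num_stages < 1 or num_blocks < 1:
        raise ValueError("num_stages and num_blocks must be positive")
    total_ticks = num_blocks + num_stages - 1  # fill + steady state + drain
    schedule: List[List[Optional[int]]] = []
    for tick in range(total_ticks):
        row: List[Optional[int]] = []
        for stage in range(num_stages):
            block = tick - stage  # block b reaches stage s at tick b + s
            row.append(block if 0 <= block < num_blocks else None)
        schedule.append(row)
    return schedule


def sequential_ticks(num_stages: int, num_blocks: int) -> int:
    """Ticks needed if one device runs every denoising step of every block in turn."""
    return num_stages * num_blocks


def pipelined_ticks(num_stages: int, num_blocks: int) -> int:
    """Ticks needed when each stage works on a different block concurrently."""
    return num_blocks + num_stages - 1
```

With an illustrative 4 stages and 6 blocks, the sequential chain needs 24 ticks while the pipelined schedule needs 9. Once the pipeline is full, one finished block leaves the last stage per tick, which is the throughput benefit the summary attributes to TPP. The first block still has to pass through every stage, so pipelining in this toy model does not shorten time to first frame.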

Load-bearing premise

The long-horizon strategies and distillation process preserve visual quality and identity without introducing new artifacts or requiring per-sequence retraining for stable autoregressive generation exceeding 10000 seconds.

What would settle it

Generate a continuous 10000-second audio-driven avatar video and check whether identity stays consistent and no new visual artifacts appear over time, without any retraining.

Figures

Figures reproduced from arXiv: 2512.04677 by Enhong Chen, Fangtai Wu, Hailong Guo, Jiaming Liu, Lin Liu, Qijun Gan, Shifeng Zhang, Shijie Huang, Sirui Zhao, Steven Hoi, Weiqiang Wang, Yubo Huang.

**Figure 1.** We propose Live Avatar, a powerful real-time streaming model capable of infinitely long audio-driven avatar generation, produc...

**Figure 2.** The Live Avatar Training Framework. (a) Stage 1 Diffusion Forcing Pretraining, showing the block-wise noise setup and the ...

**Figure 3.** A visual illustration of Timestep-forcing Pipeline Parallelism ...

**Figure 4.** Qualitative comparisons with state-of-the-art methods.

**Figure 5.** Qualitative Comparison of Our Model, OmniAvatar, and ...

**Figure 6.** Illustration of different inference settings. Horizontally, each row follows the spatial denoising order from low to high SNR; ...

**Figure 7.** Visualization of the proposed Rolling-RoPE mechanism. Horizontally, each row follows the spatial denoising order from low to ...

**Figure 8.** Multi-GPU Parallel Inference Timeline. This chart visualizes the computation and waiting periods for each GPU. The two distinct ...

**Figure 9.** Visualization of the generated video at 10 s, 100 s, 1000 s, and 10000 s, demonstrating the model’s strong capability in long ...

Original abstract

Audio-driven avatar interaction demands real-time, streaming, and infinite-length generation -- capabilities fundamentally at odds with the sequential denoising and long-horizon drift of current diffusion models. We present Live Avatar, an algorithm-system co-designed framework that addresses both challenges for a 14-billion-parameter diffusion model. On the algorithm side, a two-stage pipeline distills a pretrained bidirectional model into a causal, few-step streaming one, while a set of complementary long-horizon strategies eliminate identity drift and visual artifacts, enabling stable autoregressive generation exceeding 10000 seconds. On the system side, Timestep-forcing Pipeline Parallelism (TPP) assigns each GPU a fixed denoising timestep, converting the sequential diffusion chain into an asynchronous spatial pipeline that simultaneously boosts throughput and improves temporal consistency. Live Avatar achieves 45 FPS with a TTFF of 1.21 s on 5 H800 GPUs, and to our knowledge is the first to enable practical real-time streaming of a 14B diffusion model for infinite-length avatar generation. We further introduce GenBench, a standardized long-form benchmark, to facilitate reproducible evaluation. Our project page is at https://liveavatar.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper gets a 14B diffusion model streaming real-time avatars at 45 FPS with low latency and claims stable infinite-length output, but the long-horizon drift fix is the part that still needs solid evidence.

The main takeaway is that Live Avatar combines two-stage distillation to turn a bidirectional diffusion model into a causal few-step streamer with Timestep-forcing Pipeline Parallelism that spreads denoising timesteps across GPUs. This setup reportedly hits 45 FPS and 1.21 s TTFF on five H800s while supporting autoregressive runs past 10,000 seconds. They also release GenBench for long-form testing. That engineering co-design is the concrete advance over prior audio-driven avatar work, and the performance numbers are specific enough to check.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Live Avatar, a framework for real-time streaming audio-driven avatar generation of infinite length. It distills a 14B bidirectional diffusion model into a causal few-step model using a two-stage pipeline and introduces long-horizon strategies to prevent identity drift and artifacts for autoregressive generation beyond 10000 seconds. System-wise, Timestep-forcing Pipeline Parallelism (TPP) enables parallel processing on multiple GPUs. Reported performance is 45 FPS with 1.21 s TTFF on 5 H800 GPUs, and a new benchmark GenBench is introduced.

Significance. Should the long-horizon stability and performance claims be validated through detailed experiments, this would be a significant contribution to the field of real-time avatar animation and diffusion model deployment. It tackles the challenges of sequential denoising and drift in diffusion models through co-design, potentially opening avenues for interactive applications. The benchmark introduction is a positive step for the community.

major comments (2)

§4 (Experiments and Long-horizon Evaluation): The central claim of stable autoregressive generation exceeding 10000 seconds relies on the long-horizon strategies eliminating identity drift. However, the provided details do not include quantitative long-horizon metrics (e.g., identity preservation scores or visual quality assessments over extended durations), which are necessary to confirm that per-step inconsistencies do not compound. This is load-bearing for the infinite-length assertion.
§3.2 (Distillation Process): The two-stage distillation into a causal few-step streaming model is key to enabling real-time performance. Clarify how the distillation preserves the audio-driven conditioning and visual fidelity without introducing artifacts that could affect the subsequent long-horizon rollout.

minor comments (2)

Abstract: The abstract states 'to our knowledge is the first', which is a strong claim; ensure the related work section provides a thorough comparison to prior streaming avatar methods to support this.
Throughout: Ensure that all figures include clear captions and that any ablation studies on the long-horizon strategies are presented with specific quantitative improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for highlighting the potential impact of our work. We address each major comment below and will strengthen the manuscript accordingly in the revision.

Referee: §4 (Experiments and Long-horizon Evaluation): The central claim of stable autoregressive generation exceeding 10000 seconds relies on the long-horizon strategies eliminating identity drift. However, the provided details do not include quantitative long-horizon metrics (e.g., identity preservation scores or visual quality assessments over extended durations), which are necessary to confirm that per-step inconsistencies do not compound. This is load-bearing for the infinite-length assertion.

Authors: We agree that quantitative long-horizon metrics are necessary to rigorously support the stability claims. The current manuscript emphasizes qualitative results and short-sequence metrics to illustrate the effectiveness of our strategies, but we acknowledge the need for extended evaluation. In the revised version, we will add Section 4.4 with quantitative metrics including face embedding similarity (e.g., ArcFace cosine similarity) and perceptual scores (LPIPS, FID) computed on sequences of increasing duration up to 10000 seconds. New plots will demonstrate that these metrics remain stable and do not show compounding degradation, directly addressing the concern about per-step inconsistencies.

revision: yes

Referee: §3.2 (Distillation Process): The two-stage distillation into a causal few-step streaming model is key to enabling real-time performance. Clarify how the distillation preserves the audio-driven conditioning and visual fidelity without introducing artifacts that could affect the subsequent long-horizon rollout.

Authors: We will expand Section 3.2 with additional details on the distillation pipeline. The first stage adapts the bidirectional teacher to a causal model while preserving audio conditioning via consistent cross-attention between audio features and visual latents, trained with a combination of denoising and audio-visual alignment losses. The second stage applies few-step consistency distillation augmented with perceptual and synchronization objectives to maintain fidelity. We will include ablation results showing that short-sequence performance matches the teacher model and that no artifacts are introduced that propagate in long-horizon rollouts, as confirmed by our existing long-sequence qualitative evaluations.

revision: yes
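The long-horizon check named in the first response (face-embedding cosine similarity tracked over increasing durations) could be computed along the following lines. This is a minimal sketch, not the authors' evaluation code: it assumes the face embeddings have already been extracted by some recognition model (ArcFace is the one the simulated rebuttal mentions), and the helper names `cosine_similarity`, `identity_curve`, and `drift_slope` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def identity_curve(
    reference: Sequence[float], sampled: Sequence[Sequence[float]]
) -> List[float]:
    """Similarity of each sampled frame embedding to the reference identity."""
    return [cosine_similarity(reference, emb) for emb in sampled]


def drift_slope(timestamps: Sequence[float], similarities: Sequence[float]) -> float:
    """Least-squares slope of similarity against time.

    A slope near zero is consistent with stable identity; a clearly negative
    slope indicates similarity to the reference is decaying as time advances.
    """
    if len(timestamps) != len(similarities) or len(timestamps) < 2:
        raise ValueError("need at least two (time, similarity) pairs")
    n = len(timestamps)
    mean_t = sum(timestamps) / n
    mean_s = sum(similarities) / n
    var_t = sum((t - mean_t) ** 2 for t in timestamps)
    if var_t == 0.0:
        raise ValueError("timestamps must not all be identical")
    cov = sum((t - mean_t) * (s - mean_s) for t, s in zip(timestamps, similarities))
    return cov / var_t
```

Sampling frames at, say, 10 s, 100 s, 1000 s, and 10000 s (the checkpoints shown in Figure 9) and reporting both the curve and its slope would give a quantitative counterpart to the qualitative figure. A linear slope is only one possible summary and would not capture sudden identity jumps.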

Circularity Check

0 steps flagged

No significant circularity detected; claims rest on empirical engineering contributions

The paper describes a two-stage distillation pipeline, complementary long-horizon strategies, and Timestep-forcing Pipeline Parallelism (TPP) as independent algorithmic and system-level contributions. Performance results (45 FPS, 1.21 s TTFF) and the claim of stable autoregressive generation exceeding 10000 seconds are presented as outcomes of direct measurement on hardware, supported by the new GenBench benchmark. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations that reduce the central claims to their own inputs appear in the provided text. The derivation chain is therefore self-contained against external benchmarks and measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The framework assumes standard diffusion model properties and introduces no new physical entities; a small number of hyperparameters for the distillation and drift-correction strategies are implicit but not enumerated in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Self-Forcing Distribution Matching Distillation to facilitate causal, streamable adaptation... Timestep-forcing Pipeline Parallelism (TPP) assigns each GPU a fixed denoising timestep... Rolling Sink Frame Mechanism (RSFM) dynamically recalibrates appearance using a cached reference image.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives
cs.CV · 2026-05 · unverdicted · novelty 7.0

CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.

Efficient Video Diffusion Models: Advancements and Challenges
cs.CV · 2026-04 · unverdicted · novelty 7.0

A survey that groups efficient video diffusion methods into four paradigms (step distillation, efficient attention, model compression, and cache/trajectory optimization) and outlines open challenges for practical use.

Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models
cs.CV · 2026-05 · unverdicted · novelty 6.0

Forcing-KV applies head-specific static and dynamic pruning to KV caches in AR video diffusion models, achieving over 29 fps, 30% memory reduction, and up to 2.82x speedup at maintained quality.

LPM 1.0: Video-based Character Performance Model
cs.CV · 2026-04 · unverdicted · novelty 6.0

LPM 1.0 generates infinite-length, identity-stable, real-time audio-visual conversational performances for single characters using a distilled causal diffusion transformer and a new benchmark.

Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms
eess.IV · 2026-03 · unverdicted · novelty 6.0

Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.

Do Protective Perturbations Really Protect Portrait Privacy under Real-world Image Transformations?
cs.CV · 2026-04 · conditional · novelty 5.0

Pixel-level protective perturbations for portrait privacy are ineffective against common image transformations, and a low-cost purification framework can strip them out.

OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
cs.CV · 2026-04 · unverdicted · novelty 4.0

OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.

EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation
cs.CV · 2026-02 · unverdicted · novelty 4.0

EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consi...

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · cited by 8 Pith papers · 14 internal anchors

[1]

Body of her: A preliminary study on end-to-end humanoid agent.arXiv preprint arXiv:2408.02879, 2024

Tenglong Ao. Body of her: A preliminary study on end-to-end humanoid agent.arXiv preprint arXiv:2408.02879, 2024. 3

work page arXiv 2024
[2]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024. 2, 3

work page 2024
[3]

Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

Boyuan Chen, Diego Mart ´ı Mons´o, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024. 2

work page 2024
[4]

Midas: Multimodal interactive digital-human synthesis via real-time autoregressive video generation.arXiv preprint arXiv:2508.19320, 2025

Ming Chen, Liyuan Cui, Wenyuan Zhang, Haoxian Zhang, Yan Zhou, Xiaohan Li, Songlin Tang, Jiwen Liu, Borui Liao, Hejia Chen, et al. Midas: Multimodal interactive digital-human synthesis via real-time autoregressive video generation.arXiv preprint arXiv:2508.19320, 2025. 3

work page arXiv 2025
[5]

Out of time: automated lip sync in the wild

Joon Son Chung and Andrew Zisserman. Out of time: automated lip sync in the wild. InComputer Vision–ACCV 2016 Workshops: ACCV 2016 International Workshops, Taipei, Taiwan, November 20-24, 2016, Revised Selected Papers, Part II 13, pages 251–263. Springer, 2017. 7

work page 2016
[6]

Hallo3: Highly dynamic and realistic portrait image animation with video diffusion transformer

Jiahao Cui, Hui Li, Yun Zhan, Hanlin Shang, Kaihui Cheng, Yuqi Ma, Shan Mu, Hang Zhou, Jingdong Wang, and Siyu Zhu. Hallo3: Highly dynamic and realistic portrait image animation with video diffusion transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 21086–21095, 2025. 2, 7, 9, 4

work page 2025
[7]

Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute-scale high-quality video generation.arXiv preprint arXiv:2510.02283, 2025. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Rap: Real-time audio-driven portrait animation with video diffusion transformer.arXiv preprint arXiv:2508.05115, 2025

Fangyu Du, Taiqing Li, Ziwei Zhang, Qian Qiao, Tan Yu, Dingcheng Zhen, Xu Jia, Yang Yang, Shunshun Yin, and Siyuan Liu. Rap: Real-time audio-driven portrait animation with video diffusion transformer.arXiv preprint arXiv:2508.05115, 2025. 3

work page arXiv 2025
[9]

Cosyvoice 2: Scalable streaming speech synthesis with large language models, 2024

Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, Fan Yu, Huadai Liu, Zhengyan Sheng, Yue Gu, Chong Deng, Wen Wang, Shiliang Zhang, Zhijie Yan, and Jingren Zhou. Cosyvoice 2: Scalable streaming speech synthesis with large language models, 2024. 6

work page 2024
[10]

Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation

Ariel Ephrat, Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T Freeman, and Michael Rubin- stein. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation.arXiv preprint arXiv:1804.03619, 2018. 6

work page internal anchor Pith review Pith/arXiv arXiv 2018
[11]

Phased dmd: Few-step distribution matching distillation via score matching within subintervals.arXiv preprint arXiv:2510.27684,

Xiangyu Fan, Zesong Qiu, Zhuguanyu Wu, Fanzhou Wang, Zhiqian Lin, Tianxiang Ren, Dahua Lin, Ruihao Gong, and Lei Yang. Phased dmd: Few-step distribution matching distillation via score matching within subintervals.arXiv preprint arXiv:2510.27684,

work page arXiv
[12]

One Step Diffusion via Shortcut Models

Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models.arXiv preprint arXiv:2410.12557, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Omniavatar: Efficient audio-driven avatar video generation with adaptive body animation.arXiv preprint arXiv:2506.18866, 2025

Qijun Gan, Ruizi Yang, Jianke Zhu, Shaofei Xue, and Steven Hoi. Omniavatar: Efficient audio-driven avatar video generation with adaptive body animation.arXiv preprint arXiv:2506.18866, 2025. 2, 6, 7, 9, 4

work page arXiv 2025
[14]

Wan-s2v: Audio-driven cinematic video generation, 2025

Xin Gao, Li Hu, Siqi Hu, Mingyang Huang, Chaonan Ji, Dechao Meng, Jinwei Qi, Penchong Qiao, Zhen Shen, Yafei Song, Ke Sun, Linrui Tian, Guangyuan Wang, Qi Wang, Zhongjian Wang, Jiayu Xiao, Sheng Xu, Bang Zhang, Peng Zhang, Xindi Zhang, Zhe Zhang, Jingren Zhou, and Lian Zhuo. Wan-s2v: Audio-driven cinematic video generation, 2025. 2, 7, 9, 4

work page 2025
[15]

Wan-s2v: Audio-driven cinematic video generation.arXiv preprint arXiv:2508.18621,

Xin Gao, Li Hu, Siqi Hu, Mingyang Huang, Chaonan Ji, Dechao Meng, Jinwei Qi, Penchong Qiao, Zhen Shen, Yafei Song, et al. Wan-s2v: Audio-driven cinematic video generation.arXiv preprint arXiv:2508.18621, 2025. 2, 6

work page arXiv 2025
[16]

Arig: Autoregressive interactive head generation for real-time conversations.arXiv preprint arXiv:2507.00472, 2025

Ying Guo, Xi Liu, Cheng Zhen, Pengfei Yan, and Xiaoming Wei. Arig: Autoregressive interactive head generation for real-time conversations.arXiv preprint arXiv:2507.00472, 2025. 3

work page arXiv 2025
[17]

LTX-Video: Realtime Video Latent Diffusion

Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017. 7

work page 2017
[19]

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion.arXiv preprint arXiv:2506.08009, 2025. 2, 3, 5, 6, 1

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Omnihuman-1.5: Instilling an active mind in avatars via cognitive simulation.arXiv preprint arXiv:2508.19209, 2025

Jianwen Jiang, Weihong Zeng, Zerong Zheng, Jiaqi Yang, Chao Liang, Wang Liao, Han Liang, Yuan Zhang, and Mingyuan Gao. Omnihuman-1.5: Instilling an active mind in avatars via cognitive simulation.arXiv preprint arXiv:2508.19209, 2025. 2

work page arXiv 2025
[21]

Streamdit: Real-time streaming text-to-video generation.arXiv preprint arXiv:2507.03745, 2025

Akio Kodaira, Tingbo Hou, Ji Hou, Masayoshi Tomizuka, and Yue Zhao. Streamdit: Real-time streaming text-to-video generation. arXiv preprint arXiv:2507.03745, 2025. 2

work page arXiv 2025
[22]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Autoregressive image generation without vector quantization

Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. Advances in Neural Information Processing Systems, 37:56424–56445, 2024. 3

work page 2024
[24]

Ditto: Motion-space diffusion for controllable realtime talking head synthesis.arXiv preprint arXiv:2411.19509, 2024

Tianqi Li, Ruobing Zheng, Minghui Yang, Jingdong Chen, and Ming Yang. Ditto: Motion-space diffusion for controllable realtime talking head synthesis.arXiv preprint arXiv:2411.19509, 2024. 2, 3, 7, 9, 4

work page arXiv 2024
[25]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[26]

Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time. arXiv preprint arXiv:2509.25161, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Talkingmachines: Real-time audio-driven facetime-style video via autoregressive diffusion models

Chetwin Low and Weimin Wang. Talkingmachines: Real-time audio-driven facetime-style video via autoregressive diffusion models. arXiv preprint arXiv:2506.03099, 2025. 3, 5, 6

work page arXiv 2025
[28]

Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models

Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models.arXiv preprint arXiv:2410.11081, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

Diff-instruct++: Training one-step text-to-image generator model to align with human preferences.arXiv preprint arXiv:2410.18881, 2024

Weijian Luo. Diff-instruct++: Training one-step text-to-image generator model to align with human preferences.arXiv preprint arXiv:2410.18881, 2024. 3, 7

work page arXiv 2024
[30]

Diff-instruct: A universal approach for transferring knowledge from pre-trained diffusion models.Advances in Neural Information Processing Systems, 36:76525–76546,

Weijian Luo, Tianyang Hu, Shifeng Zhang, Jiacheng Sun, Zhenguo Li, and Zhihua Zhang. Diff-instruct: A universal approach for transferring knowledge from pre-trained diffusion models.Advances in Neural Information Processing Systems, 36:76525–76546,

work page
[31]

Learning few-step diffusion models by trajectory distribution matching.arXiv preprint arXiv:2503.06674, 2025

Yihong Luo, Tianyang Hu, Jiacheng Sun, Yujun Cai, and Jing Tang. Learning few-step diffusion models by trajectory distribution matching.arXiv preprint arXiv:2503.06674, 2025. 3, 7

work page arXiv 2025
[32]

Mirrorme: Towards realtime and high fidelity audio-driven halfbody animation.arXiv preprint arXiv:2506.22065, 2025

Dechao Meng, Steven Xiao, Xindi Zhang, Guangyuan Wang, Peng Zhang, Qi Wang, Bang Zhang, and Liefeng Bo. Mirrorme: Towards realtime and high fidelity audio-driven halfbody animation.arXiv preprint arXiv:2506.22065, 2025. 3

work page arXiv 2025
[33]

Echomimicv2: Towards striking, simplified, and semi-body human animation, 2025

Rang Meng, Xingyu Zhang, Yuming Li, and Chenguang Ma. Echomimicv2: Towards striking, simplified, and semi-body human animation, 2025. 7, 9, 4

work page 2025
[34]

A lip sync expert is all you need for speech to lip generation in the wild

KR Prajwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, and CV Jawahar. A lip sync expert is all you need for speech to lip generation in the wild. InProceedings of the 28th ACM international conference on multimedia, pages 484–492, 2020. 3

work page 2020
[35]

Nabyl Quignon, Baptiste Chopin, Yaohui Wang, and Antitza Dantcheva. Theval. evaluation framework for talking head video generation, 2025. 8

work page 2025
[36]

Consistency models

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. 2023. 3

work page 2023
[37]

Emo: Emote portrait alive generating expressive portrait videos with audio2video diffusion model under weak conditions

Linrui Tian, Qi Wang, Bang Zhang, and Liefeng Bo. Emo: Emote portrait alive generating expressive portrait videos with audio2video diffusion model under weak conditions. InEuropean Conference on Computer Vision, pages 244–260. Springer, 2024. 3

work page 2024
[38]

Sta- bleavatar: Infinite-length audio-driven avatar video generation.arXiv preprint arXiv:2508.08248, 2025

Shuyuan Tu, Yueming Pan, Yinming Huang, Xintong Han, Zhen Xing, Qi Dai, Chong Luo, Zuxuan Wu, and Yu-Gang Jiang. Sta- bleavatar: Infinite-length audio-driven avatar video generation.arXiv preprint arXiv:2508.08248, 2025. 2, 6, 7, 9, 4

work page arXiv 2025
[39]

Towards Accurate Generative Models of Video: A New Metric & Challenges

Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges.arXiv preprint arXiv:1812.01717, 2018. 7

work page internal anchor Pith review Pith/arXiv arXiv 2018
[40]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

Fantasytalking: Realistic talking portrait generation via coherent motion synthesis

Mengchao Wang, Qiang Wang, Fan Jiang, Yaqi Fan, Yunpeng Zhang, Yonggang Qi, Kun Zhao, and Mu Xu. Fantasytalking: Realistic talking portrait generation via coherent motion synthesis. InProceedings of the 33rd ACM International Conference on Multimedia, pages 9891–9900, 2025. 8

work page 2025
[42]

Omnitalker: Real-time text- driven talking head generation with in-context audio-visual style replication.arXiv e-prints, pages arXiv–2504, 2025

Zhongjian Wang, Peng Zhang, Jinwei Qi, Guangyuan Wang Sheng Xu, Bang Zhang, and Liefeng Bo. Omnitalker: Real-time text- driven talking head generation with in-context audio-visual style replication.arXiv e-prints, pages arXiv–2504, 2025. 3

work page 2025
[43]

Qwen-image technical report, 2025

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun...

[44] Haoning Wu, Zicheng Zhang, Weixia Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Yixuan Gao, Annan Wang, Erli Zhang, Wenxiu Sun, et al. Q-align: Teaching lmms for visual scoring via discrete text-defined levels. arXiv preprint arXiv:2312.17090, 2023. 7

[45] You Xie, Tianpei Gu, Zenan Li, Chenxu Zhang, Guoxian Song, Xiaochen Zhao, Chao Liang, Jianwen Jiang, Hongyi Xu, and Linjie Luo. X-streamer: Unified human world modeling with audiovisual interaction. arXiv preprint arXiv:2509.21574, 2025. 3, 4

[46] Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, et al. Longlive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622, 2025. 2, 3, 5, 1

[47] Shaoshu Yang, Zhe Kong, Feng Gao, Meng Cheng, Xiangyu Liu, Yong Zhang, Zhuoliang Kang, Wenhan Luo, Xunliang Cai, Ran He, et al. Infinitetalk: Audio-driven video generation for sparse-frame video dubbing. arXiv preprint arXiv:2508.14033, 2025. 2

[48] Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and Bill Freeman. Improved distribution matching distillation for fast image synthesis. Advances in neural information processing systems, 37:47455–47487, 2024. 3, 4

[49] Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024. 3

[50] Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22963–22974, 2025. 2, 3, 4, 5, 1

[51] Haojie Yu, Zhaonian Wang, Yihan Pan, Meng Cheng, Hao Yang, Chao Wang, Tao Xie, Xiaoming Xu, Xiaoming Wei, and Xunliang Cai. Llia–enabling low-latency interactive avatars: Real-time audio-driven portrait video generation with diffusion models. arXiv preprint arXiv:2506.05806, 2025. 3

[52] Wenxuan Zhang, Xiaodong Cun, Xuan Wang, Yong Zhang, Xi Shen, Yu Guo, Ying Shan, and Fei Wang. Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8652–8661, 2023. 3

[53] Dingcheng Zhen, Shunshun Yin, Shiyang Qin, Hou Yi, Ziwei Zhang, Siyuan Liu, Gan Qi, and Ming Tao. Teller: Real-time streaming audio-driven portrait animation with autoregressive motion generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21075–21085, 2025. 3

[54] Kaiwen Zheng, Yuji Wang, Qianli Ma, Huayu Chen, Jintao Zhang, Yogesh Balaji, Jianfei Chen, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Large scale diffusion distillation via score-regularized continuous-time consistency. arXiv preprint arXiv:2510.08431, 2025. 3

[55] Yongming Zhu, Longhao Zhang, Zhengkun Rong, Tianshu Hu, Shuang Liang, and Zhipeng Ge. Infp: Audio-driven interactive head generation in dyadic conversations. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10667–10677,

