Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length
Pith reviewed 2026-05-17 01:53 UTC · model grok-4.3
The pith
Live Avatar enables real-time streaming of infinite-length audio-driven avatars using a 14-billion-parameter diffusion model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Live Avatar introduces an algorithm-system co-designed framework for a 14-billion-parameter diffusion model. On the algorithm side, a two-stage pipeline distills a pretrained bidirectional model into a causal, few-step streaming one, while complementary long-horizon strategies eliminate identity drift and visual artifacts for stable autoregressive generation exceeding 10000 seconds. On the system side, Timestep-forcing Pipeline Parallelism assigns each GPU a fixed denoising timestep, turning the sequential diffusion chain into an asynchronous spatial pipeline that boosts throughput and improves temporal consistency.
What carries the argument
Timestep-forcing Pipeline Parallelism (TPP), which assigns each GPU a fixed denoising timestep to convert the sequential diffusion chain into an asynchronous pipeline, combined with the two-stage distillation and the long-horizon strategies.
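The fixed-timestep assignment can be pictured as a classic pipeline schedule. The sketch below is an illustration under stated assumptions, not the paper's implementation: it assumes one pipeline stage per denoising step, equal cost per stage, and a hypothetical unit of work called a "block" of frames. The names `tpp_schedule`, `sequential_ticks`, and `pipelined_ticks` are invented for this example, and the sketch ignores how each stage obtains its conditioning context from earlier blocks, which the summary does not describe.

```python
from typing import List, Optional


def tpp_schedule(num_stages: int, num_blocks: int) -> List[List[Optional[int]]]:
    """Return, for every tick, the block index each stage is working on.

    Stage s always applies denoising step s (hypothetical mapping).
    Block b enters stage 0 at tick b and reaches stage s at tick b + s.
    None marks a stage that is idle during pipeline fill or drain.
    """
    total_ticks = num_blocks + num_stages - 1
    schedule = []
    for tick in range(total_ticks):
        row = []
        for stage in range(num_stages):
            block = tick - stage
            row.append(block if 0 <= block < num_blocks else None)
        schedule.append(row)
    return schedule


def sequential_ticks(num_stages: int, num_blocks: int) -> int:
    """Ticks needed when one device runs every denoising step of every block."""
    return num_stages * num_blocks


def pipelined_ticks(num_stages: int, num_blocks: int) -> int:
    """Ticks needed when each stage handles one fixed step: fill, then one block per tick."""
    return num_blocks + num_stages - 1


if __name__ == "__main__":
    # 4 stages and 6 blocks are arbitrary illustrative values.
    for tick, row in enumerate(tpp_schedule(4, 6)):
        print(tick, row)
    print("sequential:", sequential_ticks(4, 6), "pipelined:", pipelined_ticks(4, 6))
```

Under these assumptions the first block still pays the full chain latency, while later blocks complete at one per tick, which is the throughput gain a pipeline of this shape would provide.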
Load-bearing premise
The long-horizon strategies and distillation process preserve visual quality and identity without introducing new artifacts or requiring per-sequence retraining for stable autoregressive generation exceeding 10000 seconds.
What would settle it
Generating a continuous 10000-second audio-driven avatar video, without any retraining, and checking whether identity stays consistent and no new visual artifacts appear over time.
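One way such a check could be scored is sketched below. It is a minimal illustration, not the paper's evaluation protocol: it assumes per-frame identity embeddings are already available as plain lists of floats (for example from a face recognition model; the simulated rebuttal mentions ArcFace cosine similarity), and the function names `identity_curve` and `drift_slope` are hypothetical. The idea is to average cosine similarity to a reference embedding over fixed windows and fit a line to the result; a slope near zero suggests no compounding drift.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def identity_curve(
    reference: Sequence[float],
    frame_embeddings: Sequence[Sequence[float]],
    window: int,
) -> List[float]:
    """Mean similarity to the reference over consecutive windows of frames."""
    sims = [cosine_similarity(reference, e) for e in frame_embeddings]
    curve = []
    for start in range(0, len(sims), window):
        chunk = sims[start:start + window]
        curve.append(sum(chunk) / len(chunk))
    return curve


def drift_slope(curve: Sequence[float]) -> float:
    """Least-squares slope of the curve against the window index."""
    n = len(curve)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(curve) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(curve))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


if __name__ == "__main__":
    reference = [1.0, 0.0]
    # Synthetic embeddings that rotate away from the reference over time.
    drifting = [[math.cos(0.1 * i), math.sin(0.1 * i)] for i in range(8)]
    print(drift_slope(identity_curve(reference, drifting, window=2)))
```

A clearly negative slope on a long rollout would count against the stability claim; what threshold separates noise from drift is a choice the evaluator would need to justify.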
Original abstract
Audio-driven avatar interaction demands real-time, streaming, and infinite-length generation -- capabilities fundamentally at odds with the sequential denoising and long-horizon drift of current diffusion models. We present Live Avatar, an algorithm-system co-designed framework that addresses both challenges for a 14-billion-parameter diffusion model. On the algorithm side, a two-stage pipeline distills a pretrained bidirectional model into a causal, few-step streaming one, while a set of complementary long-horizon strategies eliminate identity drift and visual artifacts, enabling stable autoregressive generation exceeding 10000 seconds. On the system side, Timestep-forcing Pipeline Parallelism (TPP) assigns each GPU a fixed denoising timestep, converting the sequential diffusion chain into an asynchronous spatial pipeline that simultaneously boosts throughput and improves temporal consistency. Live Avatar achieves 45 FPS with a TTFF of 1.21 s on 5 H800 GPUs, and to our knowledge is the first to enable practical real-time streaming of a 14B diffusion model for infinite-length avatar generation. We further introduce GenBench, a standardized long-form benchmark, to facilitate reproducible evaluation. Our project page is at https://liveavatar.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Live Avatar, a framework for real-time streaming audio-driven avatar generation of infinite length. It distills a 14B bidirectional diffusion model into a causal few-step model using a two-stage pipeline and introduces long-horizon strategies to prevent identity drift and artifacts for autoregressive generation beyond 10000 seconds. System-wise, Timestep-forcing Pipeline Parallelism (TPP) enables parallel processing on multiple GPUs. Reported performance is 45 FPS with 1.21 s TTFF on 5 H800 GPUs, and a new benchmark GenBench is introduced.
Significance. Should the long-horizon stability and performance claims be validated through detailed experiments, this would be a significant contribution to the field of real-time avatar animation and diffusion model deployment. It tackles the challenges of sequential denoising and drift in diffusion models through co-design, potentially opening avenues for interactive applications. The benchmark introduction is a positive step for the community.
major comments (2)
- §4 (Experiments and Long-horizon Evaluation): The central claim of stable autoregressive generation exceeding 10000 seconds relies on the long-horizon strategies eliminating identity drift. However, the provided details do not include quantitative long-horizon metrics (e.g., identity preservation scores or visual quality assessments over extended durations), which are necessary to confirm that per-step inconsistencies do not compound. This is load-bearing for the infinite-length assertion.
- §3.2 (Distillation Process): The two-stage distillation into a causal few-step streaming model is key to enabling real-time performance. Clarify how the distillation preserves the audio-driven conditioning and visual fidelity without introducing artifacts that could affect the subsequent long-horizon rollout.
minor comments (2)
- Abstract: The abstract states 'to our knowledge is the first', which is a strong claim; ensure the related work section provides a thorough comparison to prior streaming avatar methods to support this.
- Throughout: Ensure that all figures include clear captions and that any ablation studies on the long-horizon strategies are presented with specific quantitative improvements.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for highlighting the potential impact of our work. We address each major comment below and will strengthen the manuscript accordingly in the revision.
-
Referee: §4 (Experiments and Long-horizon Evaluation): The central claim of stable autoregressive generation exceeding 10000 seconds relies on the long-horizon strategies eliminating identity drift. However, the provided details do not include quantitative long-horizon metrics (e.g., identity preservation scores or visual quality assessments over extended durations), which are necessary to confirm that per-step inconsistencies do not compound. This is load-bearing for the infinite-length assertion.
Authors: We agree that quantitative long-horizon metrics are necessary to rigorously support the stability claims. The current manuscript emphasizes qualitative results and short-sequence metrics to illustrate the effectiveness of our strategies, but we acknowledge the need for extended evaluation. In the revised version, we will add Section 4.4 with quantitative metrics including face embedding similarity (e.g., ArcFace cosine similarity) and perceptual scores (LPIPS, FID) computed on sequences of increasing duration up to 10000 seconds. New plots will demonstrate that these metrics remain stable and do not show compounding degradation, directly addressing the concern about per-step inconsistencies.
Revision: yes
-
Referee: §3.2 (Distillation Process): The two-stage distillation into a causal few-step streaming model is key to enabling real-time performance. Clarify how the distillation preserves the audio-driven conditioning and visual fidelity without introducing artifacts that could affect the subsequent long-horizon rollout.
Authors: We will expand Section 3.2 with additional details on the distillation pipeline. The first stage adapts the bidirectional teacher to a causal model while preserving audio conditioning via consistent cross-attention between audio features and visual latents, trained with a combination of denoising and audio-visual alignment losses. The second stage applies few-step consistency distillation augmented with perceptual and synchronization objectives to maintain fidelity. We will include ablation results showing that short-sequence performance matches the teacher model and that no artifacts are introduced that propagate in long-horizon rollouts, as confirmed by our existing long-sequence qualitative evaluations.
Revision: yes
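The difference between a bidirectional and a causal model in the first distillation stage can be made concrete with attention masks. The sketch below is a generic illustration and an assumption on our part: the text above does not say whether causality is enforced per frame or per block of frames, so `block_size` is a hypothetical parameter, and `block_causal_mask` and `bidirectional_mask` are names invented for this example.

```python
from typing import List


def bidirectional_mask(num_frames: int) -> List[List[bool]]:
    """Every query frame may attend to every key frame (teacher-style attention)."""
    return [[True] * num_frames for _ in range(num_frames)]


def block_causal_mask(num_frames: int, block_size: int) -> List[List[bool]]:
    """mask[q][k] is True when query frame q may attend to key frame k.

    Frames are grouped into consecutive blocks of block_size. A frame sees
    its own block and all earlier blocks, never a later one, which is what
    allows output to be emitted block by block as a stream.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        [(k // block_size) <= (q // block_size) for k in range(num_frames)]
        for q in range(num_frames)
    ]


if __name__ == "__main__":
    # 6 frames in blocks of 2 are arbitrary illustrative values.
    for row in block_causal_mask(6, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

With `block_size=1` the mask reduces to the usual lower-triangular causal mask; larger blocks keep bidirectional attention inside each block while remaining streamable across blocks.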
Circularity Check
No significant circularity detected; claims rest on empirical engineering contributions
The paper describes a two-stage distillation pipeline, complementary long-horizon strategies, and Timestep-forcing Pipeline Parallelism (TPP) as independent algorithmic and system-level contributions. Performance results (45 FPS, 1.21 s TTFF) and the claim of stable autoregressive generation exceeding 10000 seconds are presented as outcomes of direct measurement on hardware, supported by the new GenBench benchmark. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations that reduce the central claims to their own inputs appear in the provided text. The derivation chain is therefore self-contained against external benchmarks and measurements.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Self-Forcing Distribution Matching Distillation to facilitate causal, streamable adaptation... Timestep-forcing Pipeline Parallelism (TPP) assigns each GPU a fixed denoising timestep... Rolling Sink Frame Mechanism (RSFM) dynamically recalibrates appearance using a cached reference image.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 8 Pith papers
-
CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives
CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.
-
Efficient Video Diffusion Models: Advancements and Challenges
A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
-
Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models
Forcing-KV applies head-specific static and dynamic pruning to KV caches in AR video diffusion models, achieving over 29 fps, 30% memory reduction, and up to 2.82x speedup at maintained quality.
-
LPM 1.0: Video-based Character Performance Model
LPM 1.0 generates infinite-length, identity-stable, real-time audio-visual conversational performances for single characters using a distilled causal diffusion transformer and a new benchmark.
-
Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms
Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.
-
Do Protective Perturbations Really Protect Portrait Privacy under Real-world Image Transformations?
Pixel-level protective perturbations for portrait privacy are ineffective against common image transformations, and a low-cost purification framework can strip them out.
-
OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.
-
EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation
EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consi...