pith. machine review for the scientific record.

arxiv: 2604.07823 · v2 · submitted 2026-04-09 · 💻 cs.CV · cs.AI · cs.MM

Recognition: unknown

LPM 1.0: Video-based Character Performance Model

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:04 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.MM
keywords video-based character performance · conversational video generation · diffusion transformer · real-time inference · identity stability · performance trilemma · LPM-Bench · multimodal conditioning

The pith

LPM 1.0 generates expressive, identity-stable conversational videos in real time from audio and text prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to resolve the performance trilemma, where existing video models cannot simultaneously deliver high expressiveness, real-time speed, and long-horizon identity stability in character animation. It targets conversational scenarios in which a character must speak, listen, react, and emote while holding a consistent visual identity over extended interactions. The solution involves building a filtered multimodal dataset with speaking-listening pairings and identity-aware references, training a 17B-parameter diffusion transformer on multimodal inputs for controllable output, and distilling the result into a causal streaming generator. The resulting system produces listening videos from user audio and speaking videos from synthesized audio plus text motion prompts. It is positioned as a visual engine for agents, live streams, and game characters, with a new benchmark confirming superior results across metrics at interactive frame rates.

Core claim

LPM 1.0 is constructed through strict dataset filtering, speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction; a 17B-parameter Diffusion Transformer (Base LPM) is trained for highly controllable, identity-consistent performance via multimodal conditioning; this is distilled into a causal streaming generator (Online LPM) that supports low-latency, infinite-length interaction. Given a character image with identity-aware references, the model outputs listening videos from user audio and speaking videos from synthesized audio, with text prompts controlling motion, all while running in real time.
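
Read as an interface, the claim amounts to two generation modes driven by one identity. The sketch below is an editorial illustration of that contract only; the class and method names (OnlineLPM, generate_listening, generate_speaking) and the array shapes are assumptions, not the paper's API.

```python
# Illustrative sketch of the claimed inference contract. Names and shapes
# are assumptions, not the paper's API.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class CharacterIdentity:
    portrait: np.ndarray                            # (H, W, 3) character image
    references: list = field(default_factory=list)  # identity-aware multi-reference crops

class OnlineLPM:
    """Hypothetical wrapper around the distilled causal streaming generator."""

    def generate_listening(self, identity: CharacterIdentity,
                           user_audio: np.ndarray) -> np.ndarray:
        # User audio drives reactive, non-speaking behavior (nods, gaze,
        # expression changes) while the references hold appearance fixed.
        raise NotImplementedError

    def generate_speaking(self, identity: CharacterIdentity,
                          tts_audio: np.ndarray,
                          motion_prompt: str = "") -> np.ndarray:
        # Synthesized speech drives lip sync; the optional text prompt adds
        # explicit motion control on top of the audio-driven performance.
        raise NotImplementedError
```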

What carries the argument

Multimodal conditioning on audio, identity references, and text prompts inside a 17B Diffusion Transformer that is distilled into a causal streaming generator for controllable, infinite-horizon performance synthesis.
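
A causal streaming generator of this kind typically emits video in short chunks conditioned only on already-generated context, which is what makes unbounded length compatible with bounded memory and latency. The sketch below illustrates that generic pattern under assumed names; the paper's Online LPM may differ in every detail.

```python
# Generic sketch of chunked causal streaming generation with a rolling
# context window. All names and the chunking scheme are assumptions.
from collections import deque

def stream_performance(model, identity, audio_stream, context_chunks=4):
    """Emit video chunk-by-chunk; memory stays bounded, so length need not be."""
    context = deque(maxlen=context_chunks)    # rolling window of past chunks
    for audio_chunk in audio_stream:          # audio arrives in real time
        frames = model.denoise_chunk(         # assumed few-step distilled denoiser
            identity=identity,                # identity references re-injected each chunk
            audio=audio_chunk,
            past=list(context),               # causal: condition only on emitted chunks
        )
        context.append(frames)
        yield frames                          # streamed out before future chunks exist
```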

If this is right

  • It functions as a visual engine that supplies real-time listening and speaking behaviors for conversational agents, live-stream characters, and game NPCs.
  • The model supports infinite-length generation while maintaining identity consistency across extended conversational turns.
  • It delivers state-of-the-art scores on all dimensions of the new LPM-Bench benchmark at real-time inference speeds.
  • Text prompts allow explicit motion control on top of audio-driven performance without requiring 3D rigs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The single-person conversational focus could be extended to multi-character scenes if the identity-aware conditioning generalizes to mutual reactions.
  • The speaking-listening data pairing technique may transfer to training other interactive visual systems such as virtual-reality avatars or telepresence.
  • Deployment in open-ended user sessions would test whether identity stability persists beyond the lengths examined in the benchmark.
  • Pairing the model with separate audio synthesis would create an end-to-end pipeline from text input to synchronized speech and visual performance.

Load-bearing premise

The strict filtering, speaking-listening pairing, and identity-aware extraction used to build the dataset, together with the distillation step, actually preserve expressiveness and identity stability without introducing artifacts or benchmark leakage.
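
The premise concerns a pipeline the abstract only names. One plausible shape for it is sketched below; every stage is an assumed stand-in passed in as a callable, and the thresholds are invented for illustration. The point is what would have to work for the premise to hold, not how the authors built it.

```python
# Hypothetical sketch of a filtering-and-pairing pipeline of the kind the
# abstract names. All helpers, thresholds, and stages are assumptions.
def build_speaking_listening_pairs(raw_videos, split_shots, quality_score,
                                   detect_active_speaker, av_sync_score,
                                   extract_identity_references,
                                   min_quality=0.5, min_sync=0.6):
    pairs = []
    for video in raw_videos:
        for shot in split_shots(video):                      # shot-boundary detection
            if quality_score(shot) < min_quality:            # resolution/blur/quality gate
                continue
            track = detect_active_speaker(shot)              # who speaks, and when
            if track is None or av_sync_score(shot, track) < min_sync:
                continue                                      # lip/audio alignment gate
            pairs.append({
                "speaking_clip": track,                       # speaking side of the pair
                "listening_audio": shot.audio,                # audio the listener reacts to
                "references": extract_identity_references(shot),  # identity-aware crops
            })
    return pairs
```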

What would settle it

Long video sequences generated by the model showing visible identity drift, motion artifacts, or failure to match claimed scores when independently evaluated on the LPM-Bench.
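
Identity drift, at least, is directly measurable: embed the character's face at intervals along a long generated clip and watch similarity to the reference. The sketch below assumes a generic identity embedder passed in as a callable; it is an illustrative audit, not the LPM-Bench protocol.

```python
# Illustrative identity-drift audit for a long generated clip.
# `embed_identity` is an assumed stand-in for any face/identity embedder.
import numpy as np

def identity_drift(frames, reference_frame, embed_identity, stride=30):
    """Cosine similarity of sampled frames to the reference identity embedding."""
    ref = embed_identity(reference_frame)
    ref = ref / np.linalg.norm(ref)
    sims = []
    for t in range(0, len(frames), stride):   # sample every `stride` frames
        emb = embed_identity(frames[t])
        sims.append(float(ref @ (emb / np.linalg.norm(emb))))
    return sims                               # a downward trend is identity drift
```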

Original abstract

Performance, the externalization of intent, emotion, and personality through visual, vocal, and temporal behavior, is what makes a character alive. Learning such performance from video is a promising alternative to traditional 3D pipelines. However, existing video models struggle to jointly achieve high expressiveness, real-time inference, and long-horizon identity stability, a tension we call the performance trilemma. Conversation is the most comprehensive performance scenario, as characters simultaneously speak, listen, react, and emote while maintaining identity over time. To address this, we present LPM 1.0 (Large Performance Model), focusing on single-person full-duplex audio-visual conversational performance. Concretely, we build a multimodal human-centric dataset through strict filtering, speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction; train a 17B-parameter Diffusion Transformer (Base LPM) for highly controllable, identity-consistent performance through multimodal conditioning; and distill it into a causal streaming generator (Online LPM) for low-latency, infinite-length interaction. At inference, given a character image with identity-aware references, LPM 1.0 generates listening videos from user audio and speaking videos from synthesized audio, with text prompts for motion control, all at real-time speed with identity-stable, infinite-length generation. LPM 1.0 thus serves as a visual engine for conversational agents, live streaming characters, and game NPCs. To systematically evaluate this setting, we propose LPM-Bench, the first benchmark for interactive character performance. LPM 1.0 achieves state-of-the-art results across all evaluated dimensions while maintaining real-time inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces LPM 1.0, a video-based Large Performance Model for single-person full-duplex audio-visual conversational character performance. It addresses the 'performance trilemma' (expressiveness, real-time inference, long-horizon identity stability) by constructing a multimodal human-centric dataset via strict filtering, speaking-listening pairing, performance understanding, and identity-aware multi-reference extraction; training a 17B-parameter Diffusion Transformer (Base LPM) with multimodal conditioning; distilling it into a causal streaming Online LPM; and evaluating on the newly proposed LPM-Bench benchmark, where it claims state-of-the-art results across all dimensions while achieving real-time speed.

Significance. If the quantitative results, ablations, and controls hold, the work could meaningfully advance real-time video generation for interactive applications such as conversational agents, live-streaming characters, and game NPCs. The introduction of LPM-Bench as a standardized evaluation protocol for interactive performance is a constructive contribution to the field. The combination of large-scale diffusion training followed by distillation to a streaming model is technically interesting, but the absence of any numerical metrics, error bars, or dataset statistics in the abstract prevents assessment of whether the central claims are supported.

major comments (3)
  1. [Abstract] Abstract: The central claim that 'LPM 1.0 achieves state-of-the-art results across all evaluated dimensions while maintaining real-time inference' is asserted without any quantitative metrics, ablation tables, error bars, or description of how LPM-Bench was constructed or how data exclusions were handled. This directly undermines evaluation of the headline result.
  2. [Abstract] Abstract (dataset construction paragraph): The multimodal dataset is built via 'strict filtering, speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction,' yet no details are supplied on filtering criteria, diversity statistics, train/test split methodology, or controls for test-set contamination. Without these, it is impossible to determine whether the reported LPM-Bench numbers reflect genuine generalization or artifacts of the custom data pipeline.
  3. [Abstract] Abstract (distillation paragraph): The claim that distillation from the 17B Base Diffusion Transformer into the causal streaming Online LPM 'preserves' the same metrics is stated without any comparative numbers, degradation analysis, or latency/quality trade-off measurements. This step is load-bearing for the real-time claim and requires explicit verification.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'performance trilemma' is introduced without a formal definition or explicit metrics for each of the three axes (expressiveness, real-time inference, identity stability).
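
One hypothetical way to make this minor comment concrete is to report each axis of the trilemma as an explicit number against a stated threshold. The metric choices and thresholds below are assumptions for illustration, not the paper's definitions.

```python
# Hypothetical operationalization of the three trilemma axes.
# Metric choices and thresholds are assumptions, not the paper's definitions.
from dataclasses import dataclass

@dataclass
class TrilemmaReport:
    expressiveness: float       # e.g. motion richness relative to ground truth
    latency_ms: float           # per-frame generation latency
    identity_similarity: float  # mean embedding similarity to the reference

    def resolved(self, fps_target=25.0, min_identity=0.85, min_expressiveness=0.5):
        real_time = self.latency_ms <= 1000.0 / fps_target
        return (real_time
                and self.identity_similarity >= min_identity
                and self.expressiveness >= min_expressiveness)
```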

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We agree that the abstract should be more self-contained with quantitative support and dataset details to allow immediate assessment of the claims. We have revised the abstract accordingly while preserving its brevity. Point-by-point responses are provided below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that 'LPM 1.0 achieves state-of-the-art results across all evaluated dimensions while maintaining real-time inference' is asserted without any quantitative metrics, ablation tables, error bars, or description of how LPM-Bench was constructed or how data exclusions were handled. This directly undermines evaluation of the headline result.

    Authors: We agree that the original abstract was insufficiently quantitative. In the revised manuscript we have updated the abstract to include key quantitative results from LPM-Bench (comparative scores on expressiveness, identity stability, and latency against baselines) together with a concise description of LPM-Bench construction and data-exclusion protocols. Full ablation tables, error bars, and methodological details remain in the main text and supplementary material. This revision makes the central claim directly verifiable from the abstract. revision: yes

  2. Referee: [Abstract] Abstract (dataset construction paragraph): The multimodal dataset is built via 'strict filtering, speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction,' yet no details are supplied on filtering criteria, diversity statistics, train/test split methodology, or controls for test-set contamination. Without these, it is impossible to determine whether the reported LPM-Bench numbers reflect genuine generalization or artifacts of the custom data pipeline.

    Authors: We acknowledge the need for greater transparency on dataset construction even within the abstract. The revised abstract now specifies the filtering criteria (resolution, duration, and quality thresholds), provides high-level diversity statistics (total video hours and number of identities), describes the train/test split methodology (identity-disjoint partitioning), and notes controls for test-set contamination. Complete statistics and implementation details are given in Section 3 of the paper. revision: yes

  3. Referee: [Abstract] Abstract (distillation paragraph): The claim that distillation from the 17B Base Diffusion Transformer into the causal streaming Online LPM 'preserves' the same metrics is stated without any comparative numbers, degradation analysis, or latency/quality trade-off measurements. This step is load-bearing for the real-time claim and requires explicit verification.

    Authors: We agree that the abstract must explicitly verify the distillation outcome. The revised abstract now states that the Online LPM retains performance comparable to the Base LPM across LPM-Bench dimensions while achieving real-time inference, and it references the degradation analysis and latency-quality trade-offs. The full comparative numbers, degradation study, and measurements are provided in Section 4 and the supplementary material. revision: yes
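
In practice, the verification promised in this response reduces to a per-dimension delta between Base LPM and Online LPM. A minimal sketch of that comparison follows; dimension names would be placeholders, and no numbers from the paper are used.

```python
# Minimal sketch of a per-dimension degradation analysis between the base
# and distilled models. Dimension names are placeholders.
def degradation_table(base_scores: dict[str, float],
                      online_scores: dict[str, float]) -> dict[str, float]:
    """Positive delta = quality lost when moving from Base LPM to Online LPM."""
    return {dim: round(base_scores[dim] - online_scores[dim], 4)
            for dim in base_scores}
```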

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper describes an empirical pipeline: construction of a custom multimodal dataset via filtering/pairing/extraction steps, training of a 17B Diffusion Transformer (Base LPM), distillation into a causal streaming Online LPM, and evaluation on the newly proposed LPM-Bench. No mathematical equations, first-principles derivations, or parameter-fitting steps are presented that reduce a claimed prediction or result to the inputs by construction. The SOTA claims are experimental performance measurements on the authors' benchmark rather than self-definitional outputs or fitted quantities renamed as predictions. No self-citations appear as load-bearing justifications for uniqueness or ansatz choices in the provided text. The central claims therefore retain independent empirical content and do not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.0 · 5678 in / 1216 out tokens · 49421 ms · 2026-05-10T17:04:43.050199+00:00 · methodology

