pith. machine review for the scientific record.

arxiv: 2604.07823 · v2 · submitted 2026-04-09 · 💻 cs.CV · cs.AI · cs.MM

Recognition: unknown

LPM 1.0: Video-based Character Performance Model

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:04 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.MM
keywords video-based character performance · conversational video generation · diffusion transformer · real-time inference · identity stability · performance trilemma · LPM-Bench · multimodal conditioning

The pith

LPM 1.0 generates expressive, identity-stable conversational videos in real time from audio and text prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to resolve the performance trilemma, where existing video models cannot simultaneously deliver high expressiveness, real-time speed, and long-horizon identity stability in character animation. It targets conversational scenarios in which a character must speak, listen, react, and emote while holding a consistent visual identity over extended interactions. The solution involves building a filtered multimodal dataset with speaking-listening pairings and identity-aware references, training a 17B-parameter diffusion transformer on multimodal inputs for controllable output, and distilling the result into a causal streaming generator. The resulting system produces listening videos from user audio and speaking videos from synthesized audio plus text motion prompts. It is positioned as a visual engine for agents, live streams, and game characters, with a new benchmark confirming superior results across metrics at interactive frame rates.

Core claim

LPM 1.0 is constructed through strict dataset filtering, speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction; a 17B-parameter Diffusion Transformer (Base LPM) is trained for highly controllable, identity-consistent performance via multimodal conditioning; this is distilled into a causal streaming generator (Online LPM) that supports low-latency, infinite-length interaction. Given a character image with identity-aware references, the model outputs listening videos from user audio and speaking videos from synthesized audio, with text prompts controlling motion, all while running in real time.
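
Read as an interface, the claim amounts to two generation modes driven by one identity. The sketch below is an editorial illustration of that contract only; the class and method names (OnlineLPM, generate_listening, generate_speaking) and the array shapes are assumptions, not the paper's API.

```python
# Illustrative sketch of the claimed inference contract. Names and shapes
# are assumptions, not the paper's API.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class CharacterIdentity:
    portrait: np.ndarray                            # (H, W, 3) character image
    references: list = field(default_factory=list)  # identity-aware multi-reference crops

class OnlineLPM:
    """Hypothetical wrapper around the distilled causal streaming generator."""

    def generate_listening(self, identity: CharacterIdentity,
                           user_audio: np.ndarray) -> np.ndarray:
        # User audio drives reactive, non-speaking behavior (nods, gaze,
        # expression changes) while the references hold appearance fixed.
        raise NotImplementedError

    def generate_speaking(self, identity: CharacterIdentity,
                          tts_audio: np.ndarray,
                          motion_prompt: str = "") -> np.ndarray:
        # Synthesized speech drives lip sync; the optional text prompt adds
        # explicit motion control on top of the audio-driven performance.
        raise NotImplementedError
```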

What carries the argument

Multimodal conditioning on audio, identity references, and text prompts inside a 17B Diffusion Transformer that is distilled into a causal streaming generator for controllable, infinite-horizon performance synthesis.
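
A causal streaming generator of this kind typically emits video in short chunks conditioned only on already-generated context, which is what makes unbounded length compatible with bounded memory and latency. The sketch below illustrates that generic pattern under assumed names; the paper's Online LPM may differ in every detail.

```python
# Generic sketch of chunked causal streaming generation with a rolling
# context window. All names and the chunking scheme are assumptions.
from collections import deque

def stream_performance(model, identity, audio_stream, context_chunks=4):
    """Emit video chunk-by-chunk; memory stays bounded, so length need not be."""
    context = deque(maxlen=context_chunks)    # rolling window of past chunks
    for audio_chunk in audio_stream:          # audio arrives in real time
        frames = model.denoise_chunk(         # assumed few-step distilled denoiser
            identity=identity,                # identity references re-injected each chunk
            audio=audio_chunk,
            past=list(context),               # causal: condition only on emitted chunks
        )
        context.append(frames)
        yield frames                          # streamed out before future chunks exist
```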

If this is right

  • It functions as a visual engine that supplies real-time listening and speaking behaviors for conversational agents, live-stream characters, and game NPCs.
  • The model supports infinite-length generation while maintaining identity consistency across extended conversational turns.
  • It delivers state-of-the-art scores on all dimensions of the new LPM-Bench benchmark at real-time inference speeds.
  • Text prompts allow explicit motion control on top of audio-driven performance without requiring 3D rigs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The single-person conversational focus could be extended to multi-character scenes if the identity-aware conditioning generalizes to mutual reactions.
  • The speaking-listening data pairing technique may transfer to training other interactive visual systems such as virtual-reality avatars or telepresence.
  • Deployment in open-ended user sessions would test whether identity stability persists beyond the lengths examined in the benchmark.
  • Pairing the model with separate audio synthesis would create an end-to-end pipeline from text input to synchronized speech and visual performance.

Load-bearing premise

The strict filtering, speaking-listening pairing, and identity-aware extraction used to build the dataset, together with the distillation step, actually preserve expressiveness and identity stability without introducing artifacts or benchmark leakage.
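
The premise concerns a pipeline the abstract only names. One plausible shape for it is sketched below; every stage is an assumed stand-in passed in as a callable, and the thresholds are invented for illustration. The point is what would have to work for the premise to hold, not how the authors built it.

```python
# Hypothetical sketch of a filtering-and-pairing pipeline of the kind the
# abstract names. All helpers, thresholds, and stages are assumptions.
def build_speaking_listening_pairs(raw_videos, split_shots, quality_score,
                                   detect_active_speaker, av_sync_score,
                                   extract_identity_references,
                                   min_quality=0.5, min_sync=0.6):
    pairs = []
    for video in raw_videos:
        for shot in split_shots(video):                      # shot-boundary detection
            if quality_score(shot) < min_quality:            # resolution/blur/quality gate
                continue
            track = detect_active_speaker(shot)              # who speaks, and when
            if track is None or av_sync_score(shot, track) < min_sync:
                continue                                      # lip/audio alignment gate
            pairs.append({
                "speaking_clip": track,                       # speaking side of the pair
                "listening_audio": shot.audio,                # audio the listener reacts to
                "references": extract_identity_references(shot),  # identity-aware crops
            })
    return pairs
```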

What would settle it

Long video sequences generated by the model showing visible identity drift, motion artifacts, or failure to match claimed scores when independently evaluated on the LPM-Bench.
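
Identity drift, at least, is directly measurable: embed the character's face at intervals along a long generated clip and watch similarity to the reference. The sketch below assumes a generic identity embedder passed in as a callable; it is an illustrative audit, not the LPM-Bench protocol.

```python
# Illustrative identity-drift audit for a long generated clip.
# `embed_identity` is an assumed stand-in for any face/identity embedder.
import numpy as np

def identity_drift(frames, reference_frame, embed_identity, stride=30):
    """Cosine similarity of sampled frames to the reference identity embedding."""
    ref = embed_identity(reference_frame)
    ref = ref / np.linalg.norm(ref)
    sims = []
    for t in range(0, len(frames), stride):   # sample every `stride` frames
        emb = embed_identity(frames[t])
        sims.append(float(ref @ (emb / np.linalg.norm(emb))))
    return sims                               # a downward trend is identity drift
```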

Original abstract

Performance, the externalization of intent, emotion, and personality through visual, vocal, and temporal behavior, is what makes a character alive. Learning such performance from video is a promising alternative to traditional 3D pipelines. However, existing video models struggle to jointly achieve high expressiveness, real-time inference, and long-horizon identity stability, a tension we call the performance trilemma. Conversation is the most comprehensive performance scenario, as characters simultaneously speak, listen, react, and emote while maintaining identity over time. To address this, we present LPM 1.0 (Large Performance Model), focusing on single-person full-duplex audio-visual conversational performance. Concretely, we build a multimodal human-centric dataset through strict filtering, speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction; train a 17B-parameter Diffusion Transformer (Base LPM) for highly controllable, identity-consistent performance through multimodal conditioning; and distill it into a causal streaming generator (Online LPM) for low-latency, infinite-length interaction. At inference, given a character image with identity-aware references, LPM 1.0 generates listening videos from user audio and speaking videos from synthesized audio, with text prompts for motion control, all at real-time speed with identity-stable, infinite-length generation. LPM 1.0 thus serves as a visual engine for conversational agents, live streaming characters, and game NPCs. To systematically evaluate this setting, we propose LPM-Bench, the first benchmark for interactive character performance. LPM 1.0 achieves state-of-the-art results across all evaluated dimensions while maintaining real-time inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces LPM 1.0, a video-based Large Performance Model for single-person full-duplex audio-visual conversational character performance. It addresses the 'performance trilemma' (expressiveness, real-time inference, long-horizon identity stability) by constructing a multimodal human-centric dataset via strict filtering, speaking-listening pairing, performance understanding, and identity-aware multi-reference extraction; training a 17B-parameter Diffusion Transformer (Base LPM) with multimodal conditioning; distilling it into a causal streaming Online LPM; and evaluating on the newly proposed LPM-Bench benchmark, where it claims state-of-the-art results across all dimensions while achieving real-time speed.

Significance. If the quantitative results, ablations, and controls hold, the work could meaningfully advance real-time video generation for interactive applications such as conversational agents, live-streaming characters, and game NPCs. The introduction of LPM-Bench as a standardized evaluation protocol for interactive performance is a constructive contribution to the field. The combination of large-scale diffusion training followed by distillation to a streaming model is technically interesting, but the absence of any numerical metrics, error bars, or dataset statistics in the abstract prevents assessment of whether the central claims are supported.

major comments (3)
  1. [Abstract] Abstract: The central claim that 'LPM 1.0 achieves state-of-the-art results across all evaluated dimensions while maintaining real-time inference' is asserted without any quantitative metrics, ablation tables, error bars, or description of how LPM-Bench was constructed or how data exclusions were handled. This directly undermines evaluation of the headline result.
  2. [Abstract] Abstract (dataset construction paragraph): The multimodal dataset is built via 'strict filtering, speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction,' yet no details are supplied on filtering criteria, diversity statistics, train/test split methodology, or controls for test-set contamination. Without these, it is impossible to determine whether the reported LPM-Bench numbers reflect genuine generalization or artifacts of the custom data pipeline.
  3. [Abstract] Abstract (distillation paragraph): The claim that distillation from the 17B Base Diffusion Transformer into the causal streaming Online LPM 'preserves' the same metrics is stated without any comparative numbers, degradation analysis, or latency/quality trade-off measurements. This step is load-bearing for the real-time claim and requires explicit verification.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'performance trilemma' is introduced without a formal definition or explicit metrics for each of the three axes (expressiveness, real-time inference, identity stability).
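
One hypothetical way to make this minor comment concrete is to report each axis of the trilemma as an explicit number against a stated threshold. The metric choices and thresholds below are assumptions for illustration, not the paper's definitions.

```python
# Hypothetical operationalization of the three trilemma axes.
# Metric choices and thresholds are assumptions, not the paper's definitions.
from dataclasses import dataclass

@dataclass
class TrilemmaReport:
    expressiveness: float       # e.g. motion richness relative to ground truth
    latency_ms: float           # per-frame generation latency
    identity_similarity: float  # mean embedding similarity to the reference

    def resolved(self, fps_target=25.0, min_identity=0.85, min_expressiveness=0.5):
        real_time = self.latency_ms <= 1000.0 / fps_target
        return (real_time
                and self.identity_similarity >= min_identity
                and self.expressiveness >= min_expressiveness)
```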

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We agree that the abstract should be more self-contained with quantitative support and dataset details to allow immediate assessment of the claims. We have revised the abstract accordingly while preserving its brevity. Point-by-point responses are provided below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that 'LPM 1.0 achieves state-of-the-art results across all evaluated dimensions while maintaining real-time inference' is asserted without any quantitative metrics, ablation tables, error bars, or description of how LPM-Bench was constructed or how data exclusions were handled. This directly undermines evaluation of the headline result.

    Authors: We agree that the original abstract was insufficiently quantitative. In the revised manuscript we have updated the abstract to include key quantitative results from LPM-Bench (comparative scores on expressiveness, identity stability, and latency against baselines) together with a concise description of LPM-Bench construction and data-exclusion protocols. Full ablation tables, error bars, and methodological details remain in the main text and supplementary material. This revision makes the central claim directly verifiable from the abstract. revision: yes

  2. Referee: [Abstract] Abstract (dataset construction paragraph): The multimodal dataset is built via 'strict filtering, speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction,' yet no details are supplied on filtering criteria, diversity statistics, train/test split methodology, or controls for test-set contamination. Without these, it is impossible to determine whether the reported LPM-Bench numbers reflect genuine generalization or artifacts of the custom data pipeline.

    Authors: We acknowledge the need for greater transparency on dataset construction even within the abstract. The revised abstract now specifies the filtering criteria (resolution, duration, and quality thresholds), provides high-level diversity statistics (total video hours and number of identities), describes the train/test split methodology (identity-disjoint partitioning), and notes controls for test-set contamination. Complete statistics and implementation details are given in Section 3 of the paper. revision: yes

  3. Referee: [Abstract] Abstract (distillation paragraph): The claim that distillation from the 17B Base Diffusion Transformer into the causal streaming Online LPM 'preserves' the same metrics is stated without any comparative numbers, degradation analysis, or latency/quality trade-off measurements. This step is load-bearing for the real-time claim and requires explicit verification.

    Authors: We agree that the abstract must explicitly verify the distillation outcome. The revised abstract now states that the Online LPM retains performance comparable to the Base LPM across LPM-Bench dimensions while achieving real-time inference, and it references the degradation analysis and latency-quality trade-offs. The full comparative numbers, degradation study, and measurements are provided in Section 4 and the supplementary material. revision: yes
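
In practice, the verification promised in this response reduces to a per-dimension delta between Base LPM and Online LPM. A minimal sketch of that comparison follows; dimension names would be placeholders, and no numbers from the paper are used.

```python
# Minimal sketch of a per-dimension degradation analysis between the base
# and distilled models. Dimension names are placeholders.
def degradation_table(base_scores: dict[str, float],
                      online_scores: dict[str, float]) -> dict[str, float]:
    """Positive delta = quality lost when moving from Base LPM to Online LPM."""
    return {dim: round(base_scores[dim] - online_scores[dim], 4)
            for dim in base_scores}
```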

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper describes an empirical pipeline: construction of a custom multimodal dataset via filtering/pairing/extraction steps, training of a 17B Diffusion Transformer (Base LPM), distillation into a causal streaming Online LPM, and evaluation on the newly proposed LPM-Bench. No mathematical equations, first-principles derivations, or parameter-fitting steps are presented that reduce a claimed prediction or result to the inputs by construction. The SOTA claims are experimental performance measurements on the authors' benchmark rather than self-definitional outputs or fitted quantities renamed as predictions. No self-citations appear as load-bearing justifications for uniqueness or ansatz choices in the provided text. The central claims therefore retain independent empirical content and do not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.0 · 5678 in / 1216 out tokens · 49421 ms · 2026-05-10T17:04:43.050199+00:00 · methodology

