pith. machine review for the scientific record.

arxiv: 2604.02908 · v2 · submitted 2026-04-03 · 💻 cs.CV · cs.HC · cs.MM

Recognition: no theorem link

SentiAvatar: Towards Expressive and Interactive Digital Humans

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:11 UTC · model grok-4.3

classification 💻 cs.CV · cs.HC · cs.MM
keywords SentiAvatar · digital humans · speech-driven animation · 3D motion generation · real-time avatar · multimodal dialogue dataset · motion foundation model · prosody synchronization

The pith

SentiAvatar generates real-time 3D digital humans that speak, gesture, and emote by decoupling semantic planning from prosody interpolation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a complete pipeline for interactive digital humans that must solve data scarcity, semantic-to-motion mapping, and frame-level audio synchronization at once. It releases SuSuInterActs, a 37-hour mocap dialogue corpus with speech, full-body motion, and facial expressions captured around one character. A motion foundation model is pre-trained on over 200,000 sequences to supply broad action priors; an audio-aware plan-then-infill network then separates sentence-level intent planning from frame-level rhythm interpolation. The resulting system produces six seconds of output in 0.3 seconds, supports unlimited multi-turn streaming, and records large gains on both the new dataset and BEATv2.

Core claim

SentiAvatar is an end-to-end framework that first assembles SuSuInterActs, a 21K-clip dialogue corpus, pre-trains a Motion Foundation Model on 200K+ motion sequences, and then applies an audio-aware plan-then-infill architecture. Sentence-level semantic planning selects appropriate gestures and expressions while frame-level prosody interpolation aligns motion timing and dynamics to the incoming speech waveform, yielding motions that are both contextually appropriate and rhythmically natural.
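
To make the decoupling concrete, here is a minimal Python sketch of how a plan-then-infill loop could be organized. The class names, tensor shapes, and the linear-interpolation stand-in for the learned infiller are illustrative assumptions, not the released SentiAvatar code.

```python
import numpy as np

# Minimal plan-then-infill sketch. Interfaces and shapes are assumptions
# chosen for illustration only.

class SemanticPlanner:
    """Sentence-level stage: maps text plus coarse audio features to sparse keyframes."""
    def plan(self, sentence_text: str, sentence_audio: np.ndarray) -> np.ndarray:
        num_keyframes, pose_dim = 4, 165  # placeholder sizes, not from the paper
        # A trained network (built on the motion foundation model's priors)
        # would go here; zeros keep the sketch runnable.
        return np.zeros((num_keyframes, pose_dim))

class ProsodyInfiller:
    """Frame-level stage: fills frames between keyframes, conditioned on prosody."""
    def infill(self, keyframes: np.ndarray, prosody: np.ndarray) -> np.ndarray:
        num_frames = prosody.shape[0]
        # Linear interpolation stands in for the learned audio-aware infilling.
        idx = np.linspace(0.0, len(keyframes) - 1, num_frames)
        lo, hi = np.floor(idx).astype(int), np.ceil(idx).astype(int)
        w = (idx - lo)[:, None]
        return (1.0 - w) * keyframes[lo] + w * keyframes[hi]

def generate_turn(sentences, planner, infiller):
    """Stream a dialogue turn: plan once per sentence, infill once per frame."""
    chunks = []
    for text, audio_feats, prosody_feats in sentences:
        keyframes = planner.plan(text, audio_feats)         # sentence-level intent
        frames = infiller.infill(keyframes, prosody_feats)  # frame-level rhythm
        chunks.append(frames)                               # each chunk can be emitted immediately
    return np.concatenate(chunks, axis=0)
```

Because only the infiller touches frame-level prosody, each sentence's chunk can be emitted as soon as its audio arrives, which is the property that would make unlimited multi-turn streaming plausible.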

What carries the argument

The audio-aware plan-then-infill architecture that decouples sentence-level semantic planning from frame-level prosody-driven interpolation.

If this is right

  • The system reaches R@1 of 43.64 percent on SuSuInterActs, nearly twice the best reported baseline.
  • It records FGD of 4.941 and BC of 8.078 on BEATv2 while generating six seconds of motion in 0.3 seconds (a sketch of the FGD computation follows this list).
  • Unlimited multi-turn streaming becomes feasible because the architecture separates long-horizon planning from local interpolation.
  • The same pre-trained motion priors can be reused for non-conversational actions beyond the dialogue domain.
  • The released dataset and model weights enable direct replication and extension by other researchers.
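
For readers unfamiliar with the metrics above, FGD (Fréchet Gesture Distance) is the Fréchet distance between Gaussians fitted to features of real and generated motion, analogous to FID for images. Below is a minimal sketch of that computation; the feature extractor that produces the per-clip embeddings is assumed and not shown, and BC (beat consistency) is omitted.

```python
import numpy as np
from scipy import linalg

def frechet_gesture_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to real and generated motion features.

    Both inputs are (num_clips, feature_dim) arrays from a pretrained motion
    feature extractor (the extractor itself is assumed and not shown here).
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)

    # Matrix square root of the covariance product; tiny imaginary parts from
    # numerical error are discarded.
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```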

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar plan-then-infill separation may improve other conditional motion tasks such as music-driven dance or sign-language synthesis.
  • The 37-hour single-character corpus could serve as a seed for few-shot personalization of new avatars.
  • Real-time performance opens direct integration into live virtual agents or game engines without offline rendering.

Load-bearing premise

The pre-trained motion foundation model supplies useful priors for conversational gestures, and the decoupled planning-infill step produces coherent motions without introducing artifacts or losing naturalness.

What would settle it

A side-by-side evaluation in which human raters consistently judge SentiAvatar motions as less natural or less speech-synchronized than the best baseline on held-out multi-turn dialogues.

Figures

Figures reproduced from arXiv: 2604.02908 by Chuhao Jin, Dayu Wu, Haoyu Shi, Qingzhe Gao, Ruihua Song, Rui Zhang, Yichen Jiang, Yihan Wu.

Figure 1: SentiAvatar generates high-quality 3D human motion and expression, which are semanti…
Figure 2: Overview of SuSuInterActs data pipeline. (1) Character design for a consistent persona. (2) …
Figure 3: Overview of SentiAvatar. (a) Multi-modal inputs are quantized into tokens via encoders. …
Figure 4: Qualitative comparison of generated motions across methods. Each row shows keyframe…
read the original abstract

We present SentiAvatar, a framework for building expressive interactive 3D digital humans, and use it to create SuSu, a virtual character that speaks, gestures, and emotes in real time. Achieving such a system remains challenging, as it requires jointly addressing three key problems: the lack of large-scale, high-quality multimodal data, robust semantic-to-motion mapping, and fine-grained frame-level motion-prosody synchronization. To solve these problems, first, we build SuSuInterActs (21K clips, 37 hours), a dialogue corpus captured via optical motion capture around a single character with synchronized speech, full-body motion, and facial expressions. Second, we pre-train a Motion Foundation Model on 200K+ motion sequences, equipping it with rich action priors that go well beyond the conversation. We then propose an audio-aware plan-then-infill architecture that decouples sentence-level semantic planning from frame-level prosody-driven interpolation, so that generated motions are both semantically appropriate and rhythmically aligned with speech. Experiments show that SentiAvatar achieves state-of-the-art on both SuSuInterActs (R@1 43.64%, nearly 2 times the best baseline) and BEATv2 (FGD 4.941, BC 8.078), producing 6s of output in 0.3s with unlimited multi-turn streaming. The source code, model, and dataset are available at https://sentiavatar.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents SentiAvatar, a framework for expressive interactive 3D digital humans. It introduces the SuSuInterActs dataset (21K clips, 37 hours) with synchronized speech, full-body motion, and facial expressions; pre-trains a Motion Foundation Model on 200K+ sequences; and proposes an audio-aware plan-then-infill architecture that decouples sentence-level semantic planning from frame-level prosody-driven interpolation. Experiments claim state-of-the-art results on SuSuInterActs (R@1 43.64%, nearly 2x best baseline) and BEATv2 (FGD 4.941, BC 8.078), with real-time performance (6s output in 0.3s) and unlimited multi-turn streaming support. Code, model, and dataset are released.

Significance. If the empirical results hold after addressing evaluation gaps, the work would advance real-time conversational avatar systems by releasing a large-scale multimodal dialogue corpus, rich motion priors from a foundation model, and a decoupled architecture for semantic-rhythmic alignment. The open-sourcing of resources is a clear strength that supports reproducibility and extension in the field.

major comments (3)
  1. [§5 Experiments, §5.1] The SOTA claims (R@1 43.64% on SuSuInterActs, nearly 2x baseline; FGD/BC on BEATv2) are reported without details on baseline re-implementations, hyper-parameter choices, training data splits, or potential data leakage between SuSuInterActs and the 200K+ pre-training sequences. This leaves the central performance gains only moderately supported.
  2. [§4.2 Architecture, §5.2] The plan-then-infill decoupling is presented as solving semantic appropriateness and rhythmic alignment, yet no quantitative metrics or ablations evaluate cross-turn coherence, sentence-boundary discontinuities, foot-sliding, or streaming artifacts. The reported R@1/FGD/BC scores do not directly test the weakest assumption for the multi-turn interactive claim.
  3. [§4.3 Ablations, Table 2] Ablation studies on the contribution of the pre-trained Motion Foundation Model priors versus the infill module are missing; without them, it is unclear whether the reported gains stem from the proposed decoupling or from the foundation model alone.
minor comments (2)
  1. [Figure 4, §4.1] The caption and text do not specify the exact input/output dimensions or conditioning signals for the semantic planner versus the prosody interpolator.
  2. [§3.2] Notation for motion sequences (e.g., use of M_t vs. P_t) is introduced without a consolidated table of symbols.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's detailed feedback and constructive suggestions. We address each major comment below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§5 Experiments, §5.1] The SOTA claims (R@1 43.64% on SuSuInterActs, nearly 2x baseline; FGD/BC on BEATv2) are reported without details on baseline re-implementations, hyper-parameter choices, training data splits, or potential data leakage between SuSuInterActs and the 200K+ pre-training sequences. This leaves the central performance gains only moderately supported.

    Authors: We agree that additional details are necessary to fully support the SOTA claims. In the revised manuscript, we will expand §5.1 to include comprehensive information on baseline re-implementations, specific hyper-parameter choices, and the exact training/validation/test splits used. For data leakage, we will clarify that the pre-training sequences come from established public motion datasets (e.g., AMASS, HumanML3D) that do not overlap with our newly captured SuSuInterActs dataset, which was recorded in a controlled setting with a specific character. This will be explicitly documented to ensure transparency. revision: yes

  2. Referee: [§4.2 Architecture, §5.2] The plan-then-infill decoupling is presented as solving semantic appropriateness and rhythmic alignment, yet no quantitative metrics or ablations evaluate cross-turn coherence, sentence-boundary discontinuities, foot-sliding, or streaming artifacts. The reported R@1/FGD/BC scores do not directly test the weakest assumption for the multi-turn interactive claim.

    Authors: We acknowledge that the current evaluation focuses on per-clip metrics, which primarily assess semantic appropriateness (via R@1) and rhythmic alignment (via FGD/BC). However, the plan-then-infill design inherently supports multi-turn coherence by planning at the sentence level and infilling frames consistently. To address this, we will add new quantitative evaluations in the revised §5.2, including metrics for cross-turn coherence (e.g., motion continuity scores across turns) and analysis of sentence-boundary discontinuities. We will also report on foot-sliding artifacts and streaming performance in multi-turn scenarios (an illustrative sketch of one such foot-sliding measure follows these responses). While the real-time streaming capability is demonstrated qualitatively in the supplementary video, we will include supporting quantitative results. revision: partial

  3. Referee: [§4.3 Ablations, Table 2] Ablation studies on the contribution of the pre-trained Motion Foundation Model priors versus the infill module are missing; without them, it is unclear whether the reported gains stem from the proposed decoupling or from the foundation model alone.

    Authors: We agree that isolating the contributions is important. In the revised version, we will expand the ablation studies in §4.3 and Table 2 to include experiments that compare: (1) the full model, (2) the model without the pre-trained Motion Foundation Model (using random initialization), and (3) variants without the infill module. This will clarify the role of each component in achieving the reported performance gains. revision: yes
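
As one concrete example of the artifact metrics promised in response 2, foot sliding is commonly measured as horizontal foot displacement during frames judged to be in ground contact. The sketch below uses illustrative joint indices and a contact-height threshold that are assumptions, not values from the paper.

```python
import numpy as np

def foot_sliding(joints: np.ndarray,
                 foot_joint_idx: tuple[int, ...] = (10, 11),
                 contact_height: float = 0.05) -> float:
    """Mean horizontal foot displacement per frame while the foot is in ground contact.

    `joints` is a (num_frames, num_joints, 3) array of positions with the y-axis
    pointing up; the foot joint indices and the contact-height threshold are
    illustrative assumptions, not values from the paper.
    """
    per_joint = []
    for j in foot_joint_idx:
        pos = joints[:, j, :]                            # (T, 3) trajectory of one foot joint
        in_contact = pos[:, 1] < contact_height          # frames where the foot is near the floor
        step = np.linalg.norm(np.diff(pos[:, [0, 2]], axis=0), axis=1)  # horizontal motion per frame
        contact_step = in_contact[:-1] & in_contact[1:]  # both ends of the step are in contact
        if contact_step.any():
            per_joint.append(step[contact_step].mean())
    return float(np.mean(per_joint)) if per_joint else 0.0
```

A sentence-boundary discontinuity metric could be built the same way, measuring joint velocity or acceleration spikes in a small window around each planned keyframe boundary.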

Circularity Check

0 steps flagged

No circularity detected; claims rest on new data and empirical evaluation

full rationale

The paper constructs a new dataset (SuSuInterActs with 21K clips) and pre-trains a Motion Foundation Model on 200K+ external sequences, then proposes and trains an audio-aware plan-then-infill architecture whose outputs are evaluated on standard metrics (R@1, FGD, BC) and runtime benchmarks. No step reduces by construction to its inputs, no self-citation is load-bearing for the central claims, and the derivation chain is falsifiable through the reported experiments rather than tautological. The architecture choices are presented as design decisions trained on data, not derived from prior self-referential results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard deep-learning assumptions about transfer of motion priors and the representativeness of the collected dialogue data; no free parameters or invented entities are explicitly introduced in the abstract, and the single axiom below is a domain assumption rather than a formal postulate.

axioms (1)
  • domain assumption: Pre-trained general motion priors transfer effectively to conversational full-body and facial motions.
    Invoked by the use of the 200K+ motion sequence foundation model for the dialogue task.

pith-pipeline@v0.9.0 · 5587 in / 1304 out tokens · 67097 ms · 2026-05-13T20:11:31.721494+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AudioFace: Language-Assisted Speech-Driven Facial Animation with Multimodal Language Models

    cs.CV · 2026-05 · unverdicted · novelty 5.0

    AudioFace improves speech-driven facial animation by guiding blendshape prediction with linguistic and articulatory information extracted via multimodal language models.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · cited by 1 Pith paper
