SentiAvatar: Towards Expressive and Interactive Digital Humans
Pith reviewed 2026-05-13 20:11 UTC · model grok-4.3
The pith
SentiAvatar generates real-time 3D digital humans that speak, gesture, and emote by decoupling semantic planning from prosody interpolation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SentiAvatar is an end-to-end framework that first assembles SuSuInterActs, a 21K-clip dialogue corpus, pre-trains a Motion Foundation Model on 200K+ motion sequences, and then applies an audio-aware plan-then-infill architecture. Sentence-level semantic planning selects appropriate gestures and expressions while frame-level prosody interpolation aligns motion timing and dynamics to the incoming speech waveform, yielding motions that are both contextually appropriate and rhythmically natural.
What carries the argument
The audio-aware plan-then-infill architecture that decouples sentence-level semantic planning from frame-level prosody-driven interpolation.
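To make the decoupling concrete, here is a minimal Python sketch of a plan-then-infill loop. It is not the authors' implementation: poses are reduced to a single number, and the names `Keyframe`, `plan_keyframes`, and `infill`, the rest-peak-rest plan, and the energy-based time warp are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Keyframe:
    frame: int    # frame index inside the sentence
    pose: float   # 1-D stand-in for a full pose vector (assumption)


def plan_keyframes(gesture_amplitude: float, num_frames: int) -> List[Keyframe]:
    """Sentence-level planning (hypothetical): rest -> gesture peak -> rest.

    Requires num_frames >= 3 so the three keyframes land on distinct frames.
    """
    mid = num_frames // 2
    return [
        Keyframe(0, 0.0),
        Keyframe(mid, gesture_amplitude),
        Keyframe(num_frames - 1, 0.0),
    ]


def infill(keyframes: List[Keyframe], energy: List[float]) -> List[float]:
    """Frame-level infill: interpolate between keyframes, warping time by
    cumulative audio energy so the pose changes faster where speech is louder."""
    poses = [0.0] * len(energy)
    for a, b in zip(keyframes, keyframes[1:]):
        segment = energy[a.frame + 1 : b.frame + 1]
        total = sum(segment)
        poses[a.frame] = a.pose
        acc = 0.0
        for offset, e in enumerate(segment, start=1):
            acc += e
            # Fall back to uniform timing when the segment is silent.
            t = acc / total if total > 0 else offset / len(segment)
            poses[a.frame + offset] = a.pose + t * (b.pose - a.pose)
    return poses


# Per-frame audio energy for a 7-frame sentence (made-up values).
energy = [0.1, 0.1, 0.9, 0.9, 0.1, 0.1, 0.1]
poses = infill(plan_keyframes(1.0, len(energy)), energy)
```

The point of the split is that `plan_keyframes` never looks at the waveform and `infill` never decides which gesture to perform, which is the separation the paper credits for combining semantic fit with rhythmic alignment.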
If this is right
- The system reaches R@1 of 43.64 percent on SuSuInterActs, nearly twice the best reported baseline.
- It records FGD of 4.941 and BC of 8.078 on BEATv2 while generating six seconds of motion in 0.3 seconds.
- Unlimited multi-turn streaming becomes feasible because the architecture separates long-horizon planning from local interpolation.
- The same pre-trained motion priors can be reused for non-conversational actions beyond the dialogue domain.
- The released dataset and model weights enable direct replication and extension by other researchers.
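The two headline numbers above can be unpacked with a short Python sketch. The speed claim is simple arithmetic (6 s of motion per 0.3 s of compute, roughly 20 times faster than real time). For R@1, the sketch assumes the generic top-1 retrieval definition, where query i is correct when its highest-scoring candidate is candidate i; the paper's exact retrieval protocol is not described on this page, and the function names are hypothetical.

```python
from typing import List


def real_time_factor(motion_seconds: float, compute_seconds: float) -> float:
    """Seconds of motion produced per second of compute."""
    return motion_seconds / compute_seconds


def recall_at_1(similarity: List[List[float]]) -> float:
    """Top-1 retrieval accuracy.

    similarity[i][j] is the score between query i and candidate j;
    the ground-truth match for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == i)
    return hits / len(similarity)


speedup = real_time_factor(6.0, 0.3)

# Toy 2x2 score matrix: query 0 retrieves correctly, query 1 does not.
toy_r1 = recall_at_1([[0.9, 0.1], [0.8, 0.2]])
```

Under this definition, an R@1 of 43.64 percent would mean that a little under half of the queries rank their ground-truth match first.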
Where Pith is reading between the lines
- Similar plan-then-infill separation may improve other conditional motion tasks such as music-driven dance or sign-language synthesis.
- The 37-hour single-character corpus could serve as a seed for few-shot personalization of new avatars.
- Real-time performance opens direct integration into live virtual agents or game engines without offline rendering.
Load-bearing premise
The pre-trained motion foundation model supplies useful priors for conversational gestures and the decoupled planning-infill step produces coherent motions without introducing artifacts or losing naturalness.
What would settle it
A side-by-side evaluation in which human raters consistently judge SentiAvatar motions as less natural or less speech-synchronized than the best baseline on held-out multi-turn dialogues.
Original abstract
We present SentiAvatar, a framework for building expressive interactive 3D digital humans, and use it to create SuSu, a virtual character that speaks, gestures, and emotes in real time. Achieving such a system remains challenging, as it requires jointly addressing three key problems: the lack of large-scale, high-quality multimodal data, robust semantic-to-motion mapping, and fine-grained frame-level motion-prosody synchronization. To solve these problems, first, we build SuSuInterActs (21K clips, 37 hours), a dialogue corpus captured via optical motion capture around a single character with synchronized speech, full-body motion, and facial expressions. Second, we pre-train a Motion Foundation Model on 200K+ motion sequences, equipping it with rich action priors that go well beyond the conversation. We then propose an audio-aware plan-then-infill architecture that decouples sentence-level semantic planning from frame-level prosody-driven interpolation, so that generated motions are both semantically appropriate and rhythmically aligned with speech. Experiments show that SentiAvatar achieves state-of-the-art on both SuSuInterActs (R@1 43.64%, nearly 2 times the best baseline) and BEATv2 (FGD 4.941, BC 8.078), producing 6s of output in 0.3s with unlimited multi-turn streaming. The source code, model, and dataset are available at https://sentiavatar.github.io.
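A quick back-of-the-envelope check on the corpus figures, as a Python sketch. It treats "21K clips" as exactly 21,000, which is an assumption since the abstract gives only the rounded count; the helper name is hypothetical.

```python
def mean_clip_seconds(total_hours: float, num_clips: int) -> float:
    """Average clip length implied by a corpus duration and clip count."""
    return total_hours * 3600 / num_clips


# 37 hours over an assumed 21,000 clips.
avg_clip = mean_clip_seconds(37, 21_000)
```

Under that assumption the average clip runs a little over six seconds, the same order as the 6 s generation window quoted in the abstract.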
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents SentiAvatar, a framework for expressive interactive 3D digital humans. It introduces the SuSuInterActs dataset (21K clips, 37 hours) with synchronized speech, full-body motion, and facial expressions; pre-trains a Motion Foundation Model on 200K+ sequences; and proposes an audio-aware plan-then-infill architecture that decouples sentence-level semantic planning from frame-level prosody-driven interpolation. Experiments claim state-of-the-art results on SuSuInterActs (R@1 43.64%, nearly 2x best baseline) and BEATv2 (FGD 4.941, BC 8.078), with real-time performance (6s output in 0.3s) and unlimited multi-turn streaming support. Code, model, and dataset are released.
Significance. If the empirical results hold after addressing evaluation gaps, the work would advance real-time conversational avatar systems by releasing a large-scale multimodal dialogue corpus, rich motion priors from a foundation model, and a decoupled architecture for semantic-rhythmic alignment. The open-sourcing of resources is a clear strength that supports reproducibility and extension in the field.
major comments (3)
- [§5 Experiments] §5 Experiments and §5.1: The SOTA claims (R@1 43.64% on SuSuInterActs, nearly 2x baseline; FGD/BC on BEATv2) are reported without details on baseline re-implementations, hyper-parameter choices, training data splits, or potential data leakage between SuSuInterActs and the 200K+ pre-training sequences. This leaves the central performance gains only moderately supported.
- [§4.2 Architecture] §4.2 Architecture and §5.2: The plan-then-infill decoupling is presented as solving semantic appropriateness and rhythmic alignment, yet no quantitative metrics or ablations evaluate cross-turn coherence, sentence-boundary discontinuities, foot-sliding, or streaming artifacts. The reported R@1/FGD/BC scores do not directly test the weakest assumption for the multi-turn interactive claim.
- [§4.3 Ablations] §4.3 and Table 2: Ablation studies on the contribution of the pre-trained Motion Foundation Model priors versus the infill module are missing; without them, it is unclear whether the reported gains stem from the proposed decoupling or from the foundation model alone.
minor comments (2)
- [Figure 4] Figure 4 and §4.1: The caption and text do not specify the exact input/output dimensions or conditioning signals for the semantic planner versus the prosody interpolator.
- [§3.2] §3.2: Notation for motion sequences (e.g., use of M_t vs. P_t) is introduced without a consolidated table of symbols.
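The second major comment asks for measurements of sentence-boundary discontinuities and foot-sliding. A minimal Python sketch of what such metrics could look like follows. These are generic formulations, not metrics taken from the paper: the function names, the list-of-lists pose layout, and the 0.05 contact-height threshold are all assumptions.

```python
import math
from typing import List, Sequence, Tuple


def boundary_jump(prev_clip: List[Sequence[float]],
                  next_clip: List[Sequence[float]]) -> float:
    """Mean absolute pose difference between the last frame of one
    sentence clip and the first frame of the next."""
    last, first = prev_clip[-1], next_clip[0]
    return sum(abs(a - b) for a, b in zip(last, first)) / len(last)


def foot_sliding(foot_xz: List[Tuple[float, float]],
                 foot_height: List[float],
                 contact_height: float = 0.05) -> float:
    """Average horizontal foot displacement per frame step while the foot
    is in contact (below contact_height in both consecutive frames)."""
    slide, count = 0.0, 0
    for t in range(1, len(foot_xz)):
        if foot_height[t - 1] < contact_height and foot_height[t] < contact_height:
            dx = foot_xz[t][0] - foot_xz[t - 1][0]
            dz = foot_xz[t][1] - foot_xz[t - 1][1]
            slide += math.hypot(dx, dz)
            count += 1
    return slide / count if count else 0.0


# Two-joint toy clips: only the first joint jumps at the boundary.
jump = boundary_jump([[0.0, 0.0], [1.0, 2.0]], [[1.5, 2.0], [3.0, 3.0]])

# The foot is grounded for the first two frames, then lifts off.
slide = foot_sliding([(0.0, 0.0), (0.03, 0.04), (1.0, 1.0)], [0.0, 0.01, 0.5])
```

Reporting values like these per turn boundary, alongside R@1, FGD, and BC, would test the multi-turn streaming claim more directly than per-clip scores do.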
Simulated Author's Rebuttal
We appreciate the referee's detailed feedback and constructive suggestions. We address each major comment below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [§5 Experiments] §5 Experiments and §5.1: The SOTA claims (R@1 43.64% on SuSuInterActs, nearly 2x baseline; FGD/BC on BEATv2) are reported without details on baseline re-implementations, hyper-parameter choices, training data splits, or potential data leakage between SuSuInterActs and the 200K+ pre-training sequences. This leaves the central performance gains only moderately supported.
  Authors: We agree that additional details are necessary to fully support the SOTA claims. In the revised manuscript, we will expand §5.1 to include comprehensive information on baseline re-implementations, specific hyper-parameter choices, and the exact training/validation/test splits used. For data leakage, we will clarify that the pre-training sequences come from established public motion datasets (e.g., AMASS, HumanML3D) that do not overlap with our newly captured SuSuInterActs dataset, which was recorded in a controlled setting with a specific character. This will be explicitly documented to ensure transparency.
  Revision: yes
- Referee: [§4.2 Architecture] §4.2 Architecture and §5.2: The plan-then-infill decoupling is presented as solving semantic appropriateness and rhythmic alignment, yet no quantitative metrics or ablations evaluate cross-turn coherence, sentence-boundary discontinuities, foot-sliding, or streaming artifacts. The reported R@1/FGD/BC scores do not directly test the weakest assumption for the multi-turn interactive claim.
  Authors: We acknowledge that the current evaluation focuses on per-clip metrics, which primarily assess semantic appropriateness (via R@1) and rhythmic alignment (via FGD/BC). However, the plan-then-infill design inherently supports multi-turn coherence by planning at the sentence level and infilling frames consistently. To address this, we will add new quantitative evaluations in the revised §5.2, including metrics for cross-turn coherence (e.g., motion continuity scores across turns) and analysis of sentence-boundary discontinuities. We will also report on foot-sliding artifacts and streaming performance in multi-turn scenarios. While the real-time streaming capability is demonstrated qualitatively in the supplementary video, we will include supporting quantitative results.
  Revision: partial
- Referee: [§4.3 Ablations] §4.3 and Table 2: Ablation studies on the contribution of the pre-trained Motion Foundation Model priors versus the infill module are missing; without them, it is unclear whether the reported gains stem from the proposed decoupling or from the foundation model alone.
  Authors: We agree that isolating the contributions is important. In the revised version, we will expand the ablation studies in §4.3 and Table 2 to include experiments that compare: (1) the full model, (2) the model without the pre-trained Motion Foundation Model (using random initialization), and (3) variants without the infill module. This will clarify the role of each component in achieving the reported performance gains.
  Revision: yes
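The leakage question raised in the first exchange can be checked mechanically. The Python sketch below flags evaluation sequences whose rounded contents also appear in a pre-training pool. It is an illustration only: the `fingerprint` and `find_overlap` helpers, the dictionary layout, and the three-decimal rounding are assumptions, and a check like this catches only exact or near-exact duplicates, not retargeted, cropped, or time-shifted copies.

```python
import hashlib
from typing import Dict, List, Sequence

Motion = List[Sequence[float]]  # frames x pose dimensions (assumed layout)


def fingerprint(sequence: Motion, decimals: int = 3) -> str:
    """Hash a motion sequence after rounding, so tiny float noise
    does not hide a duplicate."""
    text = ";".join(
        # Adding 0.0 turns a rounded -0.0 into 0.0 so both format identically.
        ",".join(f"{round(v, decimals) + 0.0:.{decimals}f}" for v in frame)
        for frame in sequence
    )
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_overlap(pretrain: Dict[str, Motion],
                 evaluation: Dict[str, Motion]) -> List[str]:
    """Names of evaluation sequences whose fingerprint occurs in pretrain."""
    seen = {fingerprint(seq) for seq in pretrain.values()}
    return [name for name, seq in evaluation.items() if fingerprint(seq) in seen]


# Toy data: "x" duplicates "a" up to float noise, "y" is unrelated.
pretrain = {"a": [[0.0, 1.0], [0.5, 0.5]]}
evaluation = {"x": [[0.0001, 1.0], [0.5, 0.5]], "y": [[9.0, 9.0]]}
leaked = find_overlap(pretrain, evaluation)
```

An empty result would support the authors' statement that the public pre-training sets and the newly captured SuSuInterActs clips do not overlap, within the limits of exact-match detection.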
Circularity Check
No circularity detected; claims rest on new data and empirical evaluation
The paper constructs a new dataset (SuSuInterActs with 21K clips) and pre-trains a Motion Foundation Model on 200K+ external sequences, then proposes and trains an audio-aware plan-then-infill architecture whose outputs are evaluated on standard metrics (R@1, FGD, BC) and runtime benchmarks. No step reduces by construction to its inputs, no self-citation is load-bearing for the central claims, and the derivation chain is falsifiable through the reported experiments rather than tautological. The architecture choices are presented as design decisions trained on data, not derived from prior self-referential results.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Pre-trained general motion priors transfer effectively to conversational full-body and facial motions.
Forward citations
Cited by 1 Pith paper
- AudioFace: Language-Assisted Speech-Driven Facial Animation with Multimodal Language Models
  AudioFace improves speech-driven facial animation by guiding blendshape prediction with linguistic and articulatory information extracted via multimodal language models.
A Character Profile
SuSu is designed as a cohabiting companion character whose personality balances warmth with playful reserve. Table 10 and Table 11 detail her basic attributes and behavioral design, ...