MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding

Dan Xu; Di Huang; Jile Jiao; Shixiang Tang; Wanli Ouyang; Xuetao Feng; Yaqi Zhang; Yuan Wang

arxiv: 2410.21747 · v2 · pith:OAV34FOGnew · submitted 2024-10-29 · 💻 cs.CV

Yuan Wang, Di Huang, Yaqi Zhang, Wanli Ouyang, Jile Jiao, Xuetao Feng, Dan Xu, Shixiang Tang

keywords: motion, motiongpt-2, generation, body, control, focus, large, model

Generating lifelike human motions from descriptive texts has attracted considerable research attention in recent years, propelled by the emerging requirements of digital humans. Despite impressive advances, existing approaches are often constrained by limited control modalities, task specificity, and a sole focus on body motion representations. In this paper, we present MotionGPT-2, a unified Large Motion-Language Model (LMLM) that addresses these limitations. MotionGPT-2 accommodates multiple motion-relevant tasks and supports multimodal control conditions through pre-trained Large Language Models (LLMs). It quantizes multimodal inputs, such as text and single-frame poses, into discrete, LLM-interpretable tokens and integrates them into the LLM's vocabulary. These tokens are then organized into unified prompts that guide the LLM to generate motion outputs through a pretraining-then-finetuning paradigm. We also show that MotionGPT-2 adapts well to the challenging 3D holistic motion generation task, enabled by a motion discretization framework, Part-Aware VQVAE, which provides fine-grained representations of body and hand movements. Extensive experiments and visualizations validate the effectiveness of our method, demonstrating the adaptability of MotionGPT-2 across motion generation, motion captioning, and generalized motion completion tasks.
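
The tokenization idea in the abstract (quantize motion features, expose the code indices as extra vocabulary entries, then assemble a unified prompt) can be illustrated with a toy sketch. This is not the authors' implementation: the codebooks, the feature values, the `<body_i>` / `<hand_i>` token format, the prompt template, and all function names below are hypothetical, and a real system would use a learned VQ-VAE encoder and an actual LLM tokenizer rather than hand-written lists.

```python
from typing import Dict, List, Sequence


def nearest_code(vec: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2 distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def motion_tokens(frames, codebook, part: str) -> List[str]:
    """Quantize each frame feature and render it as a discrete token string."""
    return [f"<{part}_{nearest_code(frame, codebook)}>" for frame in frames]


def extend_vocab(vocab: Dict[str, int], codebook_size: int, part: str) -> Dict[str, int]:
    """Append one new vocabulary id per codebook entry (existing ids are kept)."""
    for i in range(codebook_size):
        vocab.setdefault(f"<{part}_{i}>", len(vocab))
    return vocab


def build_prompt(instruction: str, text: str, pose: List[str]) -> str:
    """Assemble a single prompt from the task instruction and control conditions."""
    return f"Instruction: {instruction}\nText: {text}\nPose: {''.join(pose)}\nMotion:"


# Hypothetical toy codebooks: one for body features, one for hand features,
# mirroring the part-aware split described in the abstract.
body_codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
hand_codebook = [[0.0], [1.0]]

body_tokens = motion_tokens([[0.1, -0.1], [0.9, 1.2], [1.8, 0.1]], body_codebook, "body")
hand_tokens = motion_tokens([[0.2], [0.7]], hand_codebook, "hand")

# Stand-in for an LLM vocabulary; a real tokenizer would have tens of thousands of entries.
vocab = {"hello": 0, "world": 1}
extend_vocab(vocab, len(body_codebook), "body")
extend_vocab(vocab, len(hand_codebook), "hand")

prompt = build_prompt(
    "Generate a motion matching the text and the given initial pose.",
    "a person waves with the right hand",
    body_tokens[:1] + hand_tokens[:1],  # single-frame pose condition
)
print(prompt)
print([vocab[t] for t in body_tokens + hand_tokens])
```

Keeping separate codebooks per body part means hand detail is not forced to share codes with coarse body motion, which is the intuition behind a part-aware quantizer; how the paper actually trains and combines these codebooks is not specified in the abstract.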

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Encoder-Free Human Motion Understanding via Structured Motion Descriptions
cs.CV 2026-04 unverdicted novelty 7.0

SMD converts human motion data into structured text descriptions, enabling LLMs to reach new state-of-the-art results on motion question answering and captioning without learned encoders.

ExpertEdit: Learning Skill-Aware Motion Editing from Expert Videos
cs.CV 2026-04 unverdicted novelty 7.0

ExpertEdit edits novice motions to expert skill levels by learning a motion prior from unpaired videos and infilling masked skill-critical spans.

ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body
cs.CV 2025-12 unverdicted novelty 7.0

ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on ...

SCRIPT: Scalable Diffusion Policy with Multi-stage Training for Language-driven Physics-based Humanoid Control
cs.GR 2026-05 unverdicted novelty 6.0

A new diffusion transformer policy with joint attention over actions, states, and text plus RL post-training outperforms prior methods on language alignment and motion quality for humanoid control.

Make it Simple, Make it Dance: Dance Motion Simplification to Support Novices' Dance Learning
cs.HC 2026-04 unverdicted novelty 6.0

Rule-based and learning-based algorithms simplify dance motions to help novices learn more effectively while maintaining naturalness and style.

Next-Scale Autoregressive Models for Text-to-Motion Generation
cs.CV 2026-04 unverdicted novelty 6.0

MoScale introduces a hierarchical next-scale autoregressive framework for text-to-motion generation that achieves state-of-the-art performance by refining motions from coarse to fine temporal resolutions.

LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens
cs.CV 2026-02 unverdicted novelty 6.0

LLaMo scales pretrained LLMs for unified motion-language tasks by encoding motion into continuous causal latents and adding a flow-matching head for real-time autoregressive generation and captioning.