MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding
read the original abstract
Generating lifelike human motions from descriptive texts has experienced remarkable research focus in the recent years, propelled by the emerging requirements of digital humans.Despite impressive advances, existing approaches are often constrained by limited control modalities, task specificity, and focus solely on body motion representations.In this paper, we present MotionGPT-2, a unified Large Motion-Language Model (LMLM) that addresses these limitations. MotionGPT-2 accommodates multiple motion-relevant tasks and supporting multimodal control conditions through pre-trained Large Language Models (LLMs). It quantizes multimodal inputs-such as text and single-frame poses-into discrete, LLM-interpretable tokens, seamlessly integrating them into the LLM's vocabulary. These tokens are then organized into unified prompts, guiding the LLM to generate motion outputs through a pretraining-then-finetuning paradigm. We also show that the proposed MotionGPT-2 is highly adaptable to the challenging 3D holistic motion generation task, enabled by the innovative motion discretization framework, Part-Aware VQVAE, which ensures fine-grained representations of body and hand movements. Extensive experiments and visualizations validate the effectiveness of our method, demonstrating the adaptability of MotionGPT-2 across motion generation, motion captioning, and generalized motion completion tasks.
This paper has not been read by Pith yet.
Forward citations
Cited by 7 Pith papers
-
Encoder-Free Human Motion Understanding via Structured Motion Descriptions
SMD converts human motion data into structured text descriptions, enabling LLMs to reach new state-of-the-art results on motion question answering and captioning without learned encoders.
-
ExpertEdit: Learning Skill-Aware Motion Editing from Expert Videos
ExpertEdit edits novice motions to expert skill levels by learning a motion prior from unpaired videos and infilling masked skill-critical spans.
-
ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body
ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on ...
-
SCRIPT: Scalable Diffusion Policy with Multi-stage Training for Language-driven Physics-based Humanoid Control
A new diffusion transformer policy with joint attention over actions, states, and text plus RL post-training outperforms prior methods on language alignment and motion quality for humanoid control.
-
Make it Simple, Make it Dance: Dance Motion Simplification to Support Novices' Dance Learning
Rule-based and learning-based algorithms simplify dance motions to help novices learn more effectively while maintaining naturalness and style.
-
Next-Scale Autoregressive Models for Text-to-Motion Generation
MoScale introduces a hierarchical next-scale autoregressive framework for text-to-motion generation that achieves state-of-the-art performance by refining motions from coarse to fine temporal resolutions.
-
LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens
LLaMo scales pretrained LLMs for unified motion-language tasks by encoding motion into continuous causal latents and adding a flow-matching head for real-time autoregressive generation and captioning.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.