pith. machine review for the scientific record.

arxiv: 2209.14916 · v2 · submitted 2022-09-29 · 💻 cs.CV · cs.GR

Recognition: 2 theorem links

· Lean Theorem

Human Motion Diffusion Model

Pith reviewed 2026-05-15 03:18 UTC · model grok-4.3

classification 💻 cs.CV cs.GR
keywords: human motion generation, diffusion models, text-to-motion, action-to-motion, classifier-free diffusion, transformer, geometric losses, motion synthesis

The pith

A diffusion model for human motion generates natural sequences from text or actions by predicting the clean sample at each step instead of noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Motion Diffusion Model (MDM), a transformer-based classifier-free diffusion model for human motion. It deliberately predicts the clean motion sample rather than the added noise during the reverse diffusion process. This choice lets the training directly apply geometric losses on joint locations and velocities, including foot contact constraints. The resulting model supports flexible conditioning inputs and multiple generation tasks while using modest compute. It reaches leading performance on standard text-to-motion and action-to-motion benchmarks.

Core claim

MDM is a transformer-based classifier-free diffusion model for human motion that predicts the clean sample at each diffusion step rather than the noise. This design permits the direct incorporation of geometric losses on the motion's locations and velocities. The model supports different conditioning modes for tasks such as text-to-motion and action-to-motion, trains with lightweight resources, and attains state-of-the-art results on leading benchmarks.

What carries the argument

The central mechanism is the prediction of the clean motion sample (instead of noise) at each diffusion step, which enables direct application of geometric losses on joint positions and velocities.
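
A small sketch can make this mechanism concrete. Because the network output is itself a motion, losses on positions, velocities, and foot contact can be evaluated on it directly. The data layout (`motion[frame][joint] = (x, y, z)`), the function names, and the binary contact mask are assumptions made for illustration; the paper's actual motion representation and loss weights are not reproduced here.

```python
# Hypothetical layout: motion[frame][joint] = (x, y, z) joint positions.

def _sq_dist(a, b):
    """Squared Euclidean distance between two 3D points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def position_loss(pred, target):
    """Mean squared distance between predicted and reference joints."""
    total = sum(_sq_dist(p, t)
                for pred_frame, target_frame in zip(pred, target)
                for p, t in zip(pred_frame, target_frame))
    return total / (len(pred) * len(pred[0]))

def _velocities(motion):
    """Frame-to-frame displacement of every joint (one fewer frame)."""
    return [[tuple(b - a for a, b in zip(j0, j1))
             for j0, j1 in zip(f0, f1)]
            for f0, f1 in zip(motion, motion[1:])]

def velocity_loss(pred, target):
    """Position loss applied to frame-to-frame displacements."""
    return position_loss(_velocities(pred), _velocities(target))

def foot_contact_loss(pred, contact, foot_joints):
    """Penalize foot displacement in frames flagged as ground contact.

    contact[frame][k] is 1 when foot_joints[k] is assumed to be on the
    ground in that frame, else 0 (an illustrative mask, not dataset labels).
    """
    vel = _velocities(pred)
    total = 0.0
    for i, frame_vel in enumerate(vel):
        for k, joint in enumerate(foot_joints):
            total += contact[i][k] * sum(c ** 2 for c in frame_vel[joint])
    return total / len(vel)

# Toy example: joint 0 is a foot that should stay planted but slides in x.
target = [[(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)] for _ in range(3)]
pred = [[(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
        [(0.1, 0.0, 0.0), (0.0, 1.0, 0.0)],
        [(0.2, 0.0, 0.0), (0.0, 1.0, 0.0)]]
contact = [[1], [1], [1]]
print(position_loss(pred, target),
      velocity_loss(pred, target),
      foot_contact_loss(pred, contact, foot_joints=[0]))
```

With a noise-predicting network, none of these terms could be applied to the raw output; the prediction would first have to be converted back into a motion estimate.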

If this is right

  • Different conditioning modes become usable for a range of motion generation tasks.
  • Geometric losses directly improve physical plausibility such as foot contact.
  • State-of-the-art performance is reached on text-to-motion and action-to-motion benchmarks.
  • Training succeeds with only lightweight computational resources.
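
The first bullet rests on classifier-free guidance, which can be sketched in a few lines: the condition is randomly dropped during training so that one network learns both conditional and unconditional predictions, and the two are blended at sampling time. The function names, the drop probability, and the guidance scale below are hypothetical illustration values, not settings taken from the paper.

```python
import random

def maybe_drop_condition(cond, drop_prob, rng):
    """Training-time step: replace the condition with None at random,
    so the same network also learns the unconditional prediction."""
    return None if rng.random() < drop_prob else cond

def guided_prediction(cond_pred, uncond_pred, scale):
    """Sampling-time step: move from the unconditional prediction
    toward (and, for scale > 1, past) the conditional one."""
    return [u + scale * (c - u) for c, u in zip(cond_pred, uncond_pred)]

rng = random.Random(0)
cond = maybe_drop_condition("a person walks forward", drop_prob=0.1, rng=rng)

# Flattened toy predictions standing in for two network outputs.
blended = guided_prediction([1.0, 2.0], [0.0, 1.0], scale=2.5)
print(cond, blended)
```

A scale of 0 recovers the unconditional prediction and a scale of 1 the conditional one, so the same trained network can trade fidelity to the prompt against diversity without retraining.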

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The clean-sample prediction strategy may transfer to diffusion models for other sequential data such as video frames or audio waveforms.
  • Direct geometric losses could ease coupling with physics simulators for longer-horizon motion planning.
  • Modest training cost suggests feasible adaptation for interactive or on-device animation tools.

Load-bearing premise

Predicting the clean sample rather than the noise at each diffusion step, together with geometric losses, will reliably yield higher-quality and more controllable motions than standard noise-prediction diffusion on the chosen motion datasets and benchmarks.

What would settle it

A side-by-side experiment training an otherwise identical noise-prediction diffusion model on the same datasets and showing equal or better results on text-to-motion and action-to-motion benchmarks would falsify the central claim.
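
One way to see what such a comparison would isolate: under the standard DDPM forward process, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, a noise prediction and a clean-sample prediction are algebraically interchangeable. The sketch below shows the conversion in both directions. The function names are hypothetical, and this is an editorial illustration rather than an experiment from the paper.

```python
import math

def add_noise(x0, eps, alpha_bar):
    """Forward process: mix the clean sample with Gaussian noise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]

def x0_from_eps(x_t, eps_hat, alpha_bar):
    """Recover a clean-sample estimate from a noise prediction."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - b * e) / a for x, e in zip(x_t, eps_hat)]

def eps_from_x0(x_t, x0_hat, alpha_bar):
    """Recover a noise estimate from a clean-sample prediction."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - a * s) / b for x, s in zip(x_t, x0_hat)]

x0, eps = [0.5, -1.0], [0.3, 0.7]

# A perfect noise prediction round-trips to the clean sample.
x_t = add_noise(x0, eps, alpha_bar=0.6)
print(x0_from_eps(x_t, eps, alpha_bar=0.6))

# At a high noise level (small alpha_bar) the division by sqrt(alpha_bar)
# magnifies any error in the noise prediction.
x_t_noisy = add_noise(x0, eps, alpha_bar=0.01)
eps_off = [e + 0.1 for e in eps]
print(x0_from_eps(x_t_noisy, eps_off, alpha_bar=0.01))
```

A noise-prediction baseline could therefore still apply geometric losses to the converted estimate, but the conversion amplifies prediction error at noisy timesteps. Whether that matters in practice is what the side-by-side experiment would have to show.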

Original abstract

Natural and expressive human motion generation is the holy grail of computer animation. It is a challenging task, due to the diversity of possible motion, human perceptual sensitivity to it, and the difficulty of accurately describing it. Therefore, current generative solutions are either low-quality or limited in expressiveness. Diffusion models, which have already shown remarkable generative capabilities in other domains, are promising candidates for human motion due to their many-to-many nature, but they tend to be resource hungry and hard to control. In this paper, we introduce Motion Diffusion Model (MDM), a carefully adapted classifier-free diffusion-based generative model for the human motion domain. MDM is transformer-based, combining insights from motion generation literature. A notable design-choice is the prediction of the sample, rather than the noise, in each diffusion step. This facilitates the use of established geometric losses on the locations and velocities of the motion, such as the foot contact loss. As we demonstrate, MDM is a generic approach, enabling different modes of conditioning, and different generation tasks. We show that our model is trained with lightweight resources and yet achieves state-of-the-art results on leading benchmarks for text-to-motion and action-to-motion. https://guytevet.github.io/mdm-page/ .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the Motion Diffusion Model (MDM), a transformer-based classifier-free diffusion model for human motion generation. It adapts standard diffusion by predicting the clean sample (rather than noise) at each step to enable direct application of geometric losses on joint locations, velocities, and foot contacts. MDM is presented as a generic framework supporting multiple conditioning modes and tasks, trained with lightweight resources while claiming state-of-the-art results on text-to-motion and action-to-motion benchmarks.

Significance. If the performance claims and the contribution of the sample-prediction design hold, the work would offer a practical advance in controllable motion synthesis by showing how diffusion models can be made efficient and compatible with domain-specific geometric constraints, potentially improving quality and expressiveness over prior generative approaches in animation and robotics.

major comments (2)
  1. [Method (design choice description)] The central design claim that predicting the clean sample (instead of noise) facilitates geometric losses and yields higher-quality motions lacks an ablation that holds architecture, conditioning mechanism, and loss terms fixed while varying only the prediction target. Without this isolation, gains on the benchmarks cannot be confidently attributed to the sample-prediction choice rather than the transformer backbone or classifier-free guidance.
  2. [Abstract and Experiments] The abstract asserts SOTA results and lightweight training yet supplies no quantitative metrics, baseline comparisons, ablation tables, or error analysis. The experiments section must include concrete numbers (e.g., FID, R-Precision, diversity scores) with statistical significance and direct comparisons to prior diffusion and non-diffusion methods to substantiate the performance claims.
minor comments (1)
  1. [Implementation details] Clarify the exact number of diffusion timesteps and training compute (GPU-hours) in the main text or a table so readers can verify the 'lightweight' claim.
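
For readers unfamiliar with the retrieval metric named in major comment 2, R-Precision can be sketched as top-k retrieval accuracy over a distance matrix between motion and text embeddings. The sketch assumes the embeddings come from some pretrained evaluator and that row i and column i form the matching pair; the function name, the toy distances, and the pool size are illustrative, not the benchmark protocol.

```python
def r_precision(dist, k):
    """Fraction of rows whose matching column (same index) is among
    the k smallest distances in that row."""
    hits = 0
    for i, row in enumerate(dist):
        ranked = sorted(range(len(row)), key=lambda j: row[j])
        if i in ranked[:k]:
            hits += 1
    return hits / len(dist)

# dist[i][j]: assumed distance between motion i and text j in a shared
# embedding space; the diagonal holds the true pairs.
dist = [
    [0.1, 0.5, 0.9],
    [0.2, 0.3, 0.8],
    [0.7, 0.4, 0.6],
]
for k in (1, 2, 3):
    print(k, r_precision(dist, k))
```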

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and constructive suggestions. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Method (design choice description)] The central design claim that predicting the clean sample (instead of noise) facilitates geometric losses and yields higher-quality motions lacks an ablation that holds architecture, conditioning mechanism, and loss terms fixed while varying only the prediction target. Without this isolation, gains on the benchmarks cannot be confidently attributed to the sample-prediction choice rather than the transformer backbone or classifier-free guidance.

    Authors: We appreciate this point. Our current experiments compare MDM to prior methods and include some component ablations, but we acknowledge the value of an isolated ablation on the prediction target. In the revised version, we will add a new experiment training a noise-prediction model with identical architecture, conditioning, and as many loss terms as feasible to directly compare the impact of sample prediction.

    revision: yes

  2. Referee: [Abstract and Experiments] The abstract asserts SOTA results and lightweight training yet supplies no quantitative metrics, baseline comparisons, ablation tables, or error analysis. The experiments section must include concrete numbers (e.g., FID, R-Precision, diversity scores) with statistical significance and direct comparisons to prior diffusion and non-diffusion methods to substantiate the performance claims.

    Authors: The experiments section of the manuscript does contain the requested quantitative metrics, tables with FID, R-Precision, diversity scores, and comparisons to baselines including both diffusion and non-diffusion methods. However, we agree the abstract is high-level. We will revise the abstract to include specific performance highlights and ensure the experiments section explicitly discusses statistical significance and includes any additional error analysis.

    revision: partial

Circularity Check

0 steps flagged

No significant circularity in MDM derivation chain

The paper frames MDM as an empirical adaptation of classifier-free diffusion models to human motion, using a transformer backbone and a design choice to predict the clean sample (rather than noise) at each step in order to enable geometric losses on locations, velocities, and foot contact. No equations, derivations, or self-referential definitions are presented that reduce the claimed architecture, conditioning modes, or benchmark performance to quantities defined by the model itself. Results are validated against external text-to-motion and action-to-motion benchmarks rather than internal consistency loops, and no load-bearing self-citations or uniqueness theorems are invoked. The central claims therefore remain independent of the inputs they are evaluated against.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The abstract relies on standard diffusion assumptions and the premise that human motion sequences can be tokenized and processed by transformers with added geometric regularizers; no new physical entities are introduced.

free parameters (1)
  • number of diffusion timesteps
    Standard hyperparameter in diffusion models; value chosen to balance quality and compute.
axioms (1)
  • domain assumption: Human motion can be represented as fixed-length sequences of joint positions and velocities suitable for transformer attention.
    Implicit in the choice of transformer backbone and geometric losses.
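
To illustrate why the number of timesteps in the ledger above is a genuine free parameter, the sketch below builds a cumulative noise schedule for a given T. It uses the linear beta schedule from the original DDPM formulation as a stand-in; the paper's own schedule and timestep count are not stated on this page, so both are assumptions here.

```python
def linear_alpha_bars(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction alpha_bar_t for a linear beta schedule.

    alpha_bar_t is the product of (1 - beta_s) for s up to t; values
    near 0 mean the sample is almost pure noise at that step.
    """
    if T == 1:
        betas = [beta_start]
    else:
        step = (beta_end - beta_start) / (T - 1)
        betas = [beta_start + step * t for t in range(T)]
    alpha_bars, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

# With the same beta range, few steps leave most of the signal intact,
# while many steps drive the final sample close to pure noise.
for T in (10, 100, 1000):
    print(T, linear_alpha_bars(T)[-1])
```

More steps cost more network evaluations at sampling time, which is the quality-versus-compute balance the ledger entry refers to.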

pith-pipeline@v0.9.0 · 5526 in / 1241 out tokens · 52374 ms · 2026-05-15T03:18:23.208904+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation Jcost_cosh_identity echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    A notable design-choice is the prediction of the sample, rather than the noise, in each diffusion step. This facilitates the use of established geometric losses on the locations and velocities of the motion, such as the foot contact loss.

  • Foundation.DimensionForcing dimension_forced unclear

    Relation between the paper passage and the cited Recognition theorem.

    MDM is a generic approach, enabling different modes of conditioning, and different generation tasks. We show that our model is trained with lightweight resources and yet achieves state-of-the-art results on leading benchmarks for text-to-motion and action-to-motion.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow

    cs.CV 2026-05 unverdicted novelty 7.0

    R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.

  2. Stylized Text-to-Motion Generation via Hypernetwork-Driven Low-Rank Adaptation

    cs.CV 2026-05 unverdicted novelty 7.0

    A hypernetwork maps style motion embeddings to LoRA updates that stylize text-driven motion diffusion models with improved generalization to unseen styles via contrastive structuring of the style space.

  3. ScaleMoGen: Autoregressive Next-Scale Prediction for Human Motion Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    ScaleMoGen introduces a scale-wise autoregressive framework that quantizes motions into hierarchical discrete tokens and predicts next-scale maps to achieve SOTA FID 0.030 on HumanML3D and text-guided editing.

  4. MaMi-HOI: Harmonizing Global Kinematics and Local Geometry for Human-Object Interaction Generation

    cs.RO 2026-05 unverdicted novelty 7.0

    MaMi-HOI counters geometric forgetting in diffusion models via a Geometry-Aware Proximity Adapter for precise contacts and a Kinematic Harmony Adapter for natural whole-body postures in human-object interactions.

  5. MotionGRPO: Overcoming Low Intra-Group Diversity in GRPO-Based Egocentric Motion Recovery

    cs.CV 2026-05 unverdicted novelty 7.0

    MotionGRPO models diffusion sampling as a Markov decision process optimized with Group Relative Policy Optimization, using hybrid rewards and noise injection to boost sample diversity and local joint precision in egoc...

  6. DanceCrafter: Fine-Grained Text-Driven Controllable Dance Generation via Choreographic Syntax

    cs.CV 2026-04 unverdicted novelty 7.0

    DanceCrafter generates high-fidelity, text-controlled dance sequences using a new Choreographic Syntax framework and a large fine-grained motion dataset.

  7. TeMuDance: Contrastive Alignment-Based Textual Control for Music-Driven Dance Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    TeMuDance enables text-based semantic control over music-conditioned dance generation by using motion as a bridge to align existing unpaired datasets and training a lightweight text branch on a frozen diffusion backbo...

  8. ExpertEdit: Learning Skill-Aware Motion Editing from Expert Videos

    cs.CV 2026-04 unverdicted novelty 7.0

    ExpertEdit edits novice motions to expert skill levels by learning a motion prior from unpaired videos and infilling masked skill-critical spans.

  9. Towards Continuous Sign Language Conversation from Isolated Signs

    cs.CV 2026-05 unverdicted novelty 6.0

    Constructs continuous sign conversation data from isolated signs using retrieval and diffusion models to train a direct sign-to-sign conversational AI.

  10. PhysiGen: Integrating Collision-Aware Physical Constraints for High-Fidelity Human-Human Interaction Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    PhysiGen reduces interpenetration in text-driven 3D human interaction generation by simplifying meshes to geometric primitives for fast collision detection and guiding optimization with collision regions.

  11. Exploring the Potential of Probabilistic Transformer for Time Series Modeling: A Report on the ST-PT Framework

    cs.LG 2026-04 unverdicted novelty 6.0

    ST-PT turns transformers into explicit factor graphs for time series, enabling structural injection of symbolic priors, per-sample conditional generation, and principled latent autoregressive forecasting via MFVI iterations.

  12. IAM: Identity-Aware Human Motion and Shape Joint Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    IAM jointly synthesizes motion sequences and body shape parameters conditioned on multimodal identity signals to achieve more realistic and identity-consistent human motions.

  13. Stability-Driven Motion Generation for Object-Guided Human-Human Co-Manipulation

    cs.CV 2026-04 unverdicted novelty 6.0

    A flow-matching model derives manipulation strategies from object affordance, adds an adversarial interaction prior, and uses stability simulation to generate natural, effective human-human co-manipulation motions.

  14. Visually-grounded Humanoid Agents

    cs.CV 2026-04 unverdicted novelty 6.0

    A coupled world-agent framework uses 3D Gaussian reconstruction and first-person RGB-D perception with iterative planning to enable goal-directed, collision-avoiding humanoid behavior in novel reconstructed scenes.

  15. Next-Scale Autoregressive Models for Text-to-Motion Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    MoScale introduces a hierarchical next-scale autoregressive framework for text-to-motion generation that achieves state-of-the-art performance by refining motions from coarse to fine temporal resolutions.

  16. Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control

    cs.RO 2026-05 unverdicted novelty 5.0

    DAJI learns future-aware joint intents from language to enable proactive humanoid control, reporting 94.42% rollout success on HumanML3D-style tasks and 0.152 subsequence FID on BABEL.

  17. R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow

    cs.CV 2026-05 unverdicted novelty 5.0

    R-DMesh uses a VAE with a learned rectification jump offset and Triflow Attention inside a rectified-flow diffusion transformer to produce video-aligned 4D meshes despite initial pose misalignment.

  18. KAN Text to Vision? The Exploration of Kolmogorov-Arnold Networks for Multi-Scale Sequence-Based Pose Animation from Sign Language Notation

    cs.CV 2026-05 unverdicted novelty 5.0

    KANMultiSign generates sign language poses from notation via coarse-to-fine multi-scale supervision and compact KAN-Transformer modules, achieving lower DTW joint error with fewer parameters than baselines on several ...

  19. MotionGRPO: Overcoming Low Intra-Group Diversity in GRPO-Based Egocentric Motion Recovery

    cs.CV 2026-05 unverdicted novelty 5.0

    MotionGRPO applies GRPO with noise injection and hybrid rewards to diffusion-based egocentric motion recovery, overcoming vanishing gradients from low intra-group diversity to reach state-of-the-art performance.

  20. Uni-HOI: A Unified framework for Learning the Joint distribution of Text and Human-Object Interaction

    cs.CV 2026-04 unverdicted novelty 5.0

    Uni-HOI learns the joint distribution of text, human motion, and object motion using LLMs and VQ-VAEs in a two-stage training process for multiple HOI tasks.

  21. EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    EgoMotion decouples reasoning from motion synthesis in egocentric vision-language tasks by mapping inputs to motion primitives via VLM then using diffusion to produce grounded and coherent 3D trajectories.

  22. Exploring Motion-Language Alignment for Text-driven Motion Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    MLA-Gen advances text-driven motion synthesis by aligning global motion patterns with fine-grained text semantics and mitigating attention sink effects via new masking techniques.

  23. MuSteerNet: Human Reaction Generation from Videos via Observation-Reaction Mutual Steering

    cs.CV 2026-03 unverdicted novelty 5.0

    MuSteerNet generates realistic 3D human reactions from videos by mutually steering visual observations and reaction motions to reduce content mismatch.

  24. AnimateAnyMesh++: A Flexible 4D Foundation Model for High-Fidelity Text-Driven Mesh Animation

    cs.CV 2026-04 unverdicted novelty 4.0

    AnimateAnyMesh++ animates arbitrary 3D meshes from text using an expanded 300K-identity DyMesh-XL dataset, a power-law topology-aware DyMeshVAE-Flex, and a variable-length rectified-flow generator to produce semantica...
