Motion-Adapter: A Diffusion Model Adapter for Text-to-Motion Generation of Compound Actions

Bingxin Yun; Liuyuxin Yang; Mingyu Yang; Yang Xu; Yue Jiang; Yuhe Zhang

arxiv: 2604.16135 · v2 · submitted 2026-04-17 · 💻 cs.CV


Yue Jiang, Mingyu Yang, Liuyuxin Yang, Yang Xu, Bingxin Yun, Yuhe Zhang

Pith reviewed 2026-05-10 08:09 UTC · model grok-4.3

classification 💻 cs.CV

keywords: text-to-motion generation · diffusion models · compound actions · cross-attention maps · motion synthesis · adapter module · denoising process · human motion


The pith

A plug-and-play adapter enables text-to-motion diffusion models to generate coherent compound actions from simple text by using decoupled cross-attention maps as structural masks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome two specific failures in current diffusion-based motion generators when asked to produce compound actions such as walking while waving. Earlier actions get overwritten and attention layers collapse when multiple motions must occur together, forcing users to supply overly detailed prompts or external language models. The authors introduce the Motion-Adapter module that inserts into an existing model without retraining and computes separate cross-attention maps to mask the denoising steps. These masks preserve distinct action structures across time, yielding full-body sequences that remain faithful to the input text. Readers care because the method removes the need for hand-crafted body-part instructions and produces more natural simultaneous behaviors.

Core claim

We propose the Motion-Adapter, a plug-and-play module that guides text-to-motion diffusion models in generating compound actions by computing decoupled cross-attention maps, which serve as structural masks during the denoising process. This directly counters catastrophic neglect of earlier actions and attention collapse from excessive feature fusion, allowing the model to produce faithful and coherent full-body motions from diverse textual prompts without requiring detailed descriptions, explicit body-part edits, or large language models.

What carries the argument

The Motion-Adapter module, which computes decoupled cross-attention maps that act as structural masks to guide the denoising process in a pre-trained text-to-motion diffusion model.
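The summary does not give the paper's exact formulation, so the following is only a minimal sketch of the general idea: score joints against each action phrase separately, rather than against one fused prompt, and threshold each resulting map into a binary mask. The function names, the softmax over joints, the relative threshold, and the toy features are all assumptions made for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def decoupled_attention_maps(joint_queries, action_keys):
    """Score every joint against each action's key on its own.

    joint_queries: list of J query vectors, one per joint (hypothetical features)
    action_keys:   dict mapping an action name to its key vector (hypothetical)
    Returns a dict: action name -> list of J weights that sum to 1.
    """
    dim = len(next(iter(action_keys.values())))
    scale = math.sqrt(dim)
    return {
        name: softmax([dot(q, key) / scale for q in joint_queries])
        for name, key in action_keys.items()
    }

def maps_to_masks(maps, rel_threshold=0.5):
    """Keep a joint for an action if its weight is at least rel_threshold * max weight."""
    return {
        name: [1 if w >= rel_threshold * max(ws) else 0 for w in ws]
        for name, ws in maps.items()
    }

# Toy example: four "joints" (two leg-like, two arm-like) with 2-d features.
joints = [[2.0, 0.0], [2.0, 0.1], [0.0, 2.0], [0.1, 2.0]]
keys = {"walk": [3.0, 0.0], "wave": [0.0, 3.0]}

maps = decoupled_attention_maps(joints, keys)
masks = maps_to_masks(maps)
print(masks)
```

In this toy setup each action ends up with its own mask over joints, which is the property the structural masks are meant to provide; how the real module derives queries and keys is not specified in this summary.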

If this is right

Compound actions can be synthesized from concise text without explicit body-part specifications or external language models.
Existing diffusion models gain the ability to handle concurrent motions while preserving temporal order and physical coherence.
Performance improves consistently across varied textual prompts compared with prior state-of-the-art methods.
The approach maintains plug-and-play compatibility, so no full model retraining is required.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The masking technique could transfer to other diffusion tasks that require simultaneous generation of multiple elements, such as multi-object scene synthesis.
Integration with longer or more complex motion sequences would test whether the structural masks scale without additional modifications.
The method opens a route for combining motion generation with other modalities like speech or music by treating them as additional conditioning signals under the same masking logic.

Load-bearing premise

Decoupled cross-attention maps will separate concurrent actions reliably enough to prevent overwriting and collapse across all prompts and motion types without introducing new artifacts or needing base-model retraining.

What would settle it

Running the adapter on prompts that combine two independent actions, such as 'greeting while walking,' and checking whether both actions appear simultaneously and without one overwriting the other in the generated sequence.
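One way such a check could be scored, assuming some per-frame detector already labels whether each action is visible, is to measure how much of the sequence each action covers and how often both are active at once. The detector itself, the function name, and the flag sequences below are hypothetical.

```python
def co_occurrence(flags_a, flags_b):
    """Return (coverage_a, coverage_b, overlap) as fractions of the sequence length.

    flags_a, flags_b: per-frame truthy/falsy values from a hypothetical
    action detector, one entry per frame of the generated motion.
    """
    n = len(flags_a)
    if n == 0 or n != len(flags_b):
        raise ValueError("flag sequences must be non-empty and of equal length")
    both = sum(1 for a, b in zip(flags_a, flags_b) if a and b)
    cover_a = sum(1 for a in flags_a if a) / n
    cover_b = sum(1 for b in flags_b if b) / n
    return cover_a, cover_b, both / n

# Both actions present: walking throughout, greeting in the middle frames.
walking = [1, 1, 1, 1, 1, 1]
greeting = [0, 1, 1, 1, 0, 0]
ok_case = co_occurrence(walking, greeting)

# Neglect-like failure: the greeting never shows up.
neglected = co_occurrence(walking, [0, 0, 0, 0, 0, 0])
print(ok_case, neglected)
```

An overlap of zero alongside full coverage for one action is the signature of one action overwriting the other; a non-zero overlap shows the two were produced concurrently for at least part of the sequence.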

Figures

Figures reproduced from arXiv: 2604.16135 by Bingxin Yun, Liuyuxin Yang, Mingyu Yang, Yang Xu, Yue Jiang, Yuhe Zhang.

**Figure 1.** Motion sequences generated by our Motion-Adapter given a textual prompt and a pre-trained motion diffusion model (…

**Figure 2.** Comparison of attention maps from SALAD [3] and our Motion…

**Figure 3.** Overview of the Motion-Adapter integrated into the diffusion model

**Figure 4.** Illustration of the skeletal pooling on the HumanML3D dataset [36].

**Figure 5.** The architecture of the decoupled cross-attention.

**Figure 6.** Qualitative comparison of compound actions combining ’greeting’…

**Figure 7.** Qualitative comparison results of compound actions combining…

**Figure 8.** Qualitative comparison results of compound actions combining…

**Figure 9.** Motion editing results of SALAD [3]. The black text represents the…

**Figure 10.** Qualitative comparison results of compound actions with complex textual prompts. To enhance visual clarity, the later frames are rendered with…

**Figure 11.** Results of our Motion-Adapter MotionDiffuse.

F. User Study. To further evaluate the perceptual quality and fidelity of the generated motions, we conducted a user study involving 65 participants. Each participant was asked to complete three main tasks: (i) rate 15 videos per method based on the semantic alignment between the motion and the corresponding textual prompt, providing a measure of fidelity for th…

**Figure 12.** Attention maps extracted at t = 700, 600, 500 along with the resulting motion generated by applying the masks throughout all denoising steps. For visual clarity, we show the skeletons.

As shown in Table IV, removing the masking step constraints leads to decreased performance across all evaluation metrics, suggesting that unconstrained masking places excessive emphasis on specific joints. We further obse…
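The excerpt above contrasts masking at every denoising step with masking under a step constraint. A minimal sketch of what such a constraint could look like is given here, assuming the mask is only applied while the timestep lies inside a window. The window bounds, the per-joint blend, and the toy step functions are illustrative assumptions, not the paper's procedure.

```python
def apply_mask(base, guided, mask):
    """Per-joint blend: take the guided value where mask is 1, the base value elsewhere."""
    return [g if m else b for b, g, m in zip(base, guided, mask)]

def denoise_with_gated_mask(x, steps, mask, base_step, guided_step, window):
    """Run a toy denoising loop, applying the mask only for lo <= t <= hi.

    base_step and guided_step are hypothetical stand-ins for one denoising
    update without and with the adapter's guidance.
    """
    lo, hi = window
    for t in steps:
        base = base_step(x, t)
        if lo <= t <= hi:
            x = apply_mask(base, guided_step(x, t), mask)
        else:
            x = base
    return x

# Toy updates: the base step halves every value, the guided step proposes 1.0.
halve = lambda x, t: [v / 2 for v in x]
to_one = lambda x, t: [1.0 for _ in x]

steps = [900, 700, 500, 300]
mask = [1, 0]  # guide only the first "joint"

constrained = denoise_with_gated_mask([8.0, 8.0], steps, mask, halve, to_one, (500, 700))
unconstrained = denoise_with_gated_mask([8.0, 8.0], steps, mask, halve, to_one, (0, 1000))
print(constrained, unconstrained)
```

In the toy run the unconstrained variant keeps forcing the masked joint toward the guided value until the last step, while the constrained variant lets the base model finish the sequence on its own.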

Original abstract

Recent advances in generative motion synthesis have enabled the production of realistic human motions from diverse input modalities. However, synthesizing compound actions from texts, which integrate multiple concurrent actions into coherent full-body sequences, remains a major challenge. We identify two key limitations in current text-to-motion diffusion models: (i) catastrophic neglect, where earlier actions are overwritten by later ones due to improper handling of temporal information, and (ii) attention collapse, which arises from excessive feature fusion in cross-attention mechanisms. As a result, existing approaches often depend on overly detailed textual descriptions (e.g., raising right hand), explicit body-part specifications (e.g., editing the upper body), or the use of large language models (LLMs) for body-part interpretation. These strategies lead to deficient semantic representations of physical structures and kinematic mechanisms, limiting the ability to incorporate natural behaviors such as greeting while walking. To address these issues, we propose the Motion-Adapter, a plug-and-play module that guides text-to-motion diffusion models in generating compound actions by computing decoupled cross-attention maps, which serve as structural masks during the denoising process. Extensive experiments demonstrate that our method consistently produces more faithful and coherent compound motions across diverse textual prompts, surpassing state-of-the-art approaches.
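The abstract does not define attention collapse quantitatively. One possible diagnostic, offered here only as an assumption, is to compare the attention maps of two action phrases: if they are nearly identical, the two actions are attending to the same joints and cannot be told apart. The function name and the toy maps are illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two attention maps flattened to equal-length lists."""
    if len(a) != len(b):
        raise ValueError("attention maps must have the same length")
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if den == 0.0:
        raise ValueError("attention maps must not be all zeros")
    return num / den

# Well-separated maps: each action attends to a different half of the joints.
separated = cosine_similarity([0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5])

# Collapsed maps: both actions spread attention identically over all joints.
collapsed = cosine_similarity([0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25])
print(separated, collapsed)
```

A value near 1 would suggest the maps have fused, while a value near 0 would suggest each action keeps its own region of the body.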

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The Motion-Adapter is a plug-and-play module that decouples cross-attention to mask and preserve multiple concurrent actions in text-to-motion diffusion, addressing a practical gap without retraining the base model.


This paper's core idea is a Motion-Adapter that plugs into existing text-to-motion diffusion models and uses decoupled cross-attention maps as structural masks to generate compound actions more coherently. The authors flag two concrete failure modes in current systems: catastrophic neglect, where later actions overwrite earlier ones across time steps, and attention collapse from over-fused features. Their fix separates the attention computation so each action can influence the denoising process independently, which should let simple prompts produce natural combinations like walking while waving without extra LLM parsing or body-part edits.

That design choice is the main novelty and it fits the plug-and-play requirement, so it could be added to models already in use for animation or robotics. The approach is explained clearly enough that the technical rationale holds together on its own terms. What the paper does well is tie the proposed module directly to the identified problems instead of offering another generic conditioning trick. The stress-test note is right that no obvious internal contradiction appears in the description.

The soft spots are mostly around evidence. The abstract asserts that extensive experiments show consistent gains in faithfulness and coherence over prior methods, but without the actual metrics, dataset splits, ablation tables, or qualitative failure cases visible here, it is impossible to judge how large the improvement is or whether new artifacts appear in edge cases. Generalization across base diffusion models and prompt styles also needs checking.

This is aimed at people already working on diffusion-based motion synthesis who want a lightweight way to extend their models to multi-action sequences. Readers who care about practical fixes for compound motion will get something usable from the method section even if the gains turn out modest.

It is coherent enough and addresses a real subfield need, so it deserves a serious referee to examine the experiments and code. I would send it for peer review rather than desk reject.

Referee Report

2 major / 1 minor

Summary. The paper claims to solve the problem of generating compound actions (multiple concurrent motions) from text using diffusion models by introducing Motion-Adapter. This plug-and-play module computes decoupled cross-attention maps that act as structural masks in the denoising process to prevent catastrophic neglect of early actions and attention collapse. It asserts that this leads to more coherent motions from simple prompts and outperforms existing methods in experiments.

Significance. Should the proposed adapter prove effective, it would represent a meaningful advance in text-to-motion generation by enabling natural compound behaviors without reliance on overly specific prompts or auxiliary LLMs. This could broaden the applicability of diffusion-based motion synthesis in fields like computer animation and human-robot interaction. The plug-and-play design is particularly promising for adoption.

major comments (2)

The mechanism for decoupling cross-attention maps and integrating them as masks during denoising is described conceptually but lacks the precise algorithmic steps or pseudocode needed to fully evaluate its impact on temporal information handling.
While the abstract states that the method surpasses SOTA, the experimental section should provide detailed quantitative results, including specific metrics, dataset information, and comparisons to baselines to substantiate the claims of consistent improvement across diverse prompts.

minor comments (1)

Include at least one key performance metric to support the superiority claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation of our work's potential impact and for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate clarifications and additional details where appropriate.


Referee: The mechanism for decoupling cross-attention maps and integrating them as masks during denoising is described conceptually but lacks the precise algorithmic steps or pseudocode needed to fully evaluate its impact on temporal information handling.

Authors: We appreciate this observation. Section 3 of the manuscript describes the decoupling of cross-attention maps and their use as structural masks, including how they preserve temporal information across denoising steps to mitigate catastrophic neglect. However, we agree that explicit algorithmic steps and pseudocode would improve reproducibility and allow better assessment of the temporal handling. In the revised version, we will add a dedicated algorithm box with precise pseudocode outlining the map computation, decoupling, masking, and integration into the diffusion process.

Revision: yes

Referee: While the abstract states that the method surpasses SOTA, the experimental section should provide detailed quantitative results, including specific metrics, dataset information, and comparisons to baselines to substantiate the claims of consistent improvement across diverse prompts.

Authors: We thank the referee for this point. The experimental section reports quantitative results on standard benchmarks (HumanML3D and KIT) using metrics such as FID, R-Precision, and diversity scores, along with comparisons to baselines including MDM, MLD, and others, plus user studies on compound action coherence. These demonstrate consistent improvements. To further substantiate the claims, we will expand the section in revision with additional tables containing exact numerical values, dataset statistics and splits, and more baseline comparisons across a broader set of diverse compound prompts.

Revision: partial
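The rebuttal names R-Precision among the reported metrics. For readers unfamiliar with it, one common formulation ranks candidate texts by embedding distance for each motion and counts how often the matching text lands in the top k. The sketch below assumes a precomputed distance matrix with matching pairs on the diagonal; the function name and the toy numbers are illustrative and not taken from the paper.

```python
def r_precision(dist, k=3):
    """Top-k retrieval accuracy.

    dist[i][j] is the distance between the embedding of motion i and the
    embedding of text j; text i is assumed to be the correct match for motion i.
    """
    if not dist:
        raise ValueError("distance matrix must not be empty")
    hits = 0
    for i, row in enumerate(dist):
        ranked = sorted(range(len(row)), key=lambda j: row[j])
        if i in ranked[:k]:
            hits += 1
    return hits / len(dist)

# Toy matrix: motions 0 and 2 retrieve their own text first, motion 1 does not.
toy = [
    [0.1, 0.5, 0.9],
    [0.4, 0.3, 0.2],
    [0.9, 0.8, 0.1],
]
top1 = r_precision(toy, k=1)
top2 = r_precision(toy, k=2)
print(top1, top2)
```

Higher values mean the generated motions are easier to match back to their prompts, which is why the metric is read as a proxy for text fidelity.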

Circularity Check

0 steps flagged

No significant circularity detected


The paper proposes Motion-Adapter as an independent plug-and-play module that computes decoupled cross-attention maps to serve as structural masks in existing text-to-motion diffusion models. This directly addresses the stated problems of catastrophic neglect and attention collapse without any derivation that reduces to self-definition, fitted inputs renamed as predictions, or self-citation chains. The approach is presented as an additive technical contribution with claimed experimental support across prompts, and no equation or step in the provided description equates to its own inputs by construction. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract introduces no explicit free parameters, mathematical axioms, or new invented entities; the contribution is an architectural adapter module whose internal details are not specified here.



Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 1 internal anchor

[1] G. Tevet, S. Raab, B. Gordon, Y. Shafir, D. Cohen-Or, and A. H. Bermano, “Human motion diffusion model,” in The Eleventh International Conference on Learning Representations, 2023.

[2] C. Guo, X. Zuo, S. Wang, S. Zou, Q. Sun, A. Deng, M. Gong, and L. Cheng, “Action2motion: Conditioned generation of 3d human motions,” in Proceedings of the 28th ACM International Conference on Multimedia, ser. MM ’20. New York, NY, USA: Association for Computing Machinery, 2020, pp. 2021–2029.

[3] S. Hong, C. Kim, S. Yoon, J. Nam, S. Cha, and J. Noh, “Salad: Skeleton-aware latent diffusion for text-driven motion generation and editing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025, p. 13836.

[4] L. Zhong, Y. Yang, and L. Changjian, “Smoogpt: Stylized motion generation using large language models,” arXiv:2509.04058, 2025.

[5] C. Mo, K. Hu, C. Long, D. Yuan, and Z. Wang, “Motion keyframe interpolation for any human skeleton via temporally consistent point cloud sampling and reconstruction,” in Proceedings of the European Conference on Computer Vision (ECCV), ser. Lecture Notes in Computer Science, vol. 15140. Springer, Cham, 2024, pp. 159–175.

[6] M. Petrovich, M. J. Black, and G. Varol, “Temos: Generating diverse human motions from textual descriptions,” in Proceedings of the European Conference on Computer Vision (ECCV). Cham: Springer Nature Switzerland, 2022, pp. 480–497.

[7] K. Karunratanakul, K. Preechakul, E. Aksan, T. Beeler, S. Suwajanakorn, and S. Tang, “Optimizing diffusion noise can serve as universal motion priors,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 1334–1345.

[8] N. Athanasiou, M. Petrovich, M. J. Black, and G. Varol, “Sinc: Spatial composition of 3d human motions for simultaneous action generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 9984–9995.

[9] M. Petrovich, O. Litany, U. Iqbal, M. J. Black, G. Varol, X. B. Peng, and D. Rempe, “Multi-track timeline control for text-driven 3d human motion generation,” in CVPR Workshop on Human Motion Generation, 2024.

[10] E. Pinyoanuntapong, P. Wang, M. Lee, and C. Chen, “Mmm: Generative masked motion model,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 1546–1555.
[11] N. Athanasiou, A. Cseke, M. Diomataris, M. J. Black, and G. Varol, “Motionfix: Text-driven 3d human motion editing,” in SIGGRAPH Asia 2024 Conference Papers. ACM, 2024. [Online]. Available: https://dl.acm.org/doi/10.1145/3680528.3687559

[12] M. Zhang, Z. Cai, L. Pan, F. Hong, X. Guo, L. Yang, and Z. Liu, “Motiondiffuse: Text-driven human motion generation with diffusion model,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 6, pp. 4115–4128, 2024.

[13] M. Petrovich, M. J. Black, and G. Varol, “Action-conditioned 3d human motion synthesis with transformer vae,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 10 985–10 995.

[14] B. Kim, H. J. Chang, J. Kim, and J. Y. Choi, “Global-local motion transformer for unsupervised skeleton-based action learning,” in Computer Vision – ECCV 2022. Cham: Springer Nature Switzerland, 2022, pp. 209–225.

[15] W. Mao, M. Liu, and M. Salzmann, “Weakly-supervised action transition learning for stochastic human motion prediction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 8151–8160.

[16] C. Zhong, L. Hu, Z. Zhang, and S. Xia, “Learning uncoupled-modulation cvae for 3d action-conditioned human motion synthesis,” in Computer Vision–ECCV 2022: 17th European Conference, 2022, pp. 716–732.

[17] T. Lucas, F. Baradel, P. Weinzaepfel, and G. Rogez, “Posegpt: Quantization-based 3d human motion generation and forecasting,” in Proceedings of the European Conference on Computer Vision (ECCV). Cham: Springer Nature Switzerland, 2022, pp. 417–435.

[18] C. Ahuja and L.-P. Morency, “Language2pose: Natural language grounded pose forecasting,” in 2019 International Conference on 3D Vision (3DV), 2019, pp. 719–728.

[19] C. Gu, X. Zuo, S. Wang, and L. Cheng, “Tm2t: Stochastic and tokenized modeling for the reciprocal generation of 3d human motions and texts,” in European Conference on Computer Vision (ECCV), 2022.

[20] C. Guo, S. Zou, X. Zuo, S. Wang, W. Ji, X. Li, and L. Cheng, “Generating diverse and natural 3d human motions from text,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 5152–5161.
[21] Q. Yu, M. Tanaka, and K. Fujiwara, “Exploring vision transformers for 3d human motion-language models with motion patches,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 937–946.

[22] C. Guo, Y. Mu, M. G. Javed, S. Wang, and L. Cheng, “Momask: Generative masked modeling of 3d human motions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 1900–1910.

[23] W. Zhou, Z. Dou, Z. Cao, Z. Liao, J. Wang, W. Wang, Y. Liu, T. Komura, W. Wang, and L. Liu, “Emdm: Efficient motion diffusion model for fast and high-quality motion generation,” in Proceedings of the European Conference on Computer Vision (ECCV). Cham: Springer Nature Switzerland, 2025, pp. 18–38.

[24] G. Barquero, S. Escalera, and C. Palmero, “Seamless human motion composition with blended positional encodings,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 457–469.

[25] Z. Zhang, A. Liu, I. Reid, R. Hartley, B. Zhuang, and H. Tang, “Motion mamba: Efficient and long sequence motion generation,” in Computer Vision – ECCV 2024. Cham: Springer Nature Switzerland, 2025, pp. 265–282.

[26] Z. Gao, D. Song, D. Jiang, C. Xue, and A.-A. Liu, “Motionflux: Efficient text-guided motion generation through rectified flow matching and preference alignment,” arXiv:2508.19527, 2025.

[27] G. Tevet, B. Gordon, A. Hertz, A. H. Bermano, and D. Cohen-Or, “Motionclip: Exposing human motion generation to clip space,” in Proceedings of the European Conference on Computer Vision (ECCV). Cham: Springer Nature Switzerland, 2022, pp. 358–374.

[28] J. Kim, J. Kim, and S. Choi, “Flame: Free-form language-based motion synthesis & editing,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 7. AAAI Press, 2023, pp. 8255–8263.

[29] Y. Huang, W. Wan, Y. Yang, C. Callison-Burch, M. Yatskar, and L. Liu, “Como: Controllable motion generation through language guided pose code editing,” in Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2024, pp. 180–196. [Online]. Available: https://link.springer.com/chapter/10.1007/978-3-031-73397-0_11

[30] M. Zhang, H. Li, Z. Cai, J. Ren, L. Yang, and Z. Liu, “Finemogen: Fine-grained spatio-temporal motion generation and editing,” in Advances in Neural Information Processing Systems (NeurIPS), 2023. [Online]. Available: https://arxiv.org/abs/2312.15004
[31] N. Athanasiou, M. Petrovich, M. J. Black, and G. Varol, “TEACH: Temporal Action Composition for 3D Humans,” in 2022 International Conference on 3D Vision (3DV). Los Alamitos, CA, USA: IEEE Computer Society, Sep. 2022, pp. 414–423.

[32] Y. Shafir, G. Tevet, R. Kapon, and A. H. Bermano, “Priormdm: Human motion diffusion as a generative prior,” in The Twelfth International Conference on Learning Representations, 2024.

[33] W. Yuan, W. Shen, Y. He, Y. Dong, X. Gu, Z. Dong, L. Bo, and Q. Huang, “Mogents: Motion generation based on spatial-temporal joint modeling,” in Conference on Neural Information Processing Systems, 2024.

[34] L. Mandelli and S. Berretti, “Generation of complex 3d human motion by temporal and spatial composition of diffusion models,” in 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 1279–1288.

[35] J. Zhang, H. Fan, and Y. Yang, “Energymogen: Compositional human motion generation with energy-based diffusion model in latent space,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[36] C. Guo, S. Zou, X. Zuo, S. Wang, L. Wang, and Y. Zhou, “Generating diverse and natural 3d human motions from text,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 5152–5161.

[37] S. Yan, Y. Xiong, and D. Lin, “Spatial temporal graph convolutional networks for skeleton-based action recognition,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.

[38] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc., 2017.

[39] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in Proceedings of the 38th International Conference on Machine Learning (ICML), vol. 139. PMLR, 2021, pp. 8748–8763.

[40] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533–536, 1986.
[41] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2019.

[42] D. Tolani, A. Goswami, and N. I. Badler, “Real-time inverse kinematics techniques for anthropomorphic limbs,” in Graphical Models, vol. 62, no. 5. Elsevier, 2000, pp. 353–388.

[43] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, “Smpl: A skinned multi-person linear model,” ACM Transactions on Graphics (TOG), vol. 34, no. 6, pp. 248:1–248:16, 2015.

[44] Z. Guo, Z. Hu, N. Zhao, and D. W. Soh, “Motionlab: Unified human motion generation and editing via the motion-condition-motion paradigm,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.

[45] H. Chefer, Y. Alaluf, Y. Vinker, L. Wolf, and D. Cohen-Or, “Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models,” ACM Trans. Graph., vol. 42, no. 4, Jul. 2023.

Author biographies

Yue Jiang received the B.S. degree in Software Engineering from Northwest University, China, in […] Since July 2023, she has been pursuing the M.S. degree in Software Engineering at Northwest University. Her research interests include computer graphics, motion synthesis, and deep learning.

Mingyu Yang has been pursuing the B.S. degree in Software Engineering at the School of Computer Science, Northwest University of China, since 2022. His research intere...

[…] She is currently working toward the M.S. degree in software engineering with the School of Computer Science, Northwest University of China. Her research interests include visualized analysis and deep learning.

Yang Xu received his B.E. and Ph.D. degrees from Beihang University in 2014 and 2020, respectively. He is currently an associate professor in the Sch...

[1] [1]

Human motion diffusion model,

G. Tevet, S. Raab, B. Gordon, Y . Shafir, D. Cohen-or, and A. H. Bermano, “Human motion diffusion model,” inThe Eleventh Interna- tional Conference on Learning Representations, 2023

work page 2023

[2] [2]

Action2motion: Conditioned generation of 3d human motions,

C. Guo, X. Zuo, S. Wang, S. Zou, Q. Sun, A. Deng, M. Gong, and L. Cheng, “Action2motion: Conditioned generation of 3d human motions,” inProceedings of the 28th ACM International Conference on Multimedia, ser. MM ’20. New York, NY , USA: Association for Computing Machinery, 2020, p. 2021–2029

work page 2020

[3] S. Hong, C. Kim, S. Yoon, J. Nam, S. Cha, and J. Noh, “Salad: Skeleton-aware latent diffusion for text-driven motion generation and editing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025, p. 13836.

[4] L. Zhong, Y. Yang, and L. Changjian, “Smoogpt: Stylized motion generation using large language models,” arXiv preprint arXiv:2509.04058, 2025.

[5] C. Mo, K. Hu, C. Long, D. Yuan, and Z. Wang, “Motion keyframe interpolation for any human skeleton via temporally consistent point cloud sampling and reconstruction,” in Proceedings of the European Conference on Computer Vision (ECCV), ser. Lecture Notes in Computer Science, vol. 15140. Springer, Cham, 2024, pp. 159–175.

[6] M. Petrovich, M. J. Black, and G. Varol, “Temos: Generating diverse human motions from textual descriptions,” in Proceedings of the European Conference on Computer Vision (ECCV). Cham: Springer Nature Switzerland, 2022, pp. 480–497.

[7] K. Karunratanakul, K. Preechakul, E. Aksan, T. Beeler, S. Suwajanakorn, and S. Tang, “Optimizing diffusion noise can serve as universal motion priors,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 1334–1345.

[8] N. Athanasiou, M. Petrovich, M. J. Black, and G. Varol, “Sinc: Spatial composition of 3d human motions for simultaneous action generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 9984–9995.

[9] M. Petrovich, O. Litany, U. Iqbal, M. J. Black, G. Varol, X. B. Peng, and D. Rempe, “Multi-track timeline control for text-driven 3d human motion generation,” in CVPR Workshop on Human Motion Generation, 2024.

[10] E. Pinyoanuntapong, P. Wang, M. Lee, and C. Chen, “Mmm: Generative masked motion model,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 1546–1555.

[11] N. Athanasiou, A. Cseke, M. Diomataris, M. J. Black, and G. Varol, “Motionfix: Text-driven 3d human motion editing,” in SIGGRAPH Asia 2024 Conference Papers. ACM, 2024. [Online]. Available: https://dl.acm.org/doi/10.1145/3680528.3687559

[12] M. Zhang, Z. Cai, L. Pan, F. Hong, X. Guo, L. Yang, and Z. Liu, “Motiondiffuse: Text-driven human motion generation with diffusion model,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 6, pp. 4115–4128, 2024.

[13] M. Petrovich, M. J. Black, and G. Varol, “Action-conditioned 3d human motion synthesis with transformer vae,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 10985–10995.

[14] B. Kim, H. J. Chang, J. Kim, and J. Y. Choi, “Global-local motion transformer for unsupervised skeleton-based action learning,” in Computer Vision – ECCV 2022. Cham: Springer Nature Switzerland, 2022, pp. 209–225.

[15] W. Mao, M. Liu, and M. Salzmann, “Weakly-supervised action transition learning for stochastic human motion prediction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 8151–8160.

[16] C. Zhong, L. Hu, Z. Zhang, and S. Xia, “Learning uncoupled-modulation cvae for 3d action-conditioned human motion synthesis,” in Computer Vision – ECCV 2022: 17th European Conference, 2022, pp. 716–732.

[17] T. Lucas, F. Baradel, P. Weinzaepfel, and G. Rogez, “Posegpt: Quantization-based 3d human motion generation and forecasting,” in Proceedings of the European Conference on Computer Vision (ECCV). Cham: Springer Nature Switzerland, 2022, pp. 417–435.

[18] C. Ahuja and L.-P. Morency, “Language2pose: Natural language grounded pose forecasting,” in 2019 International Conference on 3D Vision (3DV), 2019, pp. 719–728.

[19] C. Guo, X. Zuo, S. Wang, and L. Cheng, “Tm2t: Stochastic and tokenized modeling for the reciprocal generation of 3d human motions and texts,” in European Conference on Computer Vision (ECCV), 2022.

[20] C. Guo, S. Zou, X. Zuo, S. Wang, W. Ji, X. Li, and L. Cheng, “Generating diverse and natural 3d human motions from text,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 5152–5161.

[21] Q. Yu, M. Tanaka, and K. Fujiwara, “Exploring vision transformers for 3d human motion-language models with motion patches,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 937–946.

[22] C. Guo, Y. Mu, M. G. Javed, S. Wang, and L. Cheng, “Momask: Generative masked modeling of 3d human motions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 1900–1910.

[23] W. Zhou, Z. Dou, Z. Cao, Z. Liao, J. Wang, W. Wang, Y. Liu, T. Komura, W. Wang, and L. Liu, “Emdm: Efficient motion diffusion model for fast and high-quality motion generation,” in Proceedings of the European Conference on Computer Vision (ECCV). Cham: Springer Nature Switzerland, 2025, pp. 18–38.

[24] G. Barquero, S. Escalera, and C. Palmero, “Seamless human motion composition with blended positional encodings,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 457–469.

[25] Z. Zhang, A. Liu, I. Reid, R. Hartley, B. Zhuang, and H. Tang, “Motion mamba: Efficient and long sequence motion generation,” in Computer Vision – ECCV 2024. Cham: Springer Nature Switzerland, 2025, pp. 265–282.

[26] Z. Gao, D. Song, D. Jiang, C. Xue, and A.-A. Liu, “Motionflux: Efficient text-guided motion generation through rectified flow matching and preference alignment,” arXiv preprint arXiv:2508.19527, 2025.

[27] G. Tevet, B. Gordon, A. Hertz, A. H. Bermano, and D. Cohen-Or, “Motionclip: Exposing human motion generation to clip space,” in Proceedings of the European Conference on Computer Vision (ECCV). Cham: Springer Nature Switzerland, 2022, pp. 358–374.

[28] J. Kim, J. Kim, and S. Choi, “Flame: Free-form language-based motion synthesis & editing,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 7. AAAI Press, 2023, pp. 8255–8263.

[29] Y. Huang, W. Wan, Y. Yang, C. Callison-Burch, M. Yatskar, and L. Liu, “Como: Controllable motion generation through language guided pose code editing,” in Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2024, pp. 180–196. [Online]. Available: https://link.springer.com/chapter/10.1007/978-3-031-73397-0_11

[30] M. Zhang, H. Li, Z. Cai, J. Ren, L. Yang, and Z. Liu, “Finemogen: Fine-grained spatio-temporal motion generation and editing,” in Advances in Neural Information Processing Systems (NeurIPS), 2023. [Online]. Available: https://arxiv.org/abs/2312.15004

[31] N. Athanasiou, M. Petrovich, M. J. Black, and G. Varol, “TEACH: Temporal Action Composition for 3D Humans,” in 2022 International Conference on 3D Vision (3DV). Los Alamitos, CA, USA: IEEE Computer Society, Sep. 2022, pp. 414–423.

[32] Y. Shafir, G. Tevet, R. Kapon, and A. H. Bermano, “Priormdm: Human motion diffusion as a generative prior,” in The Twelfth International Conference on Learning Representations, 2024.

[33] W. Yuan, W. Shen, Y. He, Y. Dong, X. Gu, Z. Dong, L. Bo, and Q. Huang, “Mogents: Motion generation based on spatial-temporal joint modeling,” in Conference on Neural Information Processing Systems, 2024.

[34] L. Mandelli and S. Berretti, “Generation of complex 3d human motion by temporal and spatial composition of diffusion models,” in 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 1279–1288.

[35] J. Zhang, H. Fan, and Y. Yang, “Energymogen: Compositional human motion generation with energy-based diffusion model in latent space,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[36] C. Guo, S. Zou, X. Zuo, S. Wang, W. Ji, X. Li, and L. Cheng, “Generating diverse and natural 3d human motions from text,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 5152–5161.

[37] S. Yan, Y. Xiong, and D. Lin, “Spatial temporal graph convolutional networks for skeleton-based action recognition,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.

[38] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc., 2017.

[39] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in Proceedings of the 38th International Conference on Machine Learning (ICML), vol. 139. PMLR, 2021, pp. 8748–8763.

[40] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533–536, 1986.

[41] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2019.

[42] D. Tolani, A. Goswami, and N. I. Badler, “Real-time inverse kinematics techniques for anthropomorphic limbs,” in Graphical Models, vol. 62, no. 5. Elsevier, 2000, pp. 353–388.

[43] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, “Smpl: A skinned multi-person linear model,” ACM Transactions on Graphics (TOG), vol. 34, no. 6, pp. 248:1–248:16, 2015.

[44] Z. Guo, Z. Hu, N. Zhao, and D. W. Soh, “Motionlab: Unified human motion generation and editing via the motion-condition-motion paradigm,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.

[45] H. Chefer, Y. Alaluf, Y. Vinker, L. Wolf, and D. Cohen-Or, “Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models,” ACM Trans. Graph., vol. 42, no. 4, Jul. 2023.

Yue Jiang received the B.S. degree in Software Engineering from Northwest University, China, in

Since July 2023, she has been pursuing the M.S. degree in Software Engineering at Northwest University. Her research interests include computer graphics, motion synthesis, and deep learning.

Mingyu Yang has been pursuing the B.S. degree in Software Engineering at the School of Computer Science, Northwest University of China, since 2022. His research intere...

She is currently working toward the M.S. degree in software engineering with the School of Computer Science, Northwest University of China. Her research interests include visualized analysis and deep learning.

Yang Xu received his B.E. and Ph.D. degrees from Beihang University in 2014 and 2020, respectively. He is currently an associate professor in the Sch...