pith. sign in

arxiv: 2604.16135 · v2 · submitted 2026-04-17 · 💻 cs.CV

Motion-Adapter: A Diffusion Model Adapter for Text-to-Motion Generation of Compound Actions

Pith reviewed 2026-05-10 08:09 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-to-motion generationdiffusion modelscompound actionscross-attention mapsmotion synthesisadapter moduledenoising processhuman motion
0
0 comments X

The pith

A plug-and-play adapter enables text-to-motion diffusion models to generate coherent compound actions from simple text by using decoupled cross-attention maps as structural masks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome two specific failures in current diffusion-based motion generators when asked to produce compound actions such as walking while waving. Earlier actions get overwritten and attention layers collapse when multiple motions must occur together, forcing users to supply overly detailed prompts or external language models. The authors introduce the Motion-Adapter module that inserts into an existing model without retraining and computes separate cross-attention maps to mask the denoising steps. These masks preserve distinct action structures across time, yielding full-body sequences that remain faithful to the input text. Readers care because the method removes the need for hand-crafted body-part instructions and produces more natural simultaneous behaviors.

Core claim

We propose the Motion-Adapter, a plug-and-play module that guides text-to-motion diffusion models in generating compound actions by computing decoupled cross-attention maps, which serve as structural masks during the denoising process. This directly counters catastrophic neglect of earlier actions and attention collapse from excessive feature fusion, allowing the model to produce faithful and coherent full-body motions from diverse textual prompts without requiring detailed descriptions, explicit body-part edits, or large language models.

What carries the argument

The Motion-Adapter module, which computes decoupled cross-attention maps that act as structural masks to guide the denoising process in a pre-trained text-to-motion diffusion model.

If this is right

  • Compound actions can be synthesized from concise text without explicit body-part specifications or external language models.
  • Existing diffusion models gain the ability to handle concurrent motions while preserving temporal order and physical coherence.
  • Performance improves consistently across varied textual prompts compared with prior state-of-the-art methods.
  • The approach maintains plug-and-play compatibility, so no full model retraining is required.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The masking technique could transfer to other diffusion tasks that require simultaneous generation of multiple elements, such as multi-object scene synthesis.
  • Integration with longer or more complex motion sequences would test whether the structural masks scale without additional modifications.
  • The method opens a route for combining motion generation with other modalities like speech or music by treating them as additional conditioning signals under the same masking logic.

Load-bearing premise

Decoupled cross-attention maps will separate concurrent actions reliably enough to prevent overwriting and collapse across all prompts and motion types without introducing new artifacts or needing base-model retraining.

What would settle it

Running the adapter on prompts that combine two independent actions, such as 'greeting while walking,' and checking whether both actions appear simultaneously and without one overwriting the other in the generated sequence.

Figures

Figures reproduced from arXiv: 2604.16135 by Bingxin Yun, Liuyuxin Yang, Mingyu Yang, Yang Xu, Yue Jiang, Yuhe Zhang.

Figure 1
Figure 1. Figure 1: Motion sequences generated by our Motion-Adapter given a textual prompt and a pre-trained motion diffusion model ( [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of attention maps from SALAD [3] and our Motion [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: Illustration of the skeletal pooling on the HumanML3D dataset [36]. [PITH_FULL_IMAGE:figures/full_fig_p003_4.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the Motion-Adapter integrated into the diffusion model [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: The architecture of the decoupled cross-attention. [PITH_FULL_IMAGE:figures/full_fig_p004_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative comparison of compound actions combining ’greeting’ [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗
Figure 8
Figure 8. Figure 8: Qualitative comparison results of compound actions combining [PITH_FULL_IMAGE:figures/full_fig_p006_8.png] view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative comparison results of compound actions combining [PITH_FULL_IMAGE:figures/full_fig_p006_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: Motion editing results of SALAD [3]. The black text represents the [PITH_FULL_IMAGE:figures/full_fig_p006_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Qualitative comparison results of compound actions with complex textual prompts. To enhance visual clarity, the later frames are rendered with [PITH_FULL_IMAGE:figures/full_fig_p008_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Results of our Motion-Adapter MotionDiffuse. F. User Study To further evaluate the perceptual quality and fidelity of the generated motions, we conducted a user study involving 65 participants. Each participant was asked to complete three main tasks: (i) rate 15 videos per method based on the semantic alignment between the motion and the corresponding textual prompt, providing a measure of fidelity for th… view at source ↗
Figure 12
Figure 12. Figure 12: Attention maps extracted at t = 700, 600, 500 along with the resulting motion generated by applying the masks throughout all denoising steps. For visual clarity, we show the skeletons. As shown in Table IV, removing the masking step con￾straints leads to decreased performance across all evaluation metrics, suggesting that unconstrained masking places exces￾sive emphasis on specific joints. We further obse… view at source ↗
read the original abstract

Recent advances in generative motion synthesis have enabled the production of realistic human motions from diverse input modalities. However, synthesizing compound actions from texts, which integrate multiple concurrent actions into coherent full-body sequences, remains a major challenge. We identify two key limitations in current text-to-motion diffusion models: (i) catastrophic neglect, where earlier actions are overwritten by later ones due to improper handling of temporal information, and (ii) attention collapse, which arises from excessive feature fusion in cross-attention mechanisms. As a result, existing approaches often depend on overly detailed textual descriptions (e.g., raising right hand), explicit body-part specifications (e.g., editing the upper body), or the use of large language models (LLMs) for body-part interpretation. These strategies lead to deficient semantic representations of physical structures and kinematic mechanisms, limiting the ability to incorporate natural behaviors such as greeting while walking. To address these issues, we propose the Motion-Adapter, a plug-and-play module that guides text-to-motion diffusion models in generating compound actions by computing decoupled cross-attention maps, which serve as structural masks during the denoising process. Extensive experiments demonstrate that our method consistently produces more faithful and coherent compound motions across diverse textual prompts, surpassing state-of-the-art approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to solve the problem of generating compound actions (multiple concurrent motions) from text using diffusion models by introducing Motion-Adapter. This plug-and-play module computes decoupled cross-attention maps that act as structural masks in the denoising process to prevent catastrophic neglect of early actions and attention collapse. It asserts that this leads to more coherent motions from simple prompts and outperforms existing methods in experiments.

Significance. Should the proposed adapter prove effective, it would represent a meaningful advance in text-to-motion generation by enabling natural compound behaviors without reliance on overly specific prompts or auxiliary LLMs. This could broaden the applicability of diffusion-based motion synthesis in fields like computer animation and human-robot interaction. The plug-and-play design is particularly promising for adoption.

major comments (2)
  1. The mechanism for decoupling cross-attention maps and integrating them as masks during denoising is described conceptually but lacks the precise algorithmic steps or pseudocode needed to fully evaluate its impact on temporal information handling.
  2. While the abstract states that the method surpasses SOTA, the experimental section should provide detailed quantitative results, including specific metrics, dataset information, and comparisons to baselines to substantiate the claims of consistent improvement across diverse prompts.
minor comments (1)
  1. Include at least one key performance metric to support the superiority claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation of our work's potential impact and for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate clarifications and additional details where appropriate.

read point-by-point responses
  1. Referee: The mechanism for decoupling cross-attention maps and integrating them as masks during denoising is described conceptually but lacks the precise algorithmic steps or pseudocode needed to fully evaluate its impact on temporal information handling.

    Authors: We appreciate this observation. Section 3 of the manuscript describes the decoupling of cross-attention maps and their use as structural masks, including how they preserve temporal information across denoising steps to mitigate catastrophic neglect. However, we agree that explicit algorithmic steps and pseudocode would improve reproducibility and allow better assessment of the temporal handling. In the revised version, we will add a dedicated algorithm box with precise pseudocode outlining the map computation, decoupling, masking, and integration into the diffusion process. revision: yes

  2. Referee: While the abstract states that the method surpasses SOTA, the experimental section should provide detailed quantitative results, including specific metrics, dataset information, and comparisons to baselines to substantiate the claims of consistent improvement across diverse prompts.

    Authors: We thank the referee for this point. The experimental section reports quantitative results on standard benchmarks (HumanML3D and KIT) using metrics such as FID, R-Precision, and diversity scores, along with comparisons to baselines including MDM, MLD, and others, plus user studies on compound action coherence. These demonstrate consistent improvements. To further substantiate the claims, we will expand the section in revision with additional tables containing exact numerical values, dataset statistics and splits, and more baseline comparisons across a broader set of diverse compound prompts. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper proposes Motion-Adapter as an independent plug-and-play module that computes decoupled cross-attention maps to serve as structural masks in existing text-to-motion diffusion models. This directly addresses the stated problems of catastrophic neglect and attention collapse without any derivation that reduces to self-definition, fitted inputs renamed as predictions, or self-citation chains. The approach is presented as an additive technical contribution with claimed experimental support across prompts, and no equation or step in the provided description equates to its own inputs by construction. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract introduces no explicit free parameters, mathematical axioms, or new invented entities; the contribution is an architectural adapter module whose internal details are not specified here.

pith-pipeline@v0.9.0 · 5535 in / 1104 out tokens · 55658 ms · 2026-05-10T08:09:16.972859+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 1 internal anchor

  1. [1]

    Human motion diffusion model,

    G. Tevet, S. Raab, B. Gordon, Y . Shafir, D. Cohen-or, and A. H. Bermano, “Human motion diffusion model,” inThe Eleventh Interna- tional Conference on Learning Representations, 2023

  2. [2]

    Action2motion: Conditioned generation of 3d human motions,

    C. Guo, X. Zuo, S. Wang, S. Zou, Q. Sun, A. Deng, M. Gong, and L. Cheng, “Action2motion: Conditioned generation of 3d human motions,” inProceedings of the 28th ACM International Conference on Multimedia, ser. MM ’20. New York, NY , USA: Association for Computing Machinery, 2020, p. 2021–2029

  3. [3]

    Salad: Skeleton- aware latent diffusion for text-driven motion generation and editing,

    S. Hong, C. Kim, S. Yoon, J. Nam, S. Cha, and J. Noh, “Salad: Skeleton- aware latent diffusion for text-driven motion generation and editing,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025, p. 13836

  4. [4]

    arXiv preprint arXiv:2509.04058 , year=

    L. Zhong, Y . Yang, and L. Changjian, “Smoogpt: Stylized motion generation using large language models,” inarXiv:2509.04058, 2025

  5. [5]

    Motion keyframe interpolation for any human skeleton via temporally consistent point cloud sampling and reconstruction,

    C. Mo, K. Hu, C. Long, D. Yuan, and Z. Wang, “Motion keyframe interpolation for any human skeleton via temporally consistent point cloud sampling and reconstruction,” inProceedings of the European Conference on Computer Vision (ECCV), ser. Lecture Notes in Computer Science, vol. 15140. Springer, Cham, 2024, pp. 159–175

  6. [6]

    Temos: Generating diverse human motions from textual descriptions,

    M. Petrovich, M. J. Black, and G. Varol, “Temos: Generating diverse human motions from textual descriptions,” inProceedings of the Euro- pean Conference on Computer Vision (ECCV). Cham: Springer Nature Switzerland, 2022, pp. 480–497

  7. [7]

    Optimizing diffusion noise can serve as universal motion priors,

    K. Karunratanakul, K. Preechakul, E. Aksan, T. Beeler, S. Suwajanakorn, and S. Tang, “Optimizing diffusion noise can serve as universal motion priors,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 1334–1345

  8. [8]

    Sinc: Spatial composition of 3d human motions for simultaneous action generation,

    N. Athanasiou, M. Petrovich, M. J. Black, and G. Varol, “Sinc: Spatial composition of 3d human motions for simultaneous action generation,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 9984–9995

  9. [9]

    Multi-track timeline control for text-driven 3d human motion generation,

    M. Petrovich, O. Litany, U. Iqbal, M. J. Black, G. Varol, X. B. Peng, and D. Rempe, “Multi-track timeline control for text-driven 3d human motion generation,” inCVPR Workshop on Human Motion Generation, 2024

  10. [10]

    Mmm: Generative masked motion model,

    E. Pinyoanuntapong, P. Wang, M. Lee, and C. Chen, “Mmm: Generative masked motion model,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 1546–1555

  11. [11]

    Motionfix: Text-driven 3d human motion editing,

    N. Athanasiou, A. Cseke, M. Diomataris, M. J. Black, and G. Varol, “Motionfix: Text-driven 3d human motion editing,” inSIGGRAPH Asia 2024 Conference Papers. ACM, 2024. [Online]. Available: https://dl.acm.org/doi/10.1145/3680528.3687559

  12. [12]

    Motiondiffuse: Text-driven human motion generation with diffusion model,

    M. Zhang, Z. Cai, L. Pan, F. Hong, X. Guo, L. Yang, and Z. Liu, “Motiondiffuse: Text-driven human motion generation with diffusion model,”IEEE Transactions on Pattern Analysis and Machine Intelli- gence, vol. 46, no. 6, pp. 4115–4128, 2024

  13. [13]

    Action-conditioned 3d human motion synthesis with transformer vae,

    M. Petrovich, M. J. Black, and G. Varol, “Action-conditioned 3d human motion synthesis with transformer vae,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 10 985–10 995

  14. [14]

    Global-local motion transformer for unsupervised skeleton-based action learning,

    B. Kim, H. J. Chang, J. Kim, and J. Y . Choi, “Global-local motion transformer for unsupervised skeleton-based action learning,” inCom- puter Vision – ECCV 2022. Cham: Springer Nature Switzerland, 2022, pp. 209–225

  15. [15]

    Weakly-supervised action transition learning for stochastic human motion prediction,

    W. Mao, M. Liu, and M. Salzmann, “Weakly-supervised action transition learning for stochastic human motion prediction,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 8151–8160

  16. [16]

    Learning uncoupled-modulation cvae for 3d action-conditioned human motion synthesis,

    C. Zhong, L. Hu, Z. Zhang, and S. Xia, “Learning uncoupled-modulation cvae for 3d action-conditioned human motion synthesis,” inComputer Vision–ECCV 2022: 17th European Conference, 2022, pp. 716–732

  17. [17]

    Posegpt: Quantization-based 3d human motion generation and forecasting,

    T. Lucas, F. Baradel, P. Weinzaepfel, and G. Rogez, “Posegpt: Quantization-based 3d human motion generation and forecasting,” in Proceedings of the European Conference on Computer Vision (ECCV). Cham: Springer Nature Switzerland, 2022, pp. 417–435

  18. [18]

    Language2pose: Natural language grounded pose forecasting,

    C. Ahuja and L.-P. Morency, “Language2pose: Natural language grounded pose forecasting,” in2019 International Conference on 3D Vision (3DV), 2019, pp. 719–728

  19. [19]

    Tm2t: Stochastic and tokenized modeling for the reciprocal generation of 3d human motions and texts,

    C. Gu, X. Zuo, S. Wang, and L. Cheng, “Tm2t: Stochastic and tokenized modeling for the reciprocal generation of 3d human motions and texts,” inEuropean Conference on Computer Vision (ECCV), 2022

  20. [20]

    Generating diverse and natural 3d human motions from text,

    C. Guo, S. Zou, X. Zuo, S. Wang, W. Ji, X. Li, and L. Cheng, “Generating diverse and natural 3d human motions from text,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 5152–5161

  21. [21]

    Exploring vision transformers for 3d human motion-language models with motion patches,

    Q. Yu, M. Tanaka, and K. Fujiwara, “Exploring vision transformers for 3d human motion-language models with motion patches,” inPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 937–946

  22. [22]

    Momask: Generative masked modeling of 3d human motions,

    C. Guo, Y . Mu, M. G. Javed, S. Wang, and L. Cheng, “Momask: Generative masked modeling of 3d human motions,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 1900–1910

  23. [23]

    Emdm: Efficient motion diffusion model for fast and high-quality motion generation,

    W. Zhou, Z. Dou, Z. Cao, Z. Liao, J. Wang, W. Wang, Y . Liu, T. Komura, W. Wang, and L. Liu, “Emdm: Efficient motion diffusion model for fast and high-quality motion generation,” inProceedings of the European Conference on Computer Vision (ECCV). Cham: Springer Nature Switzerland, 2025, pp. 18–38. 12

  24. [24]

    Seamless human motion composition with blended positional encodings,

    G. Barquero, S. Escalera, and C. Palmero, “Seamless human motion composition with blended positional encodings,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 457–469

  25. [25]

    Motion mamba: Efficient and long sequence motion generation,

    Z. Zhang, A. Liu, I. Reid, R. Hartley, B. Zhuang, and H. Tang, “Motion mamba: Efficient and long sequence motion generation,” inComputer Vision – ECCV 2024. Cham: Springer Nature Switzerland, 2025, pp. 265–282

  26. [26]

    Motionflux: Efficient text-guided motion generation through rectified flow matching and preference alignment,

    Z. Gao, D. Song, D. Jiang, C. Xue, and A.-A. Liu, “Motionflux: Efficient text-guided motion generation through rectified flow matching and preference alignment,” inarxiv:2508.19527, 2025

  27. [27]

    Motionclip: Exposing human motion generation to clip space,

    G. Tevet, B. Gordon, A. Hertz, A. H. Bermano, and D. Cohen-Or, “Motionclip: Exposing human motion generation to clip space,” in Proceedings of the European Conference on Computer Vision (ECCV). Cham: Springer Nature Switzerland, 2022, pp. 358–374

  28. [28]

    Flame: Free-form language-based motion synthesis & editing,

    J. Kim, J. Kim, and S. Choi, “Flame: Free-form language-based motion synthesis & editing,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 7. AAAI Press, 2023, pp. 8255–8263

  29. [29]

    Transactions on Machine Learning Research (2022) https: //doi.org/10.1007/978-3-031-73397-0 4

    Y . Huang, W. Wan, Y . Yang, C. Callison-Burch, M. Yatskar, and L. Liu, “Como: Controllable motion generation through language guided pose code editing,” inProceedings of the European Conference on Computer Vision (ECCV). Springer, 2024, pp. 180–196. [Online]. Available: https://link.springer.com/chapter/10.1007/978-3-031-73397-0 11

  30. [30]

    Finemogen: Fine- grained spatio-temporal motion generation and editing,

    M. Zhang, H. Li, Z. Cai, J. Ren, L. Yang, and Z. Liu, “Finemogen: Fine- grained spatio-temporal motion generation and editing,” inAdvances in Neural Information Processing Systems (NeurIPS), 2023. [Online]. Available: https://arxiv.org/abs/2312.15004

  31. [31]

    TEACH: Temporal Action Composition for 3D Humans ,

    N. Athanasiou, M. Petrovich, M. J. Black, and G. Varol, “ TEACH: Temporal Action Composition for 3D Humans ,” in2022 International Conference on 3D Vision (3DV). Los Alamitos, CA, USA: IEEE Computer Society, Sep. 2022, pp. 414–423

  32. [32]

    Priormdm: Human motion diffusion as a generative prior,

    Y . Shafir, G. Tevet, R. Kapon, and A. H. Bermano, “Priormdm: Human motion diffusion as a generative prior,” inThe Twelfth International Conference on Learning Representations, 2024

  33. [33]

    Mogents: Motion generation based on spatial-temporal joint modeling,

    W. Yuan, W. Shen, Y . He, Y . Dong, X. Gu, Z. Dong, L. Bo, and Q. Huang, “Mogents: Motion generation based on spatial-temporal joint modeling,” inConference on Neural Information Processing Systems, 2024

  34. [34]

    Generation of complex 3d human motion by temporal and spatial composition of diffusion models,

    L. Mandelli and S. Berretti, “Generation of complex 3d human motion by temporal and spatial composition of diffusion models,” in2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 1279–1288

  35. [35]

    Energymogen: Compositional human motion generation with energy-based diffusion model in latent space,

    J. Zhang, H. Fan, and Y . Yang, “Energymogen: Compositional human motion generation with energy-based diffusion model in latent space,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  36. [36]

    Generating diverse and natural 3d human motions from text,

    C. Guo, S. Zou, X. Zuo, S. Wang, L. Wang, and Y . Zhou, “Generating diverse and natural 3d human motions from text,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 5152–5161

  37. [37]

    Spatial temporal graph convolutional networks for skeleton-based action recognition,

    S. Yan, Y . Xiong, and D. Lin, “Spatial temporal graph convolutional networks for skeleton-based action recognition,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018

  38. [38]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” inAdvances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc., 2017

  39. [39]

    Learning transferable visual models from natural language supervi- sion,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervi- sion,” inProceedings of the 38th International Conference on Machine Learning (ICML), vol. 139. PMLR, 2021, pp. 8748–8763

  40. [40]

    Learning repre- sentations by back-propagating errors,

    D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning repre- sentations by back-propagating errors,”Nature, vol. 323, no. 6088, pp. 533–536, 1986

  41. [41]

    Decoupled Weight Decay Regularization

    I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2019

  42. [42]

    Real-time inverse kinematics techniques for anthropomorphic limbs,

    D. Tolani, A. Goswami, and N. I. Badler, “Real-time inverse kinematics techniques for anthropomorphic limbs,” inGraphical Models, vol. 62, no. 5. Elsevier, 2000, pp. 353–388

  43. [43]

    Smpl: A skinned multi-person linear model,

    M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, “Smpl: A skinned multi-person linear model,”ACM Transactions on Graphics (TOG), vol. 34, no. 6, pp. 248:1–248:16, 2015

  44. [44]

    Motionlab: Unified human mo- tion generation and editing via the motion-condition-motion paradigm,

    Z. Guo, Z. Hu, N. Zhao, and D. W. Soh, “Motionlab: Unified human mo- tion generation and editing via the motion-condition-motion paradigm,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

  45. [45]

    Attend-and- excite: Attention-based semantic guidance for text-to-image diffusion models,

    H. Chefer, Y . Alaluf, Y . Vinker, L. Wolf, and D. Cohen-Or, “Attend-and- excite: Attention-based semantic guidance for text-to-image diffusion models,”ACM Trans. Graph., vol. 42, no. 4, Jul. 2023. Yue Jiangreceived the B.S. degree in Software Engineering from Northwest University, China, in

  46. [46]

    degree in Software Engineering at Northwest University

    Since July 2023, She has been pursuing the M.S. degree in Software Engineering at Northwest University. Her research interests include computer graphics, motion synthesis, and deep learning. Mingyu Yanghas been pursuing the B.S. degree in Software Engineering at the School of Computer Science, Northwest University of China, since 2022. His research intere...

  47. [47]

    Her research interests include visualized analysis and deep learning

    She is currently working toward the M.S degree in software engineering with the School of Computer Science, Northwest University of China. Her research interests include visualized analysis and deep learning. Yang Xureceived his B.E. and Ph.D. degrees from Beihang University in 2014 and 2020, respectively. He is currently an associate professor in the Sch...