pith. sign in

arxiv: 2606.22726 · v1 · pith:7HGA7BA6new · submitted 2026-06-22 · 💻 cs.AI

Text Dictates, Music Decorates: Energy-based Attention for Editable Dance Motion Generation

Pith reviewed 2026-06-26 09:13 UTC · model grok-4.3

classification 💻 cs.AI
keywords dance motion generationdiffusion transformermultimodal conditioningattention modulechoreographic semanticstext-to-motionmusic synchronization
0
0 comments X

The pith

STREAM separates text semantics and music beats in a diffusion transformer to generate dance motions that follow both without one overwriting the other.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that modality collapse in text-and-music conditioned dance generation can be avoided by using separate conditioning mechanisms. Global text prompts dictate the overall kinematic structure of the motion through Adaptive Layer Normalization. A new Bimodal Energy-Based Attention Module then aligns those features to the musical beats. This decoupling is claimed to deliver state-of-the-art music alignment while fully retaining the choreographic meaning specified by text. The work introduces a new dance dataset and an editability evaluation protocol to support these claims.

Core claim

STREAM is a modality-decoupled diffusion transformer in which AdaLN injects global text semantics to control kinematic structure and the BEAM module routes the resulting features to musical beats without overwriting the semantics, resulting in state-of-the-art alignment between motion and music while perfectly preserving choreographic semantics.

What carries the argument

The Bimodal Energy-Based Attention Module (BEAM) which routes text-conditioned motion features to musical beats while the Adaptive Layer Normalization (AdaLN) separately handles global text semantics.

If this is right

  • Models can achieve both high music synchronization and text-based editability in dance generation.
  • User controllability is maintained even when strong rhythmic conditioning from music is applied.
  • Frame-level semantic annotations in datasets improve training for semantic preservation.
  • New protocols like the Exchange Evaluation Protocol allow measurement of zero-shot editability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This separation strategy may apply to other tasks where one modality risks dominating another, such as text and audio in video generation.
  • Future work could test whether the same decoupling improves controllability in non-dance motion tasks.
  • If the method scales, it suggests that explicit pathway separation is a general solution to modality interference in generative models.

Load-bearing premise

That using separate pathways with AdaLN for text semantics and BEAM for music beats will eliminate modality collapse and preserve controllability without any loss in motion quality or alignment.

What would settle it

An experiment in which text prompts are exchanged mid-generation and the resulting motions fail to reflect the new semantics while still matching the music would falsify the preservation of controllability.

Figures

Figures reproduced from arXiv: 2606.22726 by Cornelia Ferm\"uller, Felix Gu, Seong Jong Yoo, Siyuan Peng, Stratis Aloimonos.

Figure 1
Figure 1. Figure 1: t-SNE visualization of dance (Motorica) and general motion (HumanML3D) of kinematic features [44] and geometric features [42]. and a unified metric, the Editable Dance Score (EDS). Existing editable dance generation pipelines rarely evaluate zero-shot editability under conflicting conditions (e.g., forcing a slow semantic dance onto a fast acoustic beat). By evaluating models on mismatched text-music pairs… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of STREAM. (a) Left: STREAM is first controlled by text (high-level concept and a low-level detailed description). Second, the music condition modulates the motion via the music alignment energy function, transforming general motion into dance-like motion aligned to musical beats. (b) Right: The Bimodal Energy Attention Module (BEAM) adaptively updates the text conditions (red line) via MAP estima… view at source ↗
Figure 3
Figure 3. Figure 3: Visualization example of self-similarity matrices of different music and BPM at first layer of D-EBCA. Text-AdaLN: Our text condition comprises both high-level information and low-level descriptions of body movements. Therefore, we apply text conditions globally to motion query through AdaLN [46]. Specifically, the motion query x is modulated as x ′ = γ(ct)⊙LayerNorm(x)+β(ct), where the scale γ and shift β… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative dance motion generation results with text and music conditioned, compared with other SoTA models. Baselines: Since we propose a ’text + music to motion generation pipeline’ with a new dataset, we retrain state-of-the-art dance models [40, 63, 64], text to motion models [6, 61, 70], and text+music-conditioned models [15, 66], following their original training recipes. 5.2 Evaluation of Dance Mot… view at source ↗
Figure 5
Figure 5. Figure 5: Visualization of dance motion editing example. The original motion contains two dance techniques: Charleston Opposites (green) and Charleston Messaround (blue). We first edit Charleston Opposites to Charleston Side to Side (red) while preserving the other. Similarly, we can edit one more time to Charleston Knock Knees (yellow). modality but weaker alignment in the other. Furthermore, existing audio-text mu… view at source ↗
Figure 6
Figure 6. Figure 6: Genre distribution and Motorica++ dance description examples. concentrates primarily on editing one dance motion into another, and MDD focuses on duet dance motions. On the other hand, Motorica++ is annotated with a total of 183 dance techniques and their corresponding descriptions ( [PITH_FULL_IMAGE:figures/full_fig_p022_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Custom video description correction tool. On the right window, we can edit the originally generated description, or we can regenerate by editing the prompt. 10 Method 10.1 Dual Energy-Based Cross Attention (D-EBCA) In the main manuscript, we propose a new energy function at D-EBCA mod￾ule( Sec. 4.3), which is E(Q;K_t,K_m) &= - \frac {1}{\beta }\sum _{l=1}^L \text {lse}(A_l, \beta ) + \mathcal {R}(Q, K_m) \… view at source ↗
Figure 8
Figure 8. Figure 8: Detailed Ablation architecture diagrams. Late Fusion is represented multiple layers, while others show one details of each layer. Black dots represents concatenation. For Late Fusion, we use N1 = 0.7N and N2 = 0.3N, where N is total number of layers. 11 Further Experiments Results 11.1 Ablation Study We conduct an ablation study on the effects of text context Bayesian update described at Sec. 4.3. As shown… view at source ↗
Figure 9
Figure 9. Figure 9: Visualization of music beats (red) and motion beats (blue) generated by the text-only STREAM model. Since the model only uses text information, it lacks the ability to adapt motion to different music. However, because dance motion inherently contains natural motion beats, the BAS can still be high even when paired with different music. adaptation (Smusic). Specifically, we use the finetuned normalized TMR … view at source ↗
Figure 10
Figure 10. Figure 10: Collaborative Dance Design Studio: A Collaborate choreography environment Tool. This conceptual Dance Design Studio demo consists of three main panels: Visual￾ization panel (blue), Timeline panel (green), and Editing panel (red). The Visualization panel displays a 3D generated character along with the current dance genre (Charleston) and dance technique (Opposites). The Timeline panel shows the performanc… view at source ↗
read the original abstract

Choreographic motion generation poses unique challenges for AI, demanding precise semantic control over complex, temporally structured, and expressive full-body dynamics. While existing models can synthesize motion from music, they remain largely black boxes. Conversely, attempting to condition generation on both text and music frequently leads to modality collapse, where dense acoustic rhythms overwhelm sparse semantic text prompts, destroying user controllability. To resolve this spatial-temporal conflict, we propose STREAM (Structural-Temporal Rhythmic Energy-based Attention for Motion), a modality-decoupled diffusion transformer. STREAM strictly separates conditioning pathways: global text semantics dictate the kinematic structure via Adaptive Layer Normalization (AdaLN), while a novel Bimodal Energy-Based Attention Module (BEAM) routes these features to the musical beat without overwriting the semantics. We further introduce Motorica++, a newly curated dataset enriched with domain-specific dance vocabulary and frame-level semantic annotations from existing Motorica dataset. Additionally, to rigorously quantify zero-shot editability, we propose the Exchange Evaluation Protocol and Editable Dance Score (EDS). Through extensive experiments, STREAM achieves state-of-the-art alignment between motion and music while perfectly preserving choreographic semantics, positioning AI not merely as a reactive synthesizer, but as a controllable, collaborative partner for artistic direction. The source code and datasets are available at https://github.com/SeongJong-Yoo/STREAM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces STREAM, a modality-decoupled diffusion transformer for generating editable dance motions from text and music inputs. It separates conditioning pathways by using Adaptive Layer Normalization (AdaLN) to let global text semantics dictate kinematic structure while a new Bimodal Energy-Based Attention Module (BEAM) routes features to musical beats. The work also contributes the Motorica++ dataset with frame-level semantic annotations and proposes the Exchange Evaluation Protocol together with the Editable Dance Score (EDS) metric to quantify zero-shot editability. The central claim is that this architecture achieves state-of-the-art music-motion alignment while perfectly preserving choreographic semantics, avoiding the modality collapse that typically occurs when both modalities are combined.

Significance. If the decoupling mechanism and quantitative claims are substantiated, the result would advance controllable multi-modal motion synthesis by demonstrating that text can dictate structure and music can decorate rhythm without semantic degradation. The Motorica++ dataset and EDS protocol could become useful benchmarks for editability in dance generation if externally validated.

major comments (2)
  1. [Abstract] Abstract: the assertion of 'state-of-the-art alignment' and 'perfectly preserving choreographic semantics' is presented without any quantitative results, ablation tables, or numerical EDS scores. This is load-bearing for the central claim because the paper's contribution rests on the superiority and zero-trade-off properties of the AdaLN+BEAM separation.
  2. [Abstract] Abstract: no ablation is described that compares semantic fidelity (via EDS or the Exchange Protocol) under text-only conditioning versus text+music conditioning on identical prompts. Without this comparison, it is impossible to verify that BEAM routing introduces zero semantic degradation, which directly tests the weakest assumption underlying the 'perfect preservation' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the two major comments on the abstract below, agreeing that stronger quantitative grounding is needed for the central claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the assertion of 'state-of-the-art alignment' and 'perfectly preserving choreographic semantics' is presented without any quantitative results, ablation tables, or numerical EDS scores. This is load-bearing for the central claim because the paper's contribution rests on the superiority and zero-trade-off properties of the AdaLN+BEAM separation.

    Authors: We agree that the abstract presents the claims at a high level without numerical support. The body of the manuscript (Section 4 and Table 2) contains the EDS scores, music-alignment metrics, and ablations demonstrating SOTA performance and semantic preservation. To make the abstract self-contained and directly responsive to this concern, we will revise it to include the key quantitative results (e.g., EDS values and alignment deltas) from the experiments. revision: yes

  2. Referee: [Abstract] Abstract: no ablation is described that compares semantic fidelity (via EDS or the Exchange Protocol) under text-only conditioning versus text+music conditioning on identical prompts. Without this comparison, it is impossible to verify that BEAM routing introduces zero semantic degradation, which directly tests the weakest assumption underlying the 'perfect preservation' claim.

    Authors: The manuscript reports that adding BEAM does not degrade EDS relative to text-only baselines, but we acknowledge that an explicit side-by-side ablation on identical prompts (text-only vs. text+music) is not described in the abstract and is only summarized rather than isolated in the main text. We will add this targeted comparison (new table or subsection) in the revision to directly substantiate the zero-degradation claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on new empirical components

full rationale

The paper introduces a novel architecture (STREAM using AdaLN for text and BEAM for music routing), a new dataset (Motorica++ with added annotations), and new metrics/protocols (Exchange Evaluation Protocol and EDS). All central claims of SOTA alignment and semantic preservation are presented as outcomes of experiments on these new elements rather than any derivation that reduces to fitted inputs, self-citations, or renamed known results. No load-bearing self-citations, uniqueness theorems, or ansatzes appear in the provided text; the work is self-contained against external benchmarks with no evidence of predictions equaling inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Based solely on abstract; full model details, training procedure, and assumptions unavailable. Standard diffusion model assumptions are implicit.

axioms (1)
  • domain assumption Diffusion transformers can be conditioned on text and audio modalities
    Standard assumption in generative motion models.
invented entities (1)
  • Bimodal Energy-Based Attention Module (BEAM) no independent evidence
    purpose: Routes text features to musical beats without overwriting semantics
    Newly proposed module to address modality conflict

pith-pipeline@v0.9.1-grok · 5781 in / 1071 out tokens · 18032 ms · 2026-06-26T09:13:42.189238+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

73 extracted references · 31 canonical work pages · 6 internal anchors

  1. [1]

    ACM Trans

    Alexanderson, S., Nagy, R., Beskow, J., Henter, G.E.: Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models. ACM Trans. Graph.42(4), 44:1–44:20 (2023).https://doi.org/10.1145/3592458 , https://dl.acm.org/ doi/10.1145/3592458

  2. [2]

    Routledge, London (Aug 2023).https://doi.org/10.4324/9781003009764

    Angelov, V.: You, the Choreographer: Creating and Crafting Dance. Routledge, London (Aug 2023).https://doi.org/10.4324/9781003009764

  3. [3]

    Athanasiou, N., Ceske, A., Diomataris, M., Black, M.J., Varol, G.: MotionFix: Text-Driven 3D Human Motion Editing (Sep 2024).https://doi.org/10.48550/ arXiv.2408.00712,http://arxiv.org/abs/2408.00712

  4. [4]

    Frontiers in Human Neuroscience14, 584312 (Jan 2021)

    Basso, J.C., Satyal, M.K., Rugh, R.: Dance on the Brain: Enhancing Intra- and Inter-Brain Synchrony. Frontiers in Human Neuroscience14, 584312 (Jan 2021). https://doi.org/10.3389/fnhum.2020.584312 , https://pmc.ncbi.nlm.nih. gov/articles/PMC7832346/

  5. [5]

    Frontiers in Psychology6(Apr 2015)

    Cameron, D.J., Bentley, J., Grahn, J.A.: Cross-cultural influences on rhythm process- ing: Reproduction, discrimination, and beat tapping. Frontiers in Psychology6(Apr 2015). https://doi.org/10.3389/fpsyg.2015.00366, https://www.frontiersin. org/journals/psychology/articles/10.3389/fpsyg.2015.00366/full

  6. [6]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Chen, X., Jiang, B., Liu, W., Huang, Z., Fu, B., Chen, T., Yu, G.: Executing Your Commands via Motion Diffusion in Latent Space. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18000–18010 (2023), https://openaccess.thecvf.com/content/CVPR2023/html/Chen_Executing_ Your_Commands_via_Motion_Diffusion_in_Latent_Spac...

  7. [7]

    arXiv preprint arXiv:2005.00341 (2020),https: //assets.pubpub.org/2gnzbcnd/11608661311181.pdf

    Dhariwal, P., Jun, H., Payne, C., Kim, J.W., Radford, A., Sutskever, I.: Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341 (2020),https: //assets.pubpub.org/2gnzbcnd/11608661311181.pdf

  8. [8]

    Du, Y., Durkan, C., Strudel, R., Tenenbaum, J.B., Dieleman, S., Fergus, R., Sohl- Dickstein, J., Doucet, A., Grathwohl, W.: Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and MCMC (Sep 2024).https: //doi.org/10.48550/arXiv.2302.11552,http://arxiv.org/abs/2302.11552

  9. [9]

    In: Advances in Neural Information Processing Systems

    Du, Y., Li, S., Mordatch, I.: Compositional Visual Generation with Energy Based Models. In: Advances in Neural Information Processing Systems. vol. 33, pp. 6637–

  10. [10]

    (2020),https://proceedings.neurips.cc/paper/ 2020/hash/49856ed476ad01fcff881d57e161d73f-Abstract.html

    Curran Associates, Inc. (2020),https://proceedings.neurips.cc/paper/ 2020/hash/49856ed476ad01fcff881d57e161d73f-Abstract.html

  11. [11]

    https://doi.org/10.48550/arXiv.2406.11179 , http:// arxiv.org/abs/2406.11179

    Du, Y., Mao, J., Tenenbaum, J.B.: Learning Iterative Reasoning through Energy Diffusion (Jun 2024). https://doi.org/10.48550/arXiv.2406.11179 , http:// arxiv.org/abs/2406.11179

  12. [12]

    arXiv preprint arXiv:2503.17340 (2025) 16 S.J

    Fan, C., Guan, J., Zhao, X., Xu, D., Lin, Y., Ye, T., Feng, P., Pan, H.: Align your rhythm: Generating highly aligned dance poses with gating-enhanced rhythm-aware feature representation. arXiv preprint arXiv:2503.17340 (2025) 16 S.J. Yoo, et al

  13. [13]

    In: Proceedings of the Seventh ACM International Conference on Multimedia (Part 1)

    Foote, J.: Visualizing music and audio using self-similarity. In: Proceedings of the Seventh ACM International Conference on Multimedia (Part 1). pp. 77–80. MULTIMEDIA ’99, Association for Computing Machinery, New York, NY, USA (Oct 1999). https://doi.org/10.1145/319463.319472 , https://dl.acm.org/ doi/10.1145/319463.319472

  14. [14]

    Gladstone, A., Nanduru, G., Islam, M.M., Han, P., Ha, H., Chadha, A., Du, Y., Ji, H., Li, J., Iqbal, T.: Energy-Based Transformers are Scalable Learners and Thinkers (Jul 2025).https://doi.org/10.48550/arXiv.2507.02092, http: //arxiv.org/abs/2507.02092

  15. [15]

    In: ACM SIGGRAPH 2024 Conference Papers

    Goel, P., Wang, K.C., Liu, C.K., Fatahalian, K.: Iterative Motion Editing with Natural Language. In: ACM SIGGRAPH 2024 Conference Papers. pp. 1–9. SIG- GRAPH ’24, Association for Computing Machinery, New York, NY, USA (Jul 2024). https://doi.org/10.1145/3641519.3657447, https://dl.acm.org/doi/ 10.1145/3641519.3657447

  16. [16]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vi- sion

    Gong, K., Lian, D., Chang, H., Guo, C., Jiang, Z., Zuo, X., Mi, M.B., Wang, X.: TM2D: Bimodality Driven 3D Dance Generation via Music-Text Integration. In: Proceedings of the IEEE/CVF International Conference on Computer Vi- sion. pp. 9942–9952 (2023),https://openaccess.thecvf.com/content/ICCV2023/ html/Gong_TM2D_Bimodality_Driven_3D_Dance_Generation_via_...

  17. [17]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Gong, K., Lian, D., Chang, H., Guo, C., Jiang, Z., Zuo, X., Mi, M.B., Wang, X.: Tm2d: Bimodality driven 3d dance generation via music-text integration. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9942–9952 (2023)

  18. [18]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Guo, C., Zou, S., Zuo, X., Wang, S., Ji, W., Li, X., Cheng, L.: Generating Diverse and Natural 3D Human Motions From Text. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5152–5161 (2022), https://openaccess.thecvf.com/content/CVPR2022/html/Guo_Generating_ Diverse_and_Natural_3D_Human_Motions_From_Text_CVPR_2022_...

  19. [19]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2025)

    Gupta, P., Fotso-Puepi, J.A., Li, Z., Mehta, J., Bera, A.: Mdd: A dataset for text- and-music conditioned duet dance generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2025)

  20. [20]

    In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics

    Gutmann, M., Hyvärinen, A.: Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. pp. 297–304. JMLR Workshop and Conference Proceedings (Mar 2010),https://proceedings.mlr. press/v9/gutmann10a.html

  21. [21]

    arXiv preprint arXiv:2312.15946 (2023)

    Han, B., Ren, Y., Peng, H., Zhang, T., Ling, Z., Yin, X., Han, F.: Enchant- dance: Unveiling the potential of music-driven dance movement. arXiv preprint arXiv:2312.15946 (2023)

  22. [22]

    In: Advances in Neural Information Processing Systems

    Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilib- rium. In: Advances in Neural Information Processing Systems. vol. 30. Curran Associates, Inc. (2017), https://proceedings.neurips.cc/paper/2017/hash/ 8a1d694707eb0fefe65871369074926d-Abstract.html

  23. [23]

    Neural Computation14(8), 1771–1800 (Aug 2002).https://doi.org/10

    Hinton, G.E.: Training Products of Experts by Minimizing Contrastive Diver- gence. Neural Computation14(8), 1771–1800 (Aug 2002).https://doi.org/10. 1162/089976602760128018, https://ieeexplore.ieee.org/abstract/document/ 6789337 STREAM 17

  24. [24]

    In: Ad- vances in Neural Information Processing Systems

    Ho, J., Jain, A., Abbeel, P.: Denoising Diffusion Probabilistic Models. In: Ad- vances in Neural Information Processing Systems. vol. 33, pp. 6840–6851. Curran Associates, Inc. (2020), https://proceedings.neurips.cc/paper/2020/hash/ 4c5bcfec8584af0d967f1ab10179ca4b-Abstract.html

  25. [25]

    Classifier-Free Diffusion Guidance

    Ho, J., Salimans, T.: Classifier-Free Diffusion Guidance (Jul 2022).https://doi. org/10.48550/arXiv.2207.12598,http://arxiv.org/abs/2207.12598

  26. [26]

    https://doi.org/10.48550/arXiv.2006.06119 , http://arxiv.org/abs/2006

    Huang, R., Hu, H., Wu, W., Sawada, K., Zhang, M., Jiang, D.: Dance Revolution: Long-Term Dance Generation with Music via Curriculum Learning (Sep 2023). https://doi.org/10.48550/arXiv.2006.06119 , http://arxiv.org/abs/2006. 06119

  27. [27]

    In: European Conference on Computer Vision (ECCV) (2024)

    Huang, Y., Wan, W., Yang, Y., Callison-Burch, C., Yatskar, M., Liu, L.: Como: Controllable motion generation through language guided pose code editing. In: European Conference on Computer Vision (ECCV) (2024)

  28. [28]

    In: European Conference on Computer Vision

    Huang, Z., Xu, X., Xu, C., Zhang, H., Zheng, C., Qin, J., He, S.: Beat-it: Beat- synchronized multi-condition 3d dance generation. In: European Conference on Computer Vision. pp. 273–290. Springer (2024)

  29. [29]

    Journal of Machine Learning Research6(4) (2005),https://www.jmlr

    Hyvärinen, A., Dayan, P.: Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research6(4) (2005),https://www.jmlr. org/papers/volume6/hyvarinen05a/hyvarinen05a.pdf

  30. [30]

    Dance Research Journal32(1), 116–125 (2000).https://doi.org/10.2307/1478285, https://www

    Kaeppler, A.L.: Dance Ethnology and the Anthropology of Dance. Dance Research Journal32(1), 116–125 (2000).https://doi.org/10.2307/1478285, https://www. jstor.org/stable/1478285

  31. [31]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition

    Kulkarni,N.,Rempe,D.,Genova,K.,Kundu,A.,Johnson,J.,Fouhey,D.,Guibas,L.: NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition. pp. 947–957 (2024),https://openaccess.thecvf.com/content/CVPR2024/ html/Kulkarni_NIFTY_Neural_Object_Interaction_Fields_f...

  32. [32]

    05897,http://arxiv.org/abs/2212.05897

    Lee, T., Moon, G., Lee, K.M.: MultiAct: Long-Term 3D Human Motion Generation from Multiple Action Labels (Feb 2023).https://doi.org/10.48550/arXiv.2212. 05897,http://arxiv.org/abs/2212.05897

  33. [33]

    Li, J., Yin, Y., Chu, H., Zhou, Y., Wang, T., Fidler, S., Li, H.: Learning to Generate Diverse Dance Motions with Transformer (Aug 2020).https://doi.org/10.48550/ arXiv.2008.08171,http://arxiv.org/abs/2008.08171

  34. [34]

    Li, J., Cao, J., Zhang, H., Rempe, D., Kautz, J., Iqbal, U., Yuan, Y.: GENMO: A GENeralist Model for Human MOtion (May 2025).https://doi.org/10.48550/ arXiv.2505.01425,http://arxiv.org/abs/2505.01425

  35. [35]

    Chemberta-2: Towards chemical foundation models.CoRR, abs/2209.01712, 2022

    Li, R., Zhang, Y., Zhang, Y., Zhang, H., Guo, J., Zhang, Y., Liu, Y., Li, X.: Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation Guided by the Characteristic Dance Primitives (Apr 2024).https://doi.org/10.48550/arXiv. 2403.10518,http://arxiv.org/abs/2403.10518

  36. [36]

    In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision

    Li, R., Zhao, J., Zhang, Y., Su, M., Ren, Z., Zhang, H., Tang, Y., Li, X.: FineDance: A Fine-grained Choreography Dataset for 3D Full Body Dance Generation. In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10234– 10243 (2023), https://openaccess.thecvf.com/content/ICCV2023/html/Li_ FineDance_A_Fine-grained_Choreography_Da...

  37. [37]

    Li, R., Yang, S., Ross, D.A., Kanazawa, A.: Ai choreographer: Music conditioned 3d dance generation with aist++ (2021)

  38. [38]

    Yoo, et al

    Li, R., Yang, S., Ross, D.A., Kanazawa, A.: Learn to dance with aist++: Music conditioned 3d dance generation (2021) 18 S.J. Yoo, et al

  39. [39]

    Advances in Neural Information Processing Systems36, 25268– 25280 (Dec 2023), https : / / proceedings

    Lin, J., Zeng, A., Lu, S., Cai, Y., Zhang, R., Wang, H., Zhang, L.: Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset. Advances in Neural Information Processing Systems36, 25268– 25280 (Dec 2023), https : / / proceedings . neurips . cc / paper _ files / paper / 2023 / hash / 4f8e27f6036c1d8b4a66b5b3a947dd7b - Abstract - Datasets _ and ...

  40. [40]

    Loper, N

    Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: A skinned multi-person linear model. ACM Transactions on Graphics34(6), 248:1–248:16 (2015). https://doi.org/10.1145/2816795.2818013, https://dl.acm.org/doi/ 10.1145/2816795.2818013

  41. [41]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Luo, Z., Ren, M., Hu, X., Huang, Y., Yao, L.: POPDG: Popular 3D Dance Genera- tion with PopDanceSet. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 26984–26993 (2024)

  42. [42]

    https://doi.org/10.48550/arXiv.2505.19377 , http://arxiv.org/abs/2505.19377

    Meng, Z., Han, Z., Peng, X., Xie, Y., Jiang, H.: Absolute Coordinates Make Motion Generation Easy (May 2025). https://doi.org/10.48550/arXiv.2505.19377 , http://arxiv.org/abs/2505.19377

  43. [43]

    In: ACM SIGGRAPH 2005 Papers

    Müller, M., Röder, T., Clausen, M.: Efficient content-based retrieval of motion cap- ture data. In: ACM SIGGRAPH 2005 Papers. pp. 677–685. SIGGRAPH ’05, Associ- ation for Computing Machinery, New York, NY, USA (Jul 2005).https://doi.org/ 10.1145/1186822.1073247, https://dl.acm.org/doi/10.1145/1186822.1073247

  44. [44]

    In: Handbook of Markov Chain Monte Carlo

    Neal, R.M.: MCMC Using Hamiltonian Dynamics. In: Handbook of Markov Chain Monte Carlo. Chapman and Hall/CRC (2011)

  45. [45]

    Eurographics (Short Papers)7(10) (2008), http://reports-archive.adm.cs.cmu.edu/anon/2007/CMU-CS-07-164.pdf

    Onuma, K., Faloutsos, C., Hodgins, J.K.: FMDistance: A Fast and Effective Distance Function for Motion Capture Data. Eurographics (Short Papers)7(10) (2008), http://reports-archive.adm.cs.cmu.edu/anon/2007/CMU-CS-07-164.pdf

  46. [46]

    Advances in Neural Information Processing Systems36, 76382–76408 (Dec 2023), https : / / proceedings

    Park, G.Y., Kim, J., Kim, B., Lee, S.W., Ye, J.C.: Energy-Based Cross Attention for Bayesian Context Update in Text-to-Image Diffusion Models. Advances in Neural Information Processing Systems36, 76382–76408 (Dec 2023), https : / / proceedings . neurips . cc / paper _ files / paper / 2023 / hash / f0878b7efa656b3bbd407c9248d13751-Abstract-Conference.html

  47. [47]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Peebles, W., Xie, S.: Scalable Diffusion Models with Transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4195–4205 (2023), https://openaccess.thecvf.com/content/ICCV2023/html/Peebles_Scalable_ Diffusion_Models_with_Transformers_ICCV_2023_paper.html

  48. [48]

    arXiv preprint arXiv:2404.00054 (2024)

    Peng, S., Ladenheim, K., Shrestha, S., Fermüller, C.: Choreographing the digi- tal canvas: A machine learning approach to artistic performance. arXiv preprint arXiv:2404.00054 (2024)

  49. [49]

    In: Proceedings of the 9th International Conference on Movement and Computing

    Peng, S., Ladenheim, K., Shrestha, S., Fermüller, C.: Generation of novel fall animation with configurable attributes. In: Proceedings of the 9th International Conference on Movement and Computing. MOCO ’24, Association for Computing Machinery, New York, NY, USA (2024). https://doi.org/10.1145/3658852. 3659087,https://doi.org/10.1145/3658852.3659087

  50. [50]

    In: International Conference on Computer Vision (ICCV) (2023)

    Petrovich, M., Black, M.J., Varol, G.: TMR: Text-to-motion retrieval using con- trastive 3D human motion synthesis. In: International Conference on Computer Vision (ICCV) (2023)

  51. [51]

    Big Data4(4), 236–252 (Dec 2016).https://doi.org/10.1089/big.2016.0028, https: //www.liebertpub.com/doi/abs/10.1089/big.2016.0028

    Plappert, M., Mandery, C., Asfour, T.: The KIT Motion-Language Dataset. Big Data4(4), 236–252 (Dec 2016).https://doi.org/10.1089/big.2016.0028, https: //www.liebertpub.com/doi/abs/10.1089/big.2016.0028

  52. [52]

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning Transferable STREAM 19 Visual Models From Natural Language Supervision (Feb 2021).https://doi.org/ 10.48550/arXiv.2103.00020,http://arxiv.org/abs/2103.00020

  53. [53]

    Hopfield Networks is All You Need

    Ramsauer, H., Schäfl, B., Lehner, J., Seidl, P., Widrich, M., Adler, T., Gruber, L., Holzleitner, M., Pavlović, M., Sandve, G.K., Greiff, V., Kreil, D., Kopp, M., Klambauer, G., Brandstetter, J., Hochreiter, S.: Hopfield Networks is All You Need (Apr 2021). https://doi.org/10.48550/arXiv.2008.02217, http://arxiv.org/ abs/2008.02217

  54. [54]

    In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F

    Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional Networks for Biomed- ical Image Segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. pp. 234–241. Springer International Publishing, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28

  55. [55]

    arXiv preprint arXiv:2504.04634 (2025)

    Shah, F.N., Shah, P., Saleem, M.U., Pinyoanuntapong, E., Wang, P., Xue, H., Helmy, A.: Dancemosaic: High-fidelity dance generation with multimodal editability. arXiv preprint arXiv:2504.04634 (2025)

  56. [56]

    Proceedings of the National Academy of Sciences of the United States of America110(1), 70– 75 (Jan 2013).https://doi.org/10.1073/pnas.1209023110, https://pmc.ncbi

    Sievers, B., Polansky, L., Casey, M., Wheatley, T.: Music and movement share a dynamic structure that supports universal expressions of emotion. Proceedings of the National Academy of Sciences of the United States of America110(1), 70– 75 (Jan 2013).https://doi.org/10.1073/pnas.1209023110, https://pmc.ncbi. nlm.nih.gov/articles/PMC3538264/

  57. [57]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Siyao, L., Yu, W., Gu, T., Lin, C., Wang, Q., Qian, C., Loy, C.C., Liu, Z.: Bai- lando: 3D Dance Generation by Actor-Critic GPT With Choreographic Memory. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11050–11059 (2022),https://openaccess.thecvf.com/content/ CVPR2022/html/Siyao_Bailando_3D_Dance_Generation_by_A...

  58. [58]

    In: Proceedings of the 32nd International Conference on Machine Learning

    Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep Unsupervised Learning using Nonequilibrium Thermodynamics. In: Proceedings of the 32nd International Conference on Machine Learning. pp. 2256–2265. PMLR (Jun 2015), https://proceedings.mlr.press/v37/sohl-dickstein15.html

  59. [59]

    Song, Y., Kingma, D.P.: How to Train Your Energy-Based Models (Feb 2021).https: //doi.org/10.48550/arXiv.2101.03288,http://arxiv.org/abs/2101.03288

  60. [60]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-Based Generative Modeling through Stochastic Differential Equations (Feb 2021). https://doi.org/10.48550/arXiv.2011.13456 , http://arxiv.org/abs/ 2011.13456

  61. [61]

    Sun, J., Wang, C., Hu, H., Lai, H., Jin, Z., Hu, J.F.: You Never Stop Dancing: Non-freezing Dance Generation via Bank-constrained Manifold Pro- jection. Advances in Neural Information Processing Systems35, 9995–10007 (Dec 2022), https://proceedings.neurips.cc/paper_files/paper/2022/hash/ 40bfe6177e8aed33c982264cf9e6e62c-Abstract-Conference.html

  62. [62]

    Human Motion Diffusion Model

    Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., Bermano, A.H.: Human Motion Diffusion Model (Oct 2022). https://doi.org/10.48550/arXiv.2209. 14916,http://arxiv.org/abs/2209.14916

  63. [63]

    Tripathi, S., Taheri, O., Lassner, C., Black, M.J., Holden, D., Stoll, C.: HUMOS: Human Motion Model Conditioned on Body Shape (Sep 2024).https://doi.org/ 10.48550/arXiv.2409.03944,http://arxiv.org/abs/2409.03944

  64. [64]

    In: Proceedings of the Asian Conference on Computer Vision

    Truong-Thuy, T.V., Bui-Le, G.C., Nguyen, H.D., Le, T.N.: Rethinking Sampling for Music-Driven Long-Term Dance Generation. In: Proceedings of the Asian Conference on Computer Vision. pp. 2667–2683 (2024),https://openaccess. 20 S.J. Yoo, et al. thecvf.com/content/ACCV2024/html/Truong-Thuy_Rethinking_Sampling_for_ Music-Driven_Long-Term_Dance_Generation_ACCV...

  65. [65]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Tseng, J., Castellon, R., Liu, K.: Edge: Editable dance generation from music. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 448–458 (2023)

  66. [66]

    arXiv preprint arXiv:2406.07871 (2024)

    Wang, H., Zhu, Y., Geng, X.: Flexible music-conditioned dance generation with style description prompts. arXiv preprint arXiv:2406.07871 (2024)

  67. [67]

    Proceedings of the AAAI Conference on Artificial Intelligence39(24), 25615–25623 (Apr 2025)

    Yang, H., Su, K., Zhang, Y., Chen, J., Qian, K., Liu, G., Gan, C.: UniMuMo: Unified Text, Music, and Motion Generation. Proceedings of the AAAI Conference on Artificial Intelligence39(24), 25615–25623 (Apr 2025). https://doi.org/ 10.1609/aaai.v39i24.34752, https://ojs.aaai.org/index.php/AAAI/article/ view/34752

  68. [68]

    In: Advances in Neural Information Processing Systems

    Yuille, A.L., Rangarajan, A.: The Concave-Convex Procedure (CCCP). In: Advances in Neural Information Processing Systems. vol. 14. MIT Press (2001), https : / / proceedings . neurips . cc / paper / 2001 / hash / a012869311d64a44b5a0d567cd20de04-Abstract.html

  69. [69]

    https://doi.org/10.48550/arXiv.2508.17342, http: //arxiv.org/abs/2508.17342

    Zhang, H., Li, Z., Qi, X., Li, M., Sun, M., Zhang, M., Han, S.: DanceEditor: Towards Iterative Editable Music-driven Dance Generation with Open-Vocabulary Descriptions (Aug 2025). https://doi.org/10.48550/arXiv.2508.17342, http: //arxiv.org/abs/2508.17342

  70. [70]

    Zhang, J., Fan, H., Yang, Y.: EnergyMoGen: Compositional Human Motion Gen- eration with Energy-Based Diffusion Model in Latent Space (Dec 2024).https: //doi.org/10.48550/arXiv.2412.14706,http://arxiv.org/abs/2412.14706

  71. [71]

    48550/arXiv.2304.01116,http://arxiv.org/abs/2304.01116

    Zhang, M., Guo, X., Pan, L., Cai, Z., Hong, F., Li, H., Yang, L., Liu, Z.: ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model (Apr 2023).https://doi.org/10. 48550/arXiv.2304.01116,http://arxiv.org/abs/2304.01116

  72. [72]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Zhou, Y., Barnes, C., Lu, J., Yang, J., Li, H.: On the Continuity of Rotation Representations in Neural Networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5745–5753 (2019),https:// openaccess.thecvf.com/content_CVPR_2019/html/Zhou_On_the_Continuity_ of_Rotation_Representations_in_Neural_Networks_CVPR_2019_...

  73. [73]

    :$𝑐%𝑐!𝑡 + FF Layer Layer NormAdaLNScale/Shift AdaLNScale Layer Norm+ Cross w/o AdaLN Transformer Encoder Cross-Attention 𝑥!

    Zhuang, W., Wang, C., Chai, J., Wang, Y., Shao, M., Xia, S.: Music2dance: Dancenet for music-driven dance generation. ACM Trans. Multimedia Comput. Commun. Appl.18(2) (Feb 2022). https://doi.org/10.1145/3485664 , https: //doi.org/10.1145/3485664 STREAM 21 Supplementary Material: Text Dictates, Music Decorates: Energy- based Attention for Editable Dance Mo...