pith. machine review for the scientific record.

arxiv: 2604.17005 · v1 · submitted 2026-04-18 · 💻 cs.CV · cs.SD

Recognition: unknown

TeMuDance: Contrastive Alignment-Based Textual Control for Music-Driven Dance Generation


Pith reviewed 2026-05-10 07:34 UTC · model grok-4.3

classification 💻 cs.CV cs.SD
keywords music-driven dance generation · text-conditioned control · contrastive alignment · motion embedding · diffusion model · dataset bridging · fine-tuning strategy · kinematic evaluation metric

The pith

Motion embeddings align separate music-dance and text-motion datasets so a text branch can be added to a frozen music-to-dance diffusion model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to add natural-language control to music-driven dance generation without first collecting a large music-text-motion triplet dataset. It treats motion sequences as the common semantic anchor that lets a contrastive model retrieve text supervision from an existing text-motion corpus and music supervision from a music-dance corpus. A lightweight text-control branch is then trained on top of a frozen diffusion backbone using dual-stream fine-tuning and confidence filtering to suppress retrieval noise. The result is dance that remains rhythmically faithful to the music yet follows the kinematic intent of the text prompt. A reader would care because the method turns an otherwise data-scarce problem into one that reuses existing, separately collected datasets.

Core claim

TeMuDance introduces a motion-centred bridging paradigm that projects music-dance and text-motion pairs into a shared embedding space via contrastive alignment. The aligned motion embeddings then serve as queries to retrieve the missing modality for each sample, after which a lightweight text-control branch is trained on a frozen music-to-dance diffusion backbone with dual-stream fine-tuning and confidence-based filtering. The result is competitive dance quality together with substantially stronger adherence to textual kinematic instructions.

What carries the argument

Motion-centred bridging paradigm that uses contrastive alignment on motion embeddings to retrieve cross-modal supervision from disjoint datasets.
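
A minimal sketch of how this bridging step could look in PyTorch, assuming hypothetical encoders (motion_enc, text_enc, music_enc), an InfoNCE objective, and nearest-neighbour retrieval in the shared space; the temperature, batch layout, and function names are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.07):
    """Symmetric InfoNCE between two batches of embeddings of shape (B, D)."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # matched pairs sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def alignment_step(motion_enc, text_enc, music_enc, tm_batch, md_batch):
    """Pull each modality toward the motion embedding of its own paired sample."""
    loss_tm = info_nce(motion_enc(tm_batch["motion"]), text_enc(tm_batch["text"]))   # text-motion corpus
    loss_md = info_nce(motion_enc(md_batch["dance"]), music_enc(md_batch["music"]))  # music-dance corpus
    return loss_tm + loss_md

@torch.no_grad()
def retrieve_text_for_dance(dance_motion_emb, text_bank_emb, k=1):
    """Query the text-motion bank with a dance-motion embedding; similarity doubles as confidence."""
    sims = F.normalize(dance_motion_emb, dim=-1) @ F.normalize(text_bank_emb, dim=-1).t()
    conf, idx = sims.topk(k, dim=-1)
    return idx, conf
```

Each music-dance sample would then be paired with the caption of its nearest motion neighbour, with the similarity kept as a confidence score for the filtering step discussed below.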

If this is right

  • Text prompts can now specify particular movement styles or body-part actions while the generated dance stays synchronized with the input music.
  • Training no longer requires expensive manual annotation of music-text-motion triplets.
  • The original music-to-dance diffusion backbone can be kept frozen, preserving its rhythmic fidelity.
  • A new task-aligned metric directly measures whether a prompt produces the intended kinematic attributes under music conditioning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same bridging idea could be applied to other pairs of generative tasks that share a common modality, such as text-controlled video from audio.
  • Reducing the need for fully aligned triplet data may accelerate controllable generation research in other creative domains.
  • Interactive choreography tools could let users iteratively refine dance sequences with natural language while the music track stays fixed.

Load-bearing premise

Motion embeddings retrieved from separate datasets are accurate enough semantic bridges that the confidence filter can remove most of the remaining misalignment noise before fine-tuning.
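
A hedged sketch of how the confidence filter and the frozen-backbone fine-tuning could fit together; the module interfaces, the q_sample noising helper, the threshold value, and the consistency term (one possible reading of "dual-stream") are assumptions for illustration, not the paper's actual objective.

```python
import torch
import torch.nn.functional as F

def dual_stream_step(backbone, text_branch, batch, conf_threshold=0.7, consistency_weight=0.5):
    """One illustrative fine-tuning step: frozen backbone, trainable text branch.

    All batch fields are tensors; batch["text"] holds already-tokenized retrieved captions,
    batch["conf"] their retrieval confidence from the bridging step.
    """
    for p in backbone.parameters():
        p.requires_grad_(False)                          # the diffusion backbone is never updated

    keep = batch["conf"] >= conf_threshold               # confidence-based filtering of retrieved captions
    if not keep.any():
        return None
    music, motion, text = batch["music"][keep], batch["motion"][keep], batch["text"][keep]

    noise = torch.randn_like(motion)
    t = torch.randint(0, 1000, (motion.size(0),), device=motion.device)
    noisy = q_sample(motion, t, noise)                   # assumed DDPM forward-noising helper

    # Stream 1: music-only prediction from the frozen backbone (reference rhythm behaviour).
    with torch.no_grad():
        eps_music = backbone(noisy, t, music_cond=music)

    # Stream 2: music plus text, injected through the lightweight control branch.
    eps_text = backbone(noisy, t, music_cond=music, extra_cond=text_branch(text))

    # Denoising loss plus a consistency term keeping the text-conditioned prediction close to
    # the frozen music-only behaviour; only text_branch parameters receive gradients.
    return F.mse_loss(eps_text, noise) + consistency_weight * F.mse_loss(eps_text, eps_music)
```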

What would settle it

Run the proposed text-aligned kinematic metric on a held-out set of prompts; if the metric scores remain near the level of the unconditioned baseline or if dance quality metrics drop sharply, the bridging-plus-filtering claim is falsified.
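
One way to operationalize that test, as a sketch: compare the task-aligned scores of text-conditioned generations against the unconditioned music-only baseline on held-out prompts, and check that dance quality does not collapse. The thresholds below are placeholders, not values proposed by the paper.

```python
import numpy as np

def settles_it(prompted_scores, baseline_scores, fid_prompted, fid_baseline,
               min_gain=0.05, max_quality_drop=0.10):
    """Decision rule for the falsification test above (threshold values are assumptions).

    prompted_scores / baseline_scores: task-aligned metric on held-out prompts for
    text-conditioned generation vs. the unconditioned (music-only) baseline.
    fid_*: any dance-quality metric where lower is better, e.g. motion FID.
    """
    gain = float(np.mean(prompted_scores) - np.mean(baseline_scores))
    quality_drop = (fid_prompted - fid_baseline) / max(fid_baseline, 1e-8)
    if gain <= min_gain or quality_drop > max_quality_drop:
        return "falsified: text control is near baseline or quality dropped sharply"
    return "supported on this test"
```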

Figures

Figures reproduced from arXiv: 2604.17005 by Diptesh Kanojia, Wenwu Wang, Xinran Liu, Zhenhua Feng.

Figure 1: The proposed TeMuDance method is able to gener…
Figure 2: An overview of TeMuDance. We learn a motion-centred bank by contrastively aligning disjoint text–motion and…
Figure 3: Architecture of the music-conditioned diffusion…
Figure 4: Illustration of the Motion-Centred Bank.
Figure 5: Visual pipeline of Dual-Stream Training.
Figure 6: Text–music controllability at inference.
Figure 7: Visual comparison of the generated dance between the proposed method and TM2D [9].
Figure 8: Visual comparisons of the ablation designs and our…
Original abstract

Existing music-driven dance generation approaches have achieved strong realism and effective audio-motion alignment. However, they generally lack semantic controllability, making it difficult to guide specific movements through natural language descriptions. This limitation primarily stems from the absence of large-scale datasets that jointly align music, text, and motion for supervised learning of text-conditioned control. To address this challenge, we propose TeMuDance, a framework that enables text-based control for music-conditioned dance generation without requiring any manually annotated music-text-motion triplet dataset. TeMuDance introduces a motion-centred bridging paradigm that leverages motion as a shared semantic anchor to align disjoint music-dance and text-motion datasets within a unified embedding space, enabling cross-modal retrieval of missing modalities for end-to-end training. A lightweight text control branch is then trained on top of a frozen music-to-dance diffusion backbone, preserving rhythmic fidelity while enabling fine-grained semantic guidance. To further suppress noise inherent in the retrieved supervision, we design a dual-stream fine-tuning strategy with confidence-based filtering. We also propose a novel task-aligned metric that quantifies whether textual prompts induce the intended kinematic attributes under music conditioning. Extensive experiments demonstrate that TeMuDance achieves competitive dance quality while substantially improving text-conditioned control over existing methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes TeMuDance, a framework for adding text-based semantic control to music-driven dance generation without requiring manually annotated music-text-motion triplets. It uses motion embeddings as a shared semantic anchor to align disjoint music-dance and text-motion datasets via contrastive learning, enabling cross-modal retrieval of supervision. A lightweight text-control branch is trained atop a frozen music-to-dance diffusion backbone using dual-stream fine-tuning and confidence-based filtering to suppress noise from retrieval. The authors also introduce a task-aligned metric to quantify whether text prompts induce intended kinematic attributes under music conditioning. Experiments are claimed to show competitive dance quality alongside substantially improved text-conditioned control relative to prior methods.

Significance. If the core bridging and filtering steps deliver high-fidelity supervision, the approach would meaningfully reduce the data bottleneck for controllable multimodal generation and could generalize to other audio-visual or text-motion tasks. The decision to freeze the diffusion backbone while adding a lightweight control branch is a pragmatic strength that helps preserve rhythmic fidelity. However, the significance is currently limited by the absence of direct evidence that the retrieved pairs achieve the required kinematic-level semantic fidelity across heterogeneous datasets.

major comments (3)
  1. [Abstract / §3] Abstract and §3 (motion-centred bridging paradigm): the claim that contrastive alignment on motion embeddings yields sufficiently clean cross-modal supervision for fine-grained text control is load-bearing, yet no retrieval-precision metrics, noise-rate statistics, or kinematic-fidelity analysis (e.g., joint-angle or trajectory alignment between retrieved text and original motion) are reported. Without these, it is impossible to verify that the dual-stream fine-tuning and confidence filtering actually mitigate misalignment rather than simply discarding most data.
  2. [§4] §4 (experiments and ablations): the paper states that extensive experiments demonstrate competitive dance quality and improved text control, but the provided description contains no quantitative results, ablation tables on the confidence threshold or contrastive temperature, or comparisons showing that the task-aligned metric correlates with human judgments of controllability. This gap directly affects the central claim of 'substantially improving text-conditioned control.'
  3. [Method / §3.3] Method description of the novel task-aligned metric: the metric is introduced to quantify text-induced kinematic attributes, but its exact formulation, normalization, and validation against existing metrics (e.g., FID, beat alignment, or user-study scores) are not detailed. If the metric is ad-hoc and unvalidated, it cannot reliably support the superiority claims.
minor comments (2)
  1. [§3.2] Notation for the contrastive loss temperature and loss weights should be explicitly defined with symbols and ranges used in experiments.
  2. [Figure 2] Figure captions for the pipeline diagram should clarify which components are frozen versus trainable during the dual-stream stage.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify the presentation of our contributions. We address each major comment below and will incorporate the requested analyses and clarifications in a revised manuscript.

Point-by-point responses
  1. Referee: [Abstract / §3] Abstract and §3 (motion-centred bridging paradigm): the claim that contrastive alignment on motion embeddings yields sufficiently clean cross-modal supervision for fine-grained text control is load-bearing, yet no retrieval-precision metrics, noise-rate statistics, or kinematic-fidelity analysis (e.g., joint-angle or trajectory alignment between retrieved text and original motion) are reported. Without these, it is impossible to verify that the dual-stream fine-tuning and confidence filtering actually mitigate misalignment rather than simply discarding most data.

    Authors: We agree that direct quantitative validation of the retrieved supervision quality is necessary to support the core bridging claim. The current manuscript validates the approach primarily through end-to-end task performance, but we will add a new subsection in §3 that reports retrieval precision (Recall@1/5/10) for both text-to-motion and music-to-motion retrieval in the shared embedding space. We will also include noise-rate statistics from manual review of 200 sampled pairs and kinematic-fidelity measurements (mean joint-angle deviation and trajectory RMSE) between original motions and those implied by the retrieved text under matched music. These will be shown both before and after confidence filtering to demonstrate its effect. revision: yes

  2. Referee: [§4] §4 (experiments and ablations): the paper states that extensive experiments demonstrate competitive dance quality and improved text control, but the provided description contains no quantitative results, ablation tables on the confidence threshold or contrastive temperature, or comparisons showing that the task-aligned metric correlates with human judgments of controllability. This gap directly affects the central claim of 'substantially improving text-conditioned control.'

    Authors: We will expand §4 to include all quantitative results in the main text (currently partially relegated to supplementary material), with tables reporting dance quality (FID, BAS) and text-control metrics across baselines. New ablation tables will vary the confidence threshold (0.5–0.9) and contrastive temperature (0.05–0.2), reporting their impact on both quality and control metrics. We will further add a user study (20 raters, 5-point controllability scale) and report the correlation (Spearman) between the task-aligned metric and human scores to directly link the metric to perceived controllability. revision: yes

  3. Referee: [Method / §3.3] Method description of the novel task-aligned metric: the metric is introduced to quantify text-induced kinematic attributes, but its exact formulation, normalization, and validation against existing metrics (e.g., FID, beat alignment, or user-study scores) are not detailed. If the metric is ad-hoc and unvalidated, it cannot reliably support the superiority claims.

    Authors: We will revise §3.3 to present the complete formulation: the metric computes the average cosine similarity in a kinematic embedding space (from a frozen pose encoder) between the generated motion and a text-conditioned reference motion, normalized by batch-wise standard deviation. We will add validation results showing its negative correlation with FID, positive correlation with beat alignment, and strong alignment with the new user-study controllability scores, thereby grounding its use for superiority claims. revision: yes
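
Taking the formulation in response 3 at face value, a minimal sketch of the metric could look as follows; the frozen pose encoder, the source of the text-conditioned reference motion, and the batch-wise normalization follow the rebuttal's wording and are otherwise assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def task_aligned_score(pose_encoder, generated_motion, reference_motion, eps=1e-6):
    """Average cosine similarity in a kinematic embedding space, normalized batch-wise.

    pose_encoder: frozen encoder mapping a motion sequence, e.g. (B, T, J, 3), to (B, D).
    generated_motion: dance generated from (music, text prompt).
    reference_motion: text-conditioned reference motion for the same prompt.
    """
    g = F.normalize(pose_encoder(generated_motion), dim=-1)
    r = F.normalize(pose_encoder(reference_motion), dim=-1)
    sims = (g * r).sum(dim=-1)                          # per-sample cosine similarity
    return (sims.mean() / (sims.std() + eps)).item()    # batch-wise std normalization
```

Validating the metric as promised in response 2 would then amount to computing a rank correlation (e.g. scipy.stats.spearmanr) between these scores and the user-study controllability ratings.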

Circularity Check

0 steps flagged

No circularity: derivation relies on external datasets, pre-trained backbone, and standard contrastive retrieval without self-referential reduction

Full rationale

The paper's central pipeline—motion-centred contrastive alignment of disjoint music-dance and text-motion datasets, cross-modal retrieval, dual-stream fine-tuning of a text branch on a frozen diffusion backbone, and confidence filtering—is presented as a practical engineering solution rather than a mathematical derivation. No equations are shown that define a quantity in terms of itself or rename a fitted parameter as a prediction. No load-bearing uniqueness theorem or ansatz is imported via self-citation. The claimed improvements are supported by external experimental validation on standard metrics and a new task-aligned metric, keeping the argument self-contained against independent benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim rests on standard assumptions from contrastive learning and diffusion models plus several ad-hoc components introduced for bridging and filtering; the only newly introduced entity is the proposed evaluation metric.

free parameters (2)
  • confidence threshold for filtering
    Used in dual-stream fine-tuning to suppress noise from retrieved supervision; value chosen to balance data quality and quantity.
  • contrastive alignment temperature and loss weights
    Hyperparameters in the motion-centred bridging step that determine how strongly music-dance and text-motion embeddings are pulled together.
axioms (2)
  • domain assumption: Motion embeddings from separate datasets form a reliable shared semantic space for cross-modal retrieval
    Invoked in the motion-centred bridging paradigm to align disjoint datasets without direct music-text pairs.
  • domain assumption: Freezing the music-to-dance diffusion backbone preserves rhythmic fidelity while allowing text control
    Stated as the basis for adding the lightweight text branch without retraining the core model.
invented entities (1)
  • task-aligned metric (no independent evidence)
    purpose: Quantifies whether textual prompts induce intended kinematic attributes under music conditioning
    New evaluation tool proposed to measure text control effectiveness beyond standard dance quality metrics.

pith-pipeline@v0.9.0 · 5526 in / 1711 out tokens · 35223 ms · 2026-05-10T07:34:09.656694+00:00 · methodology


Reference graph

Works this paper leans on

57 extracted references · 7 canonical work pages · 3 internal anchors

  1. [1]

    Emre Aksan, Manuel Kaufmann, Peng Cao, and Otmar Hilliges. 2021. A spatio-temporal transformer for 3D human motion prediction. In 2021 International Conference on 3D Vision (3DV). IEEE, 565–574

  2. [2]

    Nikos Athanasiou, Alpár Cseke, Markos Diomataris, Michael J Black, and Gül Varol. 2024. Motionfix: Text-driven 3d human motion editing. In SIGGRAPH Asia 2024 Conference Papers. 1–11

  3. [3]

    Judith Butepage, Michael J Black, Danica Kragic, and Hedvig Kjellstrom. 2017. Deep representation learning for human motion prediction and classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6158–6166

  4. [4]

    Hsukuang Chiu, Ehsan Adeli, Borui Wang, De-An Huang, and Juan Carlos Niebles. 2019. Action-agnostic human pose forecasting. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 1423–1432

  5. [5]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Volume 1 (Long and Short Papers). 4171–4186

  6. [6]

    Xiaoxiao Du, Ram Vasudevan, and Matthew Johnson-Roberson. 2019. Bio-lstm: A biomechanically inspired recurrent neural network for 3-d pedestrian pose and gait prediction. IEEE Robotics and Automation Letters 4, 2 (2019), 1501–1508

  7. [7]

    Di Fan, Lili Wan, Wanru Xu, and Shenghui Wang. 2022. A bi-directional attention guided cross-modal network for music based dance generation. Computers and Electrical Engineering 103 (2022), 108310

  8. [8]

    Satoru Fukayama and Masataka Goto. 2015. Music content driven automated choreography with beat-wise motion connectivity constraints. Proceedings of SMC (2015), 177–183

  9. [9]

    Kehong Gong, Dongze Lian, Heng Chang, Chuan Guo, Zihang Jiang, Xinxin Zuo, Michael Bi Mi, and Xinchao Wang. 2023. Tm2d: Bimodality driven 3D dance generation via music-text integration. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 9942–9952

  10. [10]

    Liang-Yan Gui, Yu-Xiong Wang, Xiaodan Liang, and José MF Moura. 2018. Adversarial geometry-aware human motion prediction. In Proceedings of the European Conference on Computer Vision (ECCV). 786–803

  11. [11]

    Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng

  12. [12]

    Generating diverse and natural 3d human motions from text. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 5152–5161

  13. [13]

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9729–9738

  14. [14]

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems 30 (2017)

  15. [15]

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising Diffusion Probabilistic Models. In Advances in Neural Information Processing Systems (NeurIPS)

  16. [16]

    Jonathan Ho and Tim Salimans. 2022. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)

  17. [17]

    Daniel Holden, Jun Saito, and Taku Komura. 2016. A deep learning framework for character motion synthesis and editing. ACM Transactions on Graphics (TOG) 35, 4 (2016), 1–11

  18. [18]

    Daniel Holden, Jun Saito, Taku Komura, and Thomas Joyce. 2015. Learning motion manifolds with convolutional autoencoders. In SIGGRAPH Asia 2015 technical briefs. Association for Computing Machinery, New York, NY, USA, 1–4

  19. [19]

    Ruozi Huang, Huang Hu, Wei Wu, Kei Sawada, Mi Zhang, and Daxin Jiang. 2020. Dance revolution: Long-term dance generation with music via curriculum learning. arXiv preprint arXiv:2006.06119 (2020)

  21. [21]

    Yuhang Huang, Junjie Zhang, Shuyan Liu, Qian Bao, Dan Zeng, Zhineng Chen, and Wu Liu. 2022. Genre-conditioned long-term 3d dance generation driven by music. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 4858–4862

  22. [22]

    Jinwoo Kim, Heeseok Oh, Seongjean Kim, Hoseok Tong, and Sanghoon Lee. 2022. A brand new dance partner: Music-conditioned pluralistic dancing controlled by multiple dance genres. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 3490–3500

  23. [23]

    Buyu Li, Yongchi Zhao, Shi Zhelun, and Lu Sheng. 2022. Danceformer: Music conditioned 3D dance generation with parametric motion transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 1272–1279

  24. [24]

    Chen Li, Zhen Zhang, Wee Sun Lee, and Gim Hee Lee. 2018. Convolutional sequence to sequence model for human dynamics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5226–5234

  25. [25]

    Jiaman Li, Yihang Yin, Hang Chu, Yi Zhou, Tingwu Wang, Sanja Fidler, and Hao Li. 2020. Learning to generate diverse dance motions with transformer. arXiv preprint arXiv:2008.08171 (2020)

  26. [26]

    Ruilong Li, Shan Yang, David A Ross, and Angjoo Kanazawa. 2021. Ai choreographer: Music conditioned 3D dance generation with aist++. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 13401–13412

  27. [27]

    Ronghui Li, YuXiang Zhang, Yachao Zhang, Hongwen Zhang, Jie Guo, Yan Zhang, Yebin Liu, and Xiu Li. 2024. Lodge: A coarse to fine diffusion network for long dance generation guided by the characteristic dance primitives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 1524–1534

  28. [28]

    Ronghui Li, Junfan Zhao, Yachao Zhang, Mingyang Su, Zeping Ren, Han Zhang, Yansong Tang, and Xiu Li. 2023. Finedance: A fine-grained choreography dataset for 3d full body dance generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 10234–10243

  29. [29]

    Xinran Liu, Xu Dong, Diptesh Kanojia, Wenwu Wang, and Zhenhua Feng. 2025. GCDance: Genre-Controlled 3D Full Body Dance Generation Driven By Music. arXiv preprint arXiv:2502.18309 (2025)

  30. [30]

    Xinran Liu, Zhenhua Feng, Diptesh Kanojia, and Wenwu Wang. 2024. DGFM: Full Body Dance Generation Driven by Music Foundation Models. In Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation

  31. [31]

    Zhenguang Liu, Shuang Wu, Shuyuan Jin, Qi Liu, Shijian Lu, Roger Zimmermann, and Li Cheng. 2019. Towards natural and accurate future motion prediction of humans and animals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10004–10012

  32. [32]

    Iliana Loi, Evangelia I Zacharaki, and Konstantinos Moustakas. 2023. Machine learning approaches for 3D motion synthesis and musculoskeletal dynamics estimation: a survey. IEEE Transactions on Visualization and Computer Graphics 30, 8 (2023), 5810–5829

  33. [33]

    Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. 2023. SMPL: A skinned multi-person linear model. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2. 851–866

  34. [34]

    Zhenye Luo, Min Ren, Xuecai Hu, Yongzhen Huang, and Li Yao. 2024. Popdg: Popular 3d dance generation with popdanceset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 26984–26993

  35. [35]

    Tiezheng Ma, Yongwei Nie, Chengjiang Long, Qing Zhang, and Guiqing Li. 2022. Progressively generating better initial guesses towards next stages for high-quality human motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 6437–6446

  36. [36]

    Julieta Martinez, Michael J Black, and Javier Romero. 2017. On human motion prediction using recurrent neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2891–2900

  37. [37]

    Lucas Mourot, Ludovic Hoyet, François Le Clerc, François Schnitzler, and Pierre Hellier. 2022. A survey on deep learning for skeleton-based human animation. In Computer Graphics Forum, Vol. 41. Wiley Online Library, 122–157

  38. [38]

    Ferda Ofli, Yasemin Demir, Yücel Yemez, Engin Erzin, A Murat Tekalp, Koray Balcı, İdil Kızoğlu, Lale Akarun, Cristian Canton-Ferrer, Joëlle Tilmanne, et al. 2008. An audio-driven dancing avatar. Journal on Multimodal User Interfaces 2 (2008), 93–103

  40. [40]

    Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. 2018. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence, Vol. 32

  41. [41]

    Qiaosong Qi, Le Zhuo, Aixi Zhang, Yue Liao, Fei Fang, Si Liu, and Shuicheng Yan. Diffdance: Cascaded human motion diffusion model for dance generation. In Proceedings of the 31st ACM International Conference on Multimedia. 1374–1382

  43. [43]

    Dmitry Senushkin, Nikolay Patakin, Arseny Kuznetsov, and Anton Konushin. Independent component alignment for multi-task learning. In CVPR. 20083–20093

  45. [45]

    Takaaki Shiratori, Atsushi Nakazawa, and Katsushi Ikeuchi. 2006. Dancing-to-music character animation. In Computer Graphics Forum, Vol. 25. Wiley Online Library, 449–458

  46. [46]

    Li Siyao, Weijiang Yu, Tianpei Gu, Chunze Lin, Quan Wang, Chen Qian, Chen Change Loy, and Ziwei Liu. 2022. Bailando: 3D dance generation by actor-critic gpt with choreographic memory. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 11050–11059

  47. [47]

    Guofei Sun, Yongkang Wong, Zhiyong Cheng, Mohan S Kankanhalli, Weidong Geng, and Xiangdong Li. 2020. Deepdance: music-to-dance motion choreography with adversarial learning. IEEE Transactions on Multimedia 23 (2020), 497–509

  48. [48]

    Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Daniel Cohen-Or, and Amit H. Bermano. 2022. Human Motion Diffusion Model. arXiv:2209.14916 [cs.CV]

  49. [49]

    Jonathan Tseng, Rodrigo Castellon, and Karen Liu. 2023. Edge: Editable dance generation from music. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 448–458

  50. [50]

    Qing Wang, Xiaohang Yang, Yilan Dong, Naveen Raj Govindaraj, Gregory Slabaugh, and Shanxin Yuan. 2025. DanceChat: Large Language Model-Guided Music-to-Dance Generation. arXiv preprint arXiv:2506.10574 (2025)

  51. [51]

    Xingyu Xie, Pan Zhou, Huan Li, Zhouchen Lin, and Shuicheng Yan. 2024. Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)

  52. [52]

    Han Yang, Kun Su, Yutong Zhang, Jiaben Chen, Kaizhi Qian, Gaowen Liu, and Chuang Gan. 2025. Unimumo: Unified text, music, and motion generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 25615–25623

  53. [53]

    Hengyuan Zhang, Zhe Li, Xingqun Qi, Mengze Li, Muyi Sun, Siye Wang, Man Zhang, and Sirui Han. 2025. DanceEditor: Towards Iterative Editable Music-driven Dance Generation with Open-Vocabulary Descriptions. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 12158–12168

  54. [54]

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. 2023. Adding conditional control to text-to-image Diffusion Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 3836–3847

  55. [55]

    Zeyu Zhang, Yiran Wang, Wei Mao, Danning Li, Rui Zhao, Biao Wu, Zirui Song, Bohan Zhuang, Ian Reid, and Richard Hartley. 2025. Motion anything: Any to motion generation. arXiv preprint arXiv:2503.06955 (2025)

  56. [56]

    Wentao Zhu, Xiaoxuan Ma, Dongwoo Ro, Hai Ci, Jinlu Zhang, Jiaxin Shi, Feng Gao, Qi Tian, and Yizhou Wang. 2023. Human motion generation: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 46, 4 (2023), 2430–2449

  57. [57]

    Wenlin Zhuang, Congyi Wang, Jinxiang Chai, Yangang Wang, Ming Shao, and Siyu Xia. 2022. Music2dance: Dancenet for music-driven dance generation. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 18, 2 (2022), 1–21