GCDance: Genre-Controlled Music-Driven 3D Full Body Dance Generation
Pith reviewed 2026-05-23 02:38 UTC · model grok-4.3
The pith
A text-based control mechanism lets diffusion models generate 3D dances that match both music and a chosen genre.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors introduce a diffusion-based framework for genre-specific 3D full-body dance generation conditioned on both music and descriptive text. A text-based control mechanism maps input prompts, whether explicit genre labels or free-form text, into genre-specific control signals. Features from a music foundation model support coherent alignment between conditions, and a multi-task optimization strategy balances physical realism, spatial accuracy, and text classification to improve overall sequence quality. Experiments on the FineDance and AIST++ datasets show the method outperforms existing approaches.
What carries the argument
The text-based control mechanism that converts input prompts into genre-specific control signals for the diffusion process.
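As a rough illustration of what "mapping a prompt into a genre-specific control signal" could look like, the sketch below turns a prompt into soft genre scores and blends per-genre scale and shift values that modulate a feature vector. This is an assumption-laden toy, not the paper's mechanism: the genre list `GENRES`, the keyword-based `genre_scores` (a stand-in for a real text encoder and classifier head), the `GENRE_FILM` table, and the scale-and-shift form of the modulation are all hypothetical.

```python
import math

# Hypothetical genre vocabulary; the paper's actual label set is not given here.
GENRES = ["hiphop", "jazz", "ballet"]

# Hypothetical per-genre (scale, shift) pairs; a trained model would learn these.
GENRE_FILM = {"hiphop": (1.5, 0.2), "jazz": (0.8, -0.1), "ballet": (1.0, 0.5)}


def genre_scores(prompt):
    """Toy stand-in for a text encoder plus classifier: softmax over keyword hits."""
    text = prompt.lower()
    logits = [3.0 if genre in text else 0.0 for genre in GENRES]
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def control_signal(prompt):
    """Blend per-genre scale and shift values by the soft genre scores."""
    probs = genre_scores(prompt)
    scale = sum(p * GENRE_FILM[g][0] for p, g in zip(probs, GENRES))
    shift = sum(p * GENRE_FILM[g][1] for p, g in zip(probs, GENRES))
    return scale, shift


def modulate(features, prompt):
    """Apply the control signal to a feature vector (assumed scale-and-shift form)."""
    scale, shift = control_signal(prompt)
    return [scale * f + shift for f in features]
```

Because the scores are soft, a free-form prompt that names no known genre falls back to an even blend of the genre parameters rather than failing, which is one plausible way to handle both explicit labels and descriptive text with the same path.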
If this is right
- Dances show improved stylistic consistency with the input genre while remaining synchronized to music.
- Both explicit genre labels and free-form descriptive text can guide generation.
- The multi-task optimization maintains high physical realism and spatial accuracy alongside style control.
- Results exceed prior state-of-the-art methods on the FineDance and AIST++ datasets.
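The balancing act described in the third point can be pictured, in its simplest form, as a weighted sum of per-task losses. The sketch below uses a fixed weighted sum purely for illustration; the paper's multi-task strategy may well be more elaborate, and the task names and the `multitask_loss` helper are assumptions rather than anything taken from the source.

```python
def multitask_loss(losses, weights):
    """Combine per-task losses with fixed weights (a simplification, see lead-in).

    losses:  dict mapping task name -> scalar loss value
    weights: dict mapping task name -> non-negative weight
    """
    missing = set(losses) - set(weights)
    if missing:
        raise KeyError("no weight given for: " + ", ".join(sorted(missing)))
    return sum(weights[name] * value for name, value in losses.items())
```

Raising on a missing weight, rather than silently defaulting to 1.0, keeps an unweighted task from dominating the total unnoticed. The weights themselves are the free parameters that any such scheme has to tune or learn.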
Where Pith is reading between the lines
- The text-conditioning approach could extend to other motion synthesis tasks that require style control, such as character animation in games.
- Natural language input might lower barriers for non-experts creating custom dance sequences in virtual environments.
- Further tests on music outside the training distribution would clarify how well the genre mapping generalizes.
- Combining this control with real-time audio input could support interactive applications like live performance tools.
Load-bearing premise
The text prompts can be mapped to genre control signals that improve stylistic consistency without reducing motion quality or music synchronization.
What would settle it
Generate dances from music of one genre paired with a text prompt for a conflicting genre and check whether the output fails to reflect the prompted style while still matching the music beats.
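A minimal sketch of how this check could be scored, assuming each trial already comes with a predicted genre from some external motion classifier (hypothetical here) and with beat times in seconds for both the music and the generated motion. The beat alignment function follows one commonly used form, a Gaussian-weighted distance from each music beat to its nearest motion beat; the `sigma` value, the dict keys, and the function names are illustrative assumptions.

```python
import math


def beat_alignment(music_beats, motion_beats, sigma=0.1):
    """Mean over music beats of exp(-d^2 / (2 * sigma^2)).

    d is the distance in seconds to the nearest motion beat.
    Returns 0.0 if either beat list is empty.
    """
    if not music_beats or not motion_beats:
        return 0.0
    total = 0.0
    for t_music in music_beats:
        d = min(abs(t_music - t_motion) for t_motion in motion_beats)
        total += math.exp(-(d * d) / (2.0 * sigma * sigma))
    return total / len(music_beats)


def conflict_summary(trials):
    """Summarize conflicting-prompt trials.

    Each trial is a dict with keys 'prompted_genre', 'predicted_genre',
    'music_beats', and 'motion_beats' (hypothetical field names).
    """
    if not trials:
        raise ValueError("need at least one trial")
    n = len(trials)
    follows = sum(1 for t in trials if t["predicted_genre"] == t["prompted_genre"])
    beat = sum(beat_alignment(t["music_beats"], t["motion_beats"]) for t in trials)
    return {"prompt_follow_rate": follows / n, "mean_beat_alignment": beat / n}
```

Under this framing, a low prompt-follow rate combined with a high mean beat alignment would be the failure pattern described above: the motion tracks the music but ignores the prompted style.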
Original abstract
Music-driven dance generation is a challenging task as it requires strict adherence to genre-specific choreography while ensuring physically realistic and precisely synchronized dance sequences with the music's beats and rhythm. Although significant progress has been made in music-conditioned dance generation, most existing methods struggle to convey specific stylistic attributes in generated dance. To bridge this gap, we propose a diffusion-based framework for genre-specific 3D full-body dance generation, conditioned on both music and descriptive text. To effectively incorporate genre information, we develop a text-based control mechanism that maps input prompts, either explicit genre labels or free-form descriptive text, into genre-specific control signals, enabling precise and controllable text-guided generation of genre-consistent dance motions. Furthermore, to enhance the alignment between music and textual conditions, we leverage the features of a music foundation model, facilitating coherent and semantically aligned dance synthesis. Last, to balance the objectives of extracting text-genre information and maintaining high-quality generation results, we propose a novel multi-task optimization strategy. This effectively balances competing factors such as physical realism, spatial accuracy, and text classification, significantly improving the overall quality of the generated sequences. Extensive experimental results obtained on the FineDance and AIST++ datasets demonstrate the superiority of GCDance over the existing state-of-the-art approaches.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes GCDance, a diffusion-based framework for genre-controlled 3D full-body dance generation conditioned on both music and text prompts (either genre labels or free-form descriptions). It introduces a text-based control mechanism to produce genre-specific signals, incorporates features from a music foundation model for better alignment, and uses a multi-task optimization strategy to balance physical realism, spatial accuracy, and text classification. Experiments on the FineDance and AIST++ datasets are stated to demonstrate superiority over existing state-of-the-art methods.
Significance. If the empirical claims hold with proper validation, the work would provide a practical advance in controllable dance synthesis by enabling text-guided genre consistency while preserving music synchronization and physical plausibility. The combination of diffusion models, music foundation features, and multi-task balancing addresses a recognized limitation in prior music-driven methods. Strengths include the coherent integration of text conditioning and the explicit multi-task loss formulation, though the absence of detailed metrics limits immediate assessment of impact.
major comments (2)
- [Experiments] Experiments section: The central claim of superiority over SOTA on FineDance and AIST++ is asserted without any reported quantitative metrics (e.g., FID, beat alignment scores, genre classification accuracy), baseline descriptions, ablation studies, or statistical tests. This directly undermines verification of the text-based control mechanism's effectiveness and the multi-task strategy's benefits.
- [§3.2] §3.2 (Text-based Control Mechanism): The mapping of input prompts to genre-specific control signals is presented as enabling precise text-guided generation, yet no analysis or ablation demonstrates that this improves stylistic consistency without degrading physical realism or music synchronization—the load-bearing assumption for the framework's novelty.
minor comments (1)
- [Abstract] The abstract and introduction would benefit from explicit citation of the music foundation model used and a brief description of the multi-task loss weights.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to provide the requested experimental details and analyses.
- Referee: [Experiments] Experiments section: The central claim of superiority over SOTA on FineDance and AIST++ is asserted without any reported quantitative metrics (e.g., FID, beat alignment scores, genre classification accuracy), baseline descriptions, ablation studies, or statistical tests. This directly undermines verification of the text-based control mechanism's effectiveness and the multi-task strategy's benefits.
  Authors: We acknowledge that the submitted manuscript does not report the quantitative metrics, baseline details, ablations, or statistical tests in the Experiments section. In the revision we will add FID, beat alignment, genre classification accuracy, full baseline descriptions, ablation studies on the text-based control and multi-task losses, and statistical significance tests to substantiate the superiority claims.
  Revision: yes
- Referee: [§3.2] §3.2 (Text-based Control Mechanism): The mapping of input prompts to genre-specific control signals is presented as enabling precise text-guided generation, yet no analysis or ablation demonstrates that this improves stylistic consistency without degrading physical realism or music synchronization, which is the load-bearing assumption for the framework's novelty.
  Authors: We agree that an explicit ablation is required to isolate the contribution of the text-based control mechanism. The revised manuscript will include an ablation study comparing the full model against a variant without text conditioning, reporting metrics for stylistic consistency (genre accuracy), physical realism, and music synchronization to confirm that the mechanism improves genre adherence without harming the other objectives.
  Revision: yes
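The comparison promised in the second response could be read off with a check like the one below: does the full model beat the no-text variant on the target metric, and does any other metric get worse by more than a tolerance? This is only a sketch. The metric names, the `higher_is_better` flags, the absolute `tolerance`, and the `ablation_verdict` helper are hypothetical, and no values from the paper are implied.

```python
def ablation_verdict(full, ablated, higher_is_better,
                     target="genre_accuracy", tolerance=0.02):
    """Compare a full model against an ablated variant.

    full, ablated:    dicts mapping metric name -> value
    higher_is_better: dict mapping metric name -> bool
    Returns (target_improved, degraded), where degraded lists every non-target
    metric that worsened by more than `tolerance` (absolute units).
    """
    def gain(name):
        delta = full[name] - ablated[name]
        # Flip the sign for metrics where a lower value is better.
        return delta if higher_is_better[name] else -delta

    target_improved = gain(target) > 0
    degraded = sorted(
        name for name in full
        if name != target and gain(name) < -tolerance
    )
    return target_improved, degraded
```

A single absolute tolerance is a blunt instrument when metrics live on different scales; per-metric tolerances or the statistical tests mentioned in the first response would be the more careful choice.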
Circularity Check
No significant circularity detected
The paper presents a diffusion framework with text-based genre control, music foundation model features, and multi-task optimization. No equations, derivations, or predictions are shown that reduce to fitted inputs, self-definitions, or self-citation chains by construction. Claims of superiority rest on experimental results on external datasets (FineDance, AIST++), which are independent of any internal tautology. The text-conditioning mechanism is described as a novel architectural addition rather than a renaming or fit of prior outputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- multi-task loss weights
axioms (1)
- domain assumption: Features from a music foundation model provide semantically aligned information that improves dance-music synchronization when injected into the diffusion process.
Forward citations
Cited by 4 Pith papers
- TeMuDance: Contrastive Alignment-Based Textual Control for Music-Driven Dance Generation
  TeMuDance enables text-based semantic control over music-conditioned dance generation by using motion as a bridge to align existing unpaired datasets and training a lightweight text branch on a frozen diffusion backbo...
- ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body
  ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on ...
- PianoFlow: Music-Aware Streaming Piano Motion Generation with Bimanual Coordination
  PianoFlow generates coordinated bimanual piano motions from audio via MIDI-distilled flow-matching, asymmetric role-gated interaction, and autoregressive streaming continuation, outperforming priors with 9x faster inference.
- Listen to Rhythm, Choose Movements: Autoregressive Multimodal Dance Generation via Diffusion and Mamba with Decoupled Dance Dataset
  LRCM is a new multimodal diffusion model with audio and text Conformers plus Motion Temporal Mamba for generating long, coherent dance sequences from rhythm and descriptions using a decoupled dataset.