GCDance: Genre-Controlled Music-Driven 3D Full Body Dance Generation

Diptesh Kanojia; Shenbin Qian; Wenwu Wang; Xinran Liu; Xu Dong; Zhenhua Feng

arxiv: 2502.18309 · v3 · submitted 2025-02-25 · 💻 cs.GR · cs.CV · cs.SD · eess.AS

Xinran Liu, Xu Dong, Shenbin Qian, Diptesh Kanojia, Wenwu Wang, Zhenhua Feng

Pith reviewed 2026-05-23 02:38 UTC · model grok-4.3

classification 💻 cs.GR · cs.CV · cs.SD · eess.AS

keywords 3D dance generation · music-driven motion · diffusion models · genre control · text conditioning · full-body motion synthesis · multi-task optimization

The pith

A text-based control mechanism lets diffusion models generate 3D dances that match both music and a chosen genre.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to create three-dimensional full-body dance sequences that follow the rhythm of input music while also matching a specific genre described through text prompts. Earlier methods could often sync to beats but produced motions without reliable stylistic consistency across genres. The framework adds a control system that turns text descriptions into guiding signals for the generator, draws on features from a music foundation model for better alignment, and applies a multi-task training strategy to balance realism, accuracy, and style classification. Success here would mean users can direct dance output toward particular styles using either genre labels or free-form text without losing physical plausibility or timing.

Core claim

The authors introduce a diffusion-based framework for genre-specific 3D full-body dance generation conditioned on both music and descriptive text. A text-based control mechanism maps input prompts, whether explicit genre labels or free-form text, into genre-specific control signals. Features from a music foundation model support coherent alignment between conditions, and a multi-task optimization strategy balances physical realism, spatial accuracy, and text classification to improve overall sequence quality. Experiments on the FineDance and AIST++ datasets show the method outperforms existing approaches.
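One common way to turn a text embedding into a control signal is feature-wise modulation in the style of FiLM (reference [56] in the list below). The sketch below is illustrative only: this page does not say how GCDance implements its control module, and the function `film_modulate`, the variable `genre_embedding`, and the random projection matrices are all hypothetical.

```python
import numpy as np

def film_modulate(features, genre_embedding, w_scale, w_shift):
    """Scale and shift per-frame features using a genre embedding (FiLM-style).

    features:         (frames, channels) hidden motion features
    genre_embedding:  (embed_dim,) text-derived genre vector (hypothetical)
    w_scale, w_shift: (embed_dim, channels) projection matrices (hypothetical)
    """
    gamma = genre_embedding @ w_scale   # per-channel scale offset
    beta = genre_embedding @ w_shift    # per-channel shift
    return features * (1.0 + gamma) + beta  # broadcast over frames

rng = np.random.default_rng(0)
frames, channels, embed_dim = 4, 8, 16
features = rng.normal(size=(frames, channels))
genre_embedding = rng.normal(size=embed_dim)
w_scale = 0.1 * rng.normal(size=(embed_dim, channels))
w_shift = 0.1 * rng.normal(size=(embed_dim, channels))

modulated = film_modulate(features, genre_embedding, w_scale, w_shift)
# A zero embedding gives gamma = beta = 0, so the features pass through unchanged.
unchanged = film_modulate(features, np.zeros(embed_dim), w_scale, w_shift)
```

Writing the scale as `1 + gamma` means an uninformative embedding falls back to the music-only features, which is one simple way to keep the control path from disturbing the base generator.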

What carries the argument

The text-based control mechanism that converts input prompts into genre-specific control signals for the diffusion process.

If this is right

Dances show improved stylistic consistency with the input genre while remaining synchronized to music.
Both explicit genre labels and free-form descriptive text can guide generation.
The multi-task optimization maintains high physical realism and spatial accuracy alongside style control.
Results exceed prior state-of-the-art methods on the FineDance and AIST++ datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The text-conditioning approach could extend to other motion synthesis tasks that require style control, such as character animation in games.
Natural language input might lower barriers for non-experts creating custom dance sequences in virtual environments.
Further tests on music outside the training distribution would clarify how well the genre mapping generalizes.
Combining this control with real-time audio input could support interactive applications like live performance tools.

Load-bearing premise

The text prompts can be mapped to genre control signals that improve stylistic consistency without reducing motion quality or music synchronization.

What would settle it

Generate dances from music of one genre paired with a text prompt for a conflicting genre. The claim is in doubt if the output stays on the music beats but fails to reflect the prompted style.
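A minimal sketch of how such a conflicting-prompt test could be scored, assuming a separate genre classifier has already labelled each generated clip and that beat times have been extracted for music and motion. The helper names, the Gaussian-kernel form of the beat score, and the toy inputs are assumptions made for illustration; the paper's own metrics may be defined differently.

```python
import numpy as np

def beat_alignment(music_beats, motion_beats, sigma=0.1):
    """Mean Gaussian closeness of each music beat to its nearest motion beat (seconds)."""
    music_beats = np.asarray(music_beats, dtype=float)
    motion_beats = np.asarray(motion_beats, dtype=float)
    if music_beats.size == 0 or motion_beats.size == 0:
        return 0.0
    # Distance from every music beat to the closest motion beat.
    gaps = np.abs(music_beats[:, None] - motion_beats[None, :]).min(axis=1)
    return float(np.mean(np.exp(-gaps**2 / (2.0 * sigma**2))))

def prompt_following_rate(prompted, predicted):
    """Fraction of clips whose classified genre equals the prompted genre."""
    return float(np.mean([p == q for p, q in zip(prompted, predicted)]))

# Toy, made-up inputs: prompted genres versus labels from a hypothetical classifier.
prompted = ["Dai", "Miao", "Classical", "Dai"]
predicted = ["Dai", "Miao", "Dai", "Dai"]
rate = prompt_following_rate(prompted, predicted)

# Perfectly coincident beats give the maximum score of 1.0.
score = beat_alignment([0.5, 1.0, 1.5], [0.5, 1.0, 1.5])
```

A low `rate` together with a high `score` would be the outcome that counts against the premise: the dance follows the music but ignores the prompt.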

Figures

Figures reproduced from arXiv: 2502.18309 by Diptesh Kanojia, Shenbin Qian, Wenwu Wang, Xinran Liu, Xu Dong, Zhenhua Feng.

**Figure 2.** An overview of GCDance. Left: the multimodal inputs and feature extraction. Middle: the training process at a given diffusion timestep …

**Figure 4.** The control module of GCDance. Finally, GCDance takes as input the noise slice d_T, the music condition C_M, the text genre embedding C_E, and the diffusion timestep t. These inputs are then fed into a Transformer-based denoising network. As illustrated in Figure 3, we employ two expert downsampling modules to separately model the distributions of body motion and hand motion, inspired by [14]. This approach …

**Figure 5.** GCDance can generate joint-specific and temporally-specific dance segments. In the left example, the constrained body joints are shown in gray, while …

**Figure 6.** Same music, different popular dance. Boxed hand, leg, and full-body poses highlight the salient stylistic features that distinguish each genre. Panel labels: Miao, Dai, Classical, Classical Music.

**Figure 8.** Visual comparison of SOTA methods.

Original abstract

Music-driven dance generation is a challenging task as it requires strict adherence to genre-specific choreography while ensuring physically realistic and precisely synchronized dance sequences with the music's beats and rhythm. Although significant progress has been made in music-conditioned dance generation, most existing methods struggle to convey specific stylistic attributes in generated dance. To bridge this gap, we propose a diffusion-based framework for genre-specific 3D full-body dance generation, conditioned on both music and descriptive text. To effectively incorporate genre information, we develop a text-based control mechanism that maps input prompts, either explicit genre labels or free-form descriptive text, into genre-specific control signals, enabling precise and controllable text-guided generation of genre-consistent dance motions. Furthermore, to enhance the alignment between music and textual conditions, we leverage the features of a music foundation model, facilitating coherent and semantically aligned dance synthesis. Last, to balance the objectives of extracting text-genre information and maintaining high-quality generation results, we propose a novel multi-task optimization strategy. This effectively balances competing factors such as physical realism, spatial accuracy, and text classification, significantly improving the overall quality of the generated sequences. Extensive experimental results obtained on the FineDance and AIST++ datasets demonstrate the superiority of GCDance over the existing state-of-the-art approaches.
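For readers new to diffusion-based motion models, the sketch below shows the closed-form forward noising step of a DDPM-style process (reference [7]), applied to a motion clip. It is a generic illustration, not the authors' code: the abstract does not state the noise schedule or the motion representation, so the linear schedule, the clip shape, and the name `forward_noise` are assumptions.

```python
import numpy as np

def forward_noise(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form for a DDPM-style schedule."""
    alpha_bar = np.cumprod(1.0 - betas)[t]   # cumulative signal retention up to step t
    noise = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise
    return x_t, noise

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)   # linear schedule (assumed for this sketch)
x0 = rng.normal(size=(120, 6))          # hypothetical clip: 120 frames, 6 pose features

x_early, _ = forward_noise(x0, 0, betas, rng)     # almost identical to the clean clip
x_late, _ = forward_noise(x0, 999, betas, rng)    # almost pure Gaussian noise
```

A denoising network trained to invert this process is what the music and text conditions would steer at sampling time.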

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

GCDance adds text-to-genre control and music foundation model features to a diffusion dance generator with a multi-task loss, but the abstract states superiority on two datasets without any metrics or baseline details.

The main takeaway is a diffusion model that takes music plus text prompts to produce genre-specific 3D full-body dance. It maps text (labels or free-form) to control signals, pulls features from a music foundation model for better rhythm and semantic alignment, and uses a multi-task loss to trade off realism, spatial accuracy, and genre classification. That specific combination of text pathway, foundation-model conditioning, and explicit loss balancing is not in the music-only baselines referenced in the abstract, so the assembly counts as new for this task.

The framework is presented cleanly and the motivation for adding genre control is straightforward. The components build on standard diffusion and conditioning tricks without obvious internal contradictions.

The soft spot is the evidence. The abstract claims better results than SOTA on FineDance and AIST++ yet gives no numbers, no baseline descriptions, no statistical tests, and no mention of how the multi-task weights were set or validated. Without those, the central performance claim cannot be checked. The loss weights are free parameters, which raises the usual question about whether they were tuned on the test sets. If the full paper contains proper tables, ablations, and failure analysis, the work becomes more useful; right now the empirical side is thin.

This is aimed at researchers and engineers working on controllable motion synthesis for animation, VR, or entertainment tools. A reader already following diffusion-based dance or human motion papers could pick up the control mechanism and the loss-balancing idea. It deserves peer review because the approach is coherent and the gap it targets is real; referees can sort out whether the experiments actually support the claims once the numbers are on the table.

Referee Report

2 major / 1 minor

Summary. The paper proposes GCDance, a diffusion-based framework for genre-controlled 3D full-body dance generation conditioned on both music and text prompts (either genre labels or free-form descriptions). It introduces a text-based control mechanism to produce genre-specific signals, incorporates features from a music foundation model for better alignment, and uses a multi-task optimization strategy to balance physical realism, spatial accuracy, and text classification. Experiments on the FineDance and AIST++ datasets are stated to demonstrate superiority over existing state-of-the-art methods.

Significance. If the empirical claims hold with proper validation, the work would provide a practical advance in controllable dance synthesis by enabling text-guided genre consistency while preserving music synchronization and physical plausibility. The combination of diffusion models, music foundation features, and multi-task balancing addresses a recognized limitation in prior music-driven methods. Strengths include the coherent integration of text conditioning and the explicit multi-task loss formulation, though the absence of detailed metrics limits immediate assessment of impact.

major comments (2)

[Experiments] Experiments section: The central claim of superiority over SOTA on FineDance and AIST++ is asserted without any reported quantitative metrics (e.g., FID, beat alignment scores, genre classification accuracy), baseline descriptions, ablation studies, or statistical tests. This directly undermines verification of the text-based control mechanism's effectiveness and the multi-task strategy's benefits.
[§3.2] §3.2 (Text-based Control Mechanism): The mapping of input prompts to genre-specific control signals is presented as enabling precise text-guided generation, yet no analysis or ablation demonstrates that this improves stylistic consistency without degrading physical realism or music synchronization—the load-bearing assumption for the framework's novelty.
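The FID named in the first major comment is a Fréchet distance between Gaussian fits to real and generated feature sets. The sketch below computes that distance with NumPy on synthetic features; the feature extractor that a dance benchmark would use is not described on this page, so the inputs here are made-up arrays and `frechet_distance` is a hypothetical helper name.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets of shape (n, d)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) equals the sum of square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 4))          # stand-in for features of real dances

same = frechet_distance(real, real)        # identical sets: distance is ~0
shifted = frechet_distance(real, real + 3.0)  # same covariance, mean moved by 3 per dim
```

With equal covariances the distance reduces to the squared mean offset, here 4 × 3² = 36, which makes a convenient sanity check before running the metric on real features.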

minor comments (1)

[Abstract] The abstract and introduction would benefit from explicit citation of the music foundation model used and a brief description of the multi-task loss weights.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to provide the requested experimental details and analyses.

Referee: [Experiments] Experiments section: The central claim of superiority over SOTA on FineDance and AIST++ is asserted without any reported quantitative metrics (e.g., FID, beat alignment scores, genre classification accuracy), baseline descriptions, ablation studies, or statistical tests. This directly undermines verification of the text-based control mechanism's effectiveness and the multi-task strategy's benefits.

Authors: We acknowledge that the submitted manuscript does not report the quantitative metrics, baseline details, ablations, or statistical tests in the Experiments section. In the revision we will add FID, beat alignment, genre classification accuracy, full baseline descriptions, ablation studies on the text-based control and multi-task losses, and statistical significance tests to substantiate the superiority claims. revision: yes
Referee: [§3.2] §3.2 (Text-based Control Mechanism): The mapping of input prompts to genre-specific control signals is presented as enabling precise text-guided generation, yet no analysis or ablation demonstrates that this improves stylistic consistency without degrading physical realism or music synchronization—the load-bearing assumption for the framework's novelty.

Authors: We agree that an explicit ablation is required to isolate the contribution of the text-based control mechanism. The revised manuscript will include an ablation study comparing the full model against a variant without text conditioning, reporting metrics for stylistic consistency (genre accuracy), physical realism, and music synchronization to confirm that the mechanism improves genre adherence without harming the other objectives. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents a diffusion framework with text-based genre control, music foundation model features, and multi-task optimization. No equations, derivations, or predictions are shown that reduce to fitted inputs, self-definitions, or self-citation chains by construction. Claims of superiority rest on experimental results on external datasets (FineDance, AIST++), which are independent of any internal tautology. The text-conditioning mechanism is described as a novel architectural addition rather than a renaming or fit of prior outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the untested premise that text prompts can be mapped to effective genre control signals and that the multi-task objective can be balanced without explicit specification of the weighting scheme or validation that the balance preserves physical realism.

free parameters (1)

multi-task loss weights
The novel multi-task optimization strategy requires balancing terms for physical realism, spatial accuracy, and text classification, yet no values or selection procedure are stated.
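To make the free parameter concrete, the sketch below combines three loss terms with fixed scalar weights. A fixed weighted sum is the simplest stand-in and is not claimed to be the authors' strategy, which the abstract calls novel and does not specify; the term names, loss values, and weights are invented for illustration.

```python
def multitask_loss(losses, weights):
    """Weighted sum of named loss terms; every term needs an explicit weight."""
    missing = set(losses) - set(weights)
    if missing:
        raise KeyError(f"no weight given for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in losses.items())

# Made-up loss values and weights, for illustration only.
losses = {"realism": 0.8, "spatial": 0.5, "text_cls": 1.2}
weights = {"realism": 1.0, "spatial": 2.0, "text_cls": 0.5}
total = multitask_loss(losses, weights)
```

Reporting the `weights` dictionary, and the data split on which it was chosen, is the kind of detail the ledger entry above flags as missing.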

axioms (1)

domain assumption: Features from a music foundation model provide semantically aligned information that improves dance-music synchronization when injected into the diffusion process.
Invoked when the authors state that leveraging these features facilitates coherent and semantically aligned dance synthesis.

pith-pipeline@v0.9.0 · 5781 in / 1342 out tokens · 34700 ms · 2026-05-23T02:38:53.056910+00:00 · methodology

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TeMuDance: Contrastive Alignment-Based Textual Control for Music-Driven Dance Generation
cs.CV 2026-04 unverdicted novelty 7.0

TeMuDance enables text-based semantic control over music-conditioned dance generation by using motion as a bridge to align existing unpaired datasets and training a lightweight text branch on a frozen diffusion backbo...

ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body
cs.CV 2025-12 unverdicted novelty 7.0

ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on ...

PianoFlow: Music-Aware Streaming Piano Motion Generation with Bimanual Coordination
cs.CV 2026-04 unverdicted novelty 6.0

PianoFlow generates coordinated bimanual piano motions from audio via MIDI-distilled flow-matching, asymmetric role-gated interaction, and autoregressive streaming continuation, outperforming priors with 9x faster inference.

Listen to Rhythm, Choose Movements: Autoregressive Multimodal Dance Generation via Diffusion and Mamba with Decoupled Dance Dataset
cs.GR 2026-01 unverdicted novelty 6.0

LRCM is a new multimodal diffusion model with audio and text Conformers plus Motion Temporal Mamba for generating long, coherent dance sequences from rhythm and descriptions using a decoupled dataset.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · cited by 4 Pith papers · 4 internal anchors

[1]

Is dance a language? movement, meaning and commu- nication,

H. Bannerman, “Is dance a language? movement, meaning and commu- nication,”Dance Research, vol. 32, no. 1, pp. 65–80, 2014

work page 2014
[2]

Ai choreographer: Music conditioned 3d dance generation with aist++,

R. Li, S. Yang, D. A. Ross, and A. Kanazawa, “Ai choreographer: Music conditioned 3d dance generation with aist++,” inICCV, 2021, pp. 13 401–13 412

work page 2021
[3]

A brand new dance partner: Music-conditioned pluralistic dancing controlled by multiple dance genres,

J. Kim, H. Oh, S. Kim, H. Tong, and S. Lee, “A brand new dance partner: Music-conditioned pluralistic dancing controlled by multiple dance genres,” inCVPR, 2022, pp. 3490–3500

work page 2022
[4]

Bailando: 3d dance generation by actor-critic gpt with choreographic memory,

L. Siyao, W. Yu, T. Gu, C. Lin, Q. Wang, C. Qian, C. C. Loy, and Z. Liu, “Bailando: 3d dance generation by actor-critic gpt with choreographic memory,” inCVPR, 2022, pp. 11 050–11 059

work page 2022
[5]

Tm2d: Bimodality driven 3d dance generation via music-text integration,

K. Gong, D. Lian, H. Chang, C. Guo, Z. Jiang, X. Zuo, M. B. Mi, and X. Wang, “Tm2d: Bimodality driven 3d dance generation via music-text integration,” inICCV, 2023, pp. 9942–9952

work page 2023
[6]

Human motion diffusion model,

G. Tevet, S. Raab, B. Gordon, Y . Shafir, D. Cohen-Or, and A. H. Bermano, “Human motion diffusion model,” 2022

work page 2022
[7]

Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” NeurIPS, vol. 33, pp. 6840–6851, 2020

work page 2020
[8]

Edge: Editable dance generation from music,

J. Tseng, R. Castellon, and K. Liu, “Edge: Editable dance generation from music,” inCVPR, 2023, pp. 448–458

work page 2023
[9]

Diff- dance: Cascaded human motion diffusion model for dance generation,

Q. Qi, L. Zhuo, A. Zhang, Y . Liao, F. Fang, S. Liu, and S. Yan, “Diff- dance: Cascaded human motion diffusion model for dance generation,” inACM MM, 2023, pp. 1374–1382

work page 2023
[10]

DGFM: Full Body Dance Generation Driven by Music Foundation Models,

X. Liu, Z. Feng, D. Kanojia, and W. Wang, “DGFM: Full Body Dance Generation Driven by Music Foundation Models,” inNeurIPS 2024 Workshop on AI-Driven Speech, Music, and Sound Generation. 10

work page 2024
[11]

Lodge: A coarse to fine diffusion network for long dance generation guided by the characteristic dance primitives,

R. Li, Y . Zhang, Y . Zhang, H. Zhang, J. Guo, and et al, “Lodge: A coarse to fine diffusion network for long dance generation guided by the characteristic dance primitives,” inCVPR, 2024, pp. 1524–1534

work page 2024
[12]

Longdancediff: Long-term dance generation with conditional diffusion model,

S. Yang, Z. Yang, and Z. Wang, “Longdancediff: Long-term dance generation with conditional diffusion model,”arXiv preprint arXiv:2308.11945, 2023

work page arXiv 2023
[13]

Wav2clip: Learning robust audio representations from clip,

H.-H. Wu, P. Seetharaman, K. Kumar, and J. P. Bello, “Wav2clip: Learning robust audio representations from clip,” inICASSP, 2022, pp. 4563–4567

work page 2022
[14]

Finedance: A fine-grained choreography dataset for 3d full body dance generation,

R. Li, J. Zhao, Y . Zhang, M. Su, Z. Ren, H. Zhang, Y . Tang, and X. Li, “Finedance: A fine-grained choreography dataset for 3d full body dance generation,” inICCV, 2023, pp. 10 234–10 243

work page 2023
[15]

An audio-driven dancing avatar,

F. Ofli, Y . Demir, Y . Yemez, E. Erzin, A. M. Tekalp, K. Balcı,˙I. Kızo˘glu, L. Akarun, C. Canton-Ferrer, J. Tilmanneet al., “An audio-driven dancing avatar,”Journal on Multimodal User Interfaces, vol. 2, pp. 93– 103, 2008

work page 2008
[16]

Music content driven automated choreog- raphy with beat-wise motion connectivity constraints,

S. Fukayama and M. Goto, “Music content driven automated choreog- raphy with beat-wise motion connectivity constraints,”Proceedings of SMC, pp. 177–183, 2015

work page 2015
[17]

A deep learning framework for character motion synthesis and editing,

D. Holden, J. Saito, and T. Komura, “A deep learning framework for character motion synthesis and editing,”ACM Transactions on Graphics (TOG), vol. 35, no. 4, pp. 1–11, 2016

work page 2016
[18]

Action- agnostic human pose forecasting,

H. Chiu, E. Adeli, B. Wang, D.-A. Huang, and J. C. Niebles, “Action- agnostic human pose forecasting,” inWACV, 2019, pp. 1423–1432

work page 2019
[19]

Bio-lstm: A biome- chanically inspired recurrent neural network for 3-d pedestrian pose and gait prediction,

X. Du, R. Vasudevan, and M. Johnson-Roberson, “Bio-lstm: A biome- chanically inspired recurrent neural network for 3-d pedestrian pose and gait prediction,”IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 1501–1508, 2019

work page 2019
[20]

A bi-directional attention guided cross-modal network for music based dance generation,

D. Fan, L. Wan, W. Xu, and S. Wang, “A bi-directional attention guided cross-modal network for music based dance generation,”Computers and Electrical Engineering, vol. 103, p. 108310, 2022

work page 2022
[21]

Genre-conditioned long-term 3d dance generation driven by music,

Y . Huang, J. Zhang, S. Liu, Q. Bao, D. Zeng, Z. Chen, and W. Liu, “Genre-conditioned long-term 3d dance generation driven by music,” in ICASSP, 2022, pp. 4858–4862

work page 2022
[22]

Danceformer: Music con- ditioned 3d dance generation with parametric motion transformer,

B. Li, Y . Zhao, S. Zhelun, and L. Sheng, “Danceformer: Music con- ditioned 3d dance generation with parametric motion transformer,” in AAAI, vol. 36, no. 2, 2022, pp. 1272–1279

work page 2022
[23]

Mu- sic2dance: Dancenet for music-driven dance generation,

W. Zhuang, C. Wang, J. Chai, Y . Wang, M. Shao, and S. Xia, “Mu- sic2dance: Dancenet for music-driven dance generation,”ACM Trans- actions on Multimedia Computing, Communications, and Applications (TOMM), vol. 18, no. 2, pp. 1–21, 2022

work page 2022
[24]

Improved denoising diffusion proba- bilistic models,

A. Q. Nichol and P. Dhariwal, “Improved denoising diffusion proba- bilistic models,” inICML. PMLR, 2021, pp. 8162–8171

work page 2021
[25]

Diffusion models in vision: A survey,

F.-A. Croitoru, V . Hondru, R. T. Ionescu, and M. Shah, “Diffusion models in vision: A survey,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 9, pp. 10 850–10 869, 2023

work page 2023
[26]

Dreambooth: Fine tuning text-to-image diffusion models for subject- driven generation,

N. Ruiz, Y . Li, V . Jampani, Y . Pritch, M. Rubinstein, and K. Aberman, “Dreambooth: Fine tuning text-to-image diffusion models for subject- driven generation,” inCVPR, 2023, pp. 22 500–22 510

work page 2023
[27]

Hierarchical Text-Conditional Image Generation with CLIP Latents

A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,”arXiv preprint arXiv:2204.06125, vol. 1, no. 2, p. 3, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[28]

Audioldm 2: Learning holistic audio generation with self-supervised pretraining,

H. Liu, Y . Yuan, X. Liu, X. Mei, and et al, “Audioldm 2: Learning holistic audio generation with self-supervised pretraining,”IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 32, pp. 2871–2883, 2024

work page 2024
[29]

AudioLDM: Text-to-audio generation with latent diffusion models,

H. Liu, Z. Chen, Y . Yuan, and et al., “AudioLDM: Text-to-audio generation with latent diffusion models,”ICML, 2023

work page 2023
[30]

DiffWave: A Versatile Diffusion Model for Audio Synthesis

Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro, “Dif- fwave: A versatile diffusion model for audio synthesis,”arXiv preprint arXiv:2009.09761, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009
[31]

Latent diffusion for language generation,

J. Lovelace, V . Kishore, C. Wan, E. Shekhtman, and K. Q. Weinberger, “Latent diffusion for language generation,”NeurIPS, vol. 36, 2024

work page 2024
[32]

Diffusionbert: Improving generative masked language models with diffusion models,

Z. He, T. Sun, K. Wang, X. Huang, and X. Qiu, “Diffusionbert: Improving generative masked language models with diffusion models,” arXiv preprint arXiv:2211.15029, 2022

work page arXiv 2022
[33]

Improving diffusion models for inverse problems using manifold constraints,

H. Chung, B. Sim, D. Ryu, and J. C. Ye, “Improving diffusion models for inverse problems using manifold constraints,”NeurIPS, vol. 35, pp. 25 683–25 696, 2022

work page 2022
[34]

Diffusion models beat gans on image synthesis,

P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,”NeurIPS, vol. 34, pp. 8780–8794, 2021

work page 2021
[35]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, “Glide: Towards photorealistic image gen- eration and editing with text-guided diffusion models,”arXiv preprint arXiv:2112.10741, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[36]

High- resolution image synthesis with latent diffusion models,

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High- resolution image synthesis with latent diffusion models,” inCVPR, 2022, pp. 10 684–10 695

work page 2022
[37]

Blended diffusion for text- driven editing of natural images,

O. Avrahami, D. Lischinski, and O. Fried, “Blended diffusion for text- driven editing of natural images,” inCVPR, 2022, pp. 18 208–18 218

work page 2022
[38]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inICML. PMLR, 2021, pp. 8748–8763

work page 2021
[39]

Guided motion diffusion for controllable human motion synthesis,

K. Karunratanakul, K. Preechakul, S. Suwajanakorn, and S. Tang, “Guided motion diffusion for controllable human motion synthesis,” in ICCV, 2023, pp. 2151–2162

work page 2023
[40]

Listen, denoise, action! audio-driven motion synthesis with diffusion models,

S. Alexanderson, R. Nagy, J. Beskow, and G. E. Henter, “Listen, denoise, action! audio-driven motion synthesis with diffusion models,”ACM Transactions on Graphics (TOG), vol. 42, no. 4, pp. 1–20, 2023

work page 2023
[41]

Which tasks should be learned together in multi-task learning?

T. Standley, A. Zamir, D. Chen, L. Guibas, J. Malik, and S. Savarese, “Which tasks should be learned together in multi-task learning?” in ICML, 2020, pp. 9120–9132

work page 2020
[42]

End-to-end multi-task learning with attention,

S. Liu, E. Johns, and A. J. Davison, “End-to-end multi-task learning with attention,” inCVPR, 2019, pp. 1871–1880

work page 2019
[43]

Towards impartial multi-task learning

L. Liu, Y . Li, Z. Kuang, J. Xue, Y . Chen, W. Yang, Q. Liao, and W. Zhang, “Towards impartial multi-task learning.” iclr, 2021

work page 2021
[44]

Multi-Task Learning as a Bargaining Game,

A. Navon, A. Shamsian, I. Achituve, H. Maron, K. Kawaguchi, G. Chechik, and E. Fetaya, “Multi-Task Learning as a Bargaining Game,” inICML, 2022, pp. 16 428–16 446

work page 2022
[45]

Two-Person Cooperative Games,

J. Nash, “Two-Person Cooperative Games,”Econometrica, vol. 21, no. 1, pp. 128–140, 1953

work page 1953
[46]

Independent component alignment for multi-task learning,

D. Senushkin, N. Patakin, A. Kuznetsov, and A. Konushin, “Independent component alignment for multi-task learning,” inCVPR, 2023, pp. 20 083–20 093

work page 2023
[47]

Bayesian uncertainty for gradient aggregation in multi-task learning,

I. Achituve, I. Diamant, A. Netzer, G. Chechik, and E. Fetaya, “Bayesian uncertainty for gradient aggregation in multi-task learning,”arXiv preprint arXiv:2402.04005, 2024

work page arXiv 2024
[48]

A modulation module for multi-task learning with applications in image retrieval,

X. Zhao, H. Li, X. Shen, X. Liang, and Y . Wu, “A modulation module for multi-task learning with applications in image retrieval,” inECCV, September 2018

work page 2018
[49]

Fast graspnext: A fast self-attention neural network architecture for multi-task learning in computer vision tasks for robotic grasping on the edge,

A. Wong, Y . Wu, S. Abbasi, S. Nair, Y . Chen, and M. J. Shafiee, “Fast graspnext: A fast self-attention neural network architecture for multi-task learning in computer vision tasks for robotic grasping on the edge,” in CVPR Workshops, June 2023, pp. 2293–2297

work page 2023
[50]

A multi-task learning framework for quality estimation,

S. Deoghare, P. Choudhary, D. Kanojia, T. Ranasinghe, P. Bhattacharyya, and C. Orasan, “A multi-task learning framework for quality estimation,” inFindings of the Association for Computational Linguistics: ACL 2023, 2023, pp. 9191–9205

work page 2023
[51]

A multi-task learning framework for evaluating machine translation of emotion-loaded user- generated content,

S. Qian, C. Or ˘asan, D. Kanojia, and F. d. Carmo, “A multi-task learning framework for evaluating machine translation of emotion-loaded user- generated content,”arXiv preprint arXiv:2410.03277, 2024

work page arXiv 2024
[52]

Smpl: A skinned multi-person linear model,

M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, “Smpl: A skinned multi-person linear model,” inSeminal Graphics Papers: Pushing the Boundaries, Volume 2, 2023, pp. 851–866

work page 2023
[53]

Comparative analysis of audio classification with mfcc and stft features using machine learning techniques,

M. K. Gourisaria, R. Agrawal, and et al, “Comparative analysis of audio classification with mfcc and stft features using machine learning techniques,”Discover Internet of Things, vol. 4, no. 1, p. 1, 2024

work page 2024
[54]

librosa: Audio and music signal analysis in python

B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto, “librosa: Audio and music signal analysis in python.” in SciPy, 2015, pp. 18–24

work page 2015
[55]

Learning to prompt for vision- language models,

K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision- language models,”IJCV, vol. 130, no. 9, pp. 2337–2348, 2022

work page 2022
[56]

Film: Visual reasoning with a general conditioning layer,

E. Perez, F. Strub, H. De Vries, and et al, “Film: Visual reasoning with a general conditioning layer,” inAAAI, vol. 32, no. 1, 2018

work page 2018
[57]

Dance revolution: Long-term dance generation with music via curriculum learning,

R. Huang, H. Hu, W. Wu, K. Sawada, M. Zhang, and D. Jiang, “Dance revolution: Long-term dance generation with music via curriculum learning,”arXiv preprint arXiv:2006.06119, 2020

work page arXiv 2006
[58]

Gans trained by a two time-scale update rule converge to a local nash equilibrium,

M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,”NeurIPS, vol. 30, 2017

work page 2017
[59]

Popdg: Popular 3d dance generation with popdanceset,

Z. Luo, M. Ren, X. Hu, Y . Huang, and L. Yao, “Popdg: Popular 3d dance generation with popdanceset,” inCVPR, 2024, pp. 26 984–26 993

[60]

J. Kim, H. Oh, S. Kim, H. Tong, and S. Lee, “A brand new dance partner: Music-conditioned pluralistic dancing controlled by multiple dance genres,” in CVPR, June 2022, pp. 3490–3500.

[61]

W. Zhuang, C. Wang, S. Xia, J. Chai, and Y. Wang, “Music2dance: Music-driven dance generation using wavenet,” arXiv preprint arXiv:2002.03761, vol. 3, no. 4, p. 6, 2020.

[62]

M. Müller, T. Röder, and M. Clausen, “Efficient content-based retrieval of motion capture data,” in ACM SIGGRAPH, 2005, pp. 677–685.

[63]

Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in ICASSP, 2023, pp. 1–5.

[64]

A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” NeurIPS, vol. 33, pp. 12 449–12 460, 2020.

[65]

P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever, “Jukebox: A generative model for music,” arXiv preprint arXiv:2005.00341, 2020.

