OmniDance: Multimodal Driven Dance Video Generation with Large-scale Internet Data

Chubin Chen; Hongyan Liu; Jiahong Wu; Jiashu Zhu; Jun He; Kaixing Yang; Puwei Wang; Xiangxiang Chu; Xiangyue Zhang; Xulong Tang

arxiv: 2606.30019 · v1 · pith:MSIAMFRBnew · submitted 2026-06-29 · 💻 cs.CV

OmniDance: Multimodal Driven Dance Video Generation with Large-scale Internet Data

Kaixing Yang , Jiashu Zhu , Xulong Tang , Ziqiao Peng , Xiangyue Zhang , Chubin Chen , Puwei Wang , Jiahong Wu

show 3 more authors

Xiangxiang Chu Hongyan Liu Jun He

This is my paper

Pith reviewed 2026-06-30 06:33 UTC · model grok-4.3

classification 💻 cs.CV

keywords dance video generationmultimodal conditioningmusic-driven synthesistext-to-video modelscurriculum learninglarge-scale datasetconditional video generationdepth-aware architecture

0 comments

The pith

OmniDance adds music conditioning to text-image-to-video models through specialized architecture and training, using a new 300k-clip dance dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome limited dance datasets and weak music integration in video foundation models by releasing CIPE-Dance, a 300k-clip collection sourced from the internet and annotated with choreography-informed text via an expert pipeline. It then presents OmniDance as a set of co-designed components that treat music as high-frequency temporal dynamics complementing the low-frequency semantics of text, allowing one model to handle text-only, music-only, or joint inputs. A sympathetic reader would care because the approach claims to deliver state-of-the-art results on dance video generation while preserving the base model's controllability and visual quality across diverse genres and environments.

Core claim

OmniDance co-designs a depth-aware specialization architecture, an anchored easy-to-hard curriculum learning strategy, and a modality-specialized time-dependent CFG strategy to integrate music into a TI2V foundation model, enabling unified TI2V, MI2V, and MTI2V dance video generation that achieves state-of-the-art performance on the CIPE-Dance dataset of 300k high-quality clips spanning 400 hours.

What carries the argument

The depth-aware specialization architecture together with anchored easy-to-hard curriculum learning and modality-specialized time-dependent CFG, which lets music supply high-frequency temporal alignment without eroding text controllability or visual fidelity.

If this is right

A single model can now perform text-driven, music-driven, or joint multimodal dance video generation without separate retraining or loss of original capabilities.
Training on 300k internet-sourced clips with expert text annotations produces measurable improvements across all three task variants.
The complementary-frequency view of text and music leads to concrete architectural and scheduling choices that maintain robustness when modalities are combined or ablated.
The same framework-level recipe can be applied to any TI2V foundation model while keeping its original controllability intact.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The curriculum and CFG specialization pattern could transfer to other temporal signals such as speech or environmental audio in general video generation.
The expert-annotation pipeline for internet video may reduce reliance on synthetic data for other motion-centric domains like sports or sign language.
If the high-frequency role of music generalizes, similar modality splitting might improve efficiency in training large multimodal models for non-dance tasks.

Load-bearing premise

The CIPE-Dance dataset built through the progressive expert pipeline supplies sufficiently high-quality and unbiased training data that the reported gains come from the framework rather than hidden dataset artifacts or domain shifts.

What would settle it

An independent replication on a fresh dance video test set collected outside the CIPE-Dance pipeline where OmniDance fails to exceed prior methods on quantitative measures of music-motion alignment, visual fidelity, and multimodal controllability.

Figures

Figures reproduced from arXiv: 2606.30019 by Chubin Chen, Hongyan Liu, Jiahong Wu, Jiashu Zhu, Jun He, Kaixing Yang, Puwei Wang, Xiangxiang Chu, Xiangyue Zhang, Xulong Tang, Ziqiao Peng.

**Figure 1.** Figure 1: Given a reference image, OmniDance supports multimodal driven dance video generation, including music only (MI2V, right), text only (TI2V, middle), and both (MTI2V, left). OmniDance not only preserves identity and visual fidelity, but also produces artistically expressive and music-aligned motion. tasks, while exhibiting robust multimodal integration capability. Project is available at https://github.com/A… view at source ↗

**Figure 2.** Figure 2: Overview of the Progressive Expert-Based Data Collection Pipeline. 2.2 Music-Driven Dance Video Generation Finally, research on music-driven dance video generation remains limited and can be broadly categorized into three paradigms: 2D keypoint–based generation, 3D SMPL-based generation, and end-to-end video generation. (1) 2D keypointbased generation. X-Dancer [7], M2PE-Diff [26], and STG-Mamba [33] firs… view at source ↗

**Figure 3.** Figure 3: Dataset analysis. (Left) Display of typical annotations; (Middle) Dance genre distribution from a random sample of 400 entries; (Right) Word cloud summarizing the caption content. Thomas flair). (3) Expressiveness. Characterizes the overall emotional expression and performance intent conveyed by the dancer (e.g., joyful, playful, confident, melancholic), reflected through facial expression, posture, ener… view at source ↗

**Figure 4.** Figure 4: Overview of OmniDance: music–text progressive specialization model architecture, curriculum learning training strategy, and modality-specialized inference strategy. 4 Methodology 4.1 Overview To effectively integrate music conditioning into a Video Generation Foundation Model without degrading its original generative capabilities, OmniDance introduces a three-level design: (1) a music–text progressive sp… view at source ↗

**Figure 5.** Figure 5: Comparison with SOTAs on Text-Image-to-Video (TI2V, left) and MusicImage-to-Video (MI2V, right) tasks on the CIPE-Dance dataset. beats, while maintaining stable identity and framing. In contrast, Hallo2 suffers from degraded visual quality with noticeable artifacts and blur; Echomimic-V3 often yields oversimplified and occasionally distorted motions; WAN-S2V tends to lose dancer identity over time (e.g., … view at source ↗

read the original abstract

Music-driven dance video generation aims to synthesize expressive human motion that is temporally aligned with music while maintaining high visual fidelity. Despite recent progress, existing methods still face two key limitations: the lack of large-scale, high-quality dance video datasets, and the absence of principled frameworks for integrating music as a complementary conditioning signal into Video Generation Foundation Models. To address these limitations, we introduce CIPE-Dance, a large-scale Internet-sourced dance video dataset with choreography-informed text annotations, constructed via a progressive expert pipeline. To the best of our knowledge, CIPE-Dance is the largest dataset for dance video generation to date, comprising 300k high-quality clips over 400 hours and covering diverse dancers, environments, and dance genres. We further propose OmniDance, a framework-level recipe for integrating music into a TI2V foundation model without sacrificing its original controllability or visual fidelity. Motivated by the complementary roles of text as low-frequency semantics and music as high-frequency temporal dynamics, OmniDance co-designs a depth-aware specialization architecture, an anchored easy-to-hard curriculum learning strategy, and a modality-specialized time-dependent CFG strategy, enabling unified TI2V, MI2V, and MTI2V generation. Extensive experiments on CIPE-Dance demonstrate that OmniDance achieves state-of-the-art performance across all three tasks and exhibits robust multimodal integration capability. Project is available at https://github.com/AMAP-ML/OmniDance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper releases a large new dance dataset and a concrete recipe for adding music to TI2V models, but all SOTA numbers stay inside that dataset.

read the letter

The main things here are CIPE-Dance, a 300k-clip internet-sourced dance collection with choreography text, and OmniDance, which adds three targeted changes (depth-aware specialization, anchored curriculum, time-dependent CFG) to let a text-to-video model also condition on music.

They do a clean job explaining the split between text for semantics and music for timing, then building the three pieces around it. Releasing the data and code on GitHub is the part that actually helps other people.

The soft spot is exactly what the stress-test note flags: every comparison is on the CIPE-Dance test split. No numbers on older dance sets, no re-run of published baselines on this data. That leaves open how much of the reported gains come from the method versus from training and testing on the same distribution. The abstract gives no metrics or ablations, so the full paper needs to show those clearly.

This is for groups working on music-driven video or dance synthesis who want more data. It deserves a serious referee because the dataset scale and the integration approach are concrete enough to review, even if the experiments need external checks.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces CIPE-Dance, a 300k-clip (400-hour) Internet-sourced dance video dataset with choreography-informed text annotations built via a progressive expert pipeline, and proposes OmniDance, a framework that augments a TI2V foundation model with music conditioning through depth-aware specialization, an anchored easy-to-hard curriculum, and modality-specialized time-dependent classifier-free guidance. This enables unified TI2V, MI2V, and MTI2V generation while preserving base-model controllability. Extensive experiments on the CIPE-Dance test split are reported to establish state-of-the-art performance across all three tasks together with robust multimodal integration.

Significance. If the results hold under broader scrutiny, the work supplies the largest public dance-video resource to date and a concrete co-design recipe for injecting high-frequency temporal signals (music) into existing video-generation backbones without catastrophic forgetting of text controllability. The explicit motivation linking text to low-frequency semantics and music to high-frequency dynamics, combined with the dataset scale, could serve as a useful reference point for subsequent multimodal video synthesis research.

major comments (2)

[Experiments] Experiments section: All reported SOTA margins for TI2V/MI2V/MTI2V are obtained exclusively on the internal CIPE-Dance test split whose distribution, motion statistics, and annotation style are defined by the same progressive expert pipeline used for training data. No quantitative results on prior public dance corpora (e.g., AIST++) nor re-evaluation of published baselines on CIPE-Dance are supplied, leaving open the possibility that the observed gains arise from alignment between the three co-designed components and dataset-specific biases rather than intrinsic superiority of the multimodal integration strategy.
[§4 and Experiments] §4 (Method) and Experiments: The central claim that the depth-aware specialization, anchored curriculum, and time-dependent CFG together enable “robust multimodal integration” without sacrificing original controllability rests on the joint training/inference recipe, yet no isolated ablation tables quantify the marginal contribution of each component (or their interactions) to the final metrics. This is load-bearing for attributing performance gains to the proposed framework rather than to dataset scale alone.

minor comments (2)

[Abstract] The abstract states that CIPE-Dance is “the largest dataset … to date” but supplies no explicit comparison table against prior dance-video corpora in terms of total duration, genre coverage, or annotation granularity.
[Introduction] Notation for the three tasks (TI2V, MI2V, MTI2V) is introduced without an early definition or diagram clarifying the exact input modalities for each.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We address each major comment point by point below.

read point-by-point responses

Referee: [Experiments] Experiments section: All reported SOTA margins for TI2V/MI2V/MTI2V are obtained exclusively on the internal CIPE-Dance test split whose distribution, motion statistics, and annotation style are defined by the same progressive expert pipeline used for training data. No quantitative results on prior public dance corpora (e.g., AIST++) nor re-evaluation of published baselines on CIPE-Dance are supplied, leaving open the possibility that the observed gains arise from alignment between the three co-designed components and dataset-specific biases rather than intrinsic superiority of the multimodal integration strategy.

Authors: We agree that this is a substantive concern. CIPE-Dance is presented as the largest public dance video dataset, and the experiments were designed to evaluate on its held-out test split to demonstrate performance at scale. However, the absence of cross-dataset results and baseline re-evaluations on CIPE-Dance does leave open the possibility of dataset-specific effects. In the revised manuscript we will add (i) re-evaluation of the published baselines on the CIPE-Dance test split and (ii) quantitative results of OmniDance on AIST++ (where motion and annotation differences permit direct comparison). These additions will allow readers to better assess whether the reported gains are attributable to the proposed multimodal integration strategy. revision: yes
Referee: [§4 and Experiments] §4 (Method) and Experiments: The central claim that the depth-aware specialization, anchored curriculum, and time-dependent CFG together enable “robust multimodal integration” without sacrificing original controllability rests on the joint training/inference recipe, yet no isolated ablation tables quantify the marginal contribution of each component (or their interactions) to the final metrics. This is load-bearing for attributing performance gains to the proposed framework rather than to dataset scale alone.

Authors: We concur that isolated ablations are necessary to attribute performance to the individual components rather than to dataset scale. The current manuscript reports only the full-model results. In the revision we will insert a dedicated ablation subsection that quantifies the marginal contribution of depth-aware specialization, the anchored easy-to-hard curriculum, and modality-specialized time-dependent CFG, together with their pairwise and triple interactions, using the same metrics and test split. revision: yes

Circularity Check

0 steps flagged

No derivation chain present; claims are empirical outcomes of dataset construction and training

full rationale

The paper introduces CIPE-Dance via a progressive expert pipeline and proposes OmniDance as a co-designed framework for multimodal integration. All performance claims are presented as results of 'extensive experiments on CIPE-Dance' with no equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations that reduce the central result to its inputs by construction. The evaluation is internal to the new dataset, which is a common empirical limitation but does not match any enumerated circularity pattern (self-definitional, fitted-input-called-prediction, etc.). The derivation chain is therefore self-contained as standard ML experimentation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no equations, fitted constants, or explicit assumptions beyond high-level descriptions of dataset curation and model co-design; no free parameters, axioms, or invented entities are identifiable.

pith-pipeline@v0.9.1-grok · 5826 in / 1263 out tokens · 39723 ms · 2026-06-30T06:33:13.216120+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

54 extracted references · 25 canonical work pages · 9 internal anchors

[1]

Advances in neural information processing systems33, 12449–12460 (2020)

Baevski, A., Zhou, Y., Mohamed, A., Auli, M.: wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems33, 12449–12460 (2020)

2020
[2]

Qwen3-VL Technical Report

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Research in dance education5(1), 45–67 (2004)

Butterworth*, J.: Teaching choreography in higher education: A process continuum model. Research in dance education5(1), 45–67 (2004)

2004
[4]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Chen, C., Hu, S., Zhu, J., Wu, M., Chen, J., Li, Y., Huang, N., Fang, C., Wu, J., Chu, X., et al.: Taming preference mode collapse via directional decoupling alignment in diffusion reinforcement learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12775–12786 (2026)

2026
[5]

arXiv preprint arXiv:2508.12880 (2025)

Chen, C., Zhu, J., Feng, X., Huang, N., Wu, M., Mao, F., Wu, J., Chu, X., Li, X.: S2-Guidance: Stochastic self guidance for training-free enhancement of diffusion models. arXiv preprint arXiv:2508.12880 (2025)

work page arXiv 2025
[6]

In: Proceedings of the 33rd ACM International Conference on Multimedia

Chen, R., Sun, L., Tang, J., Li, G., Chu, X.: Finger: Content aware fine-grained evaluation with reasoning for ai-generated videos. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 3517–3526 (2025)

2025
[7]

arXiv preprint arXiv:2502.17414 (2025)

Chen,Z.,Xu,H.,Song,G.,Xie,Y.,Zhang,C.,Chen,X.,Wang,C.,Chang,D.,Luo, L.: X-dancer: Expressive music to human dance video generation. arXiv preprint arXiv:2502.17414 (2025)

work page arXiv 2025
[8]

arXiv preprint arXiv:2509.14055 (2025)

Cheng, G., Gao, X., Hu, L., Hu, S., Huang, M., Ji, C., Li, J., Meng, D., Qi, J., Qiao, P., et al.: Wan-animate: Unified character animation and replacement with holistic replication. arXiv preprint arXiv:2509.14055 (2025)

work page arXiv 2025
[9]

arXiv preprint arXiv:2410.07718 (2024)

Cui, J., Li, H., Yao, Y., Zhu, H., Shang, H., Cheng, K., Zhou, H., Zhu, S., Wang, J.: Hallo2: Long-duration and high-resolution audio-driven portrait image animation. arXiv preprint arXiv:2410.07718 (2024)

work page arXiv 2024
[10]

arXiv preprint arXiv:2501.18801 (2025)

Dong, Z., Hao, W., Wang, J.C., Zhang, P., Polak, P.: Every image listens, ev- ery image dances: Music-driven image animation. arXiv preprint arXiv:2501.18801 (2025)

work page arXiv 2025
[11]

arXiv e-prints pp

Feng, X., Yu, H., Wu, M., Hu, S., Chen, J., Zhu, C., Wu, J., Chu, X., Huang, K.: Narrlv: Towards a comprehensive narrative-centric evaluation for long video generation models. arXiv e-prints pp. arXiv–2507 (2025)

2025
[12]

arXiv preprint arXiv:2508.18621 (2025)

Gao, X., Hu, L., Hu, S., Huang, M., Ji, C., Meng, D., Qi, J., Qiao, P., Shen, Z., Song, Y., et al.: Wan-s2v: Audio-driven cinematic video generation. arXiv preprint arXiv:2508.18621 (2025)

work page arXiv 2025
[13]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Gong, K., Lian, D., Chang, H., Guo, C., Jiang, Z., Zuo, X., Mi, M.B., Wang, X.: Tm2d: Bimodality driven 3d dance generation via music-text integration. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9942–9952 (2023)

2023
[14]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Hong, S., Kemelmacher-Shlizerman, I., Curless, B., Seitz, S.M.: Musicinfuser: Mak- ing video diffusion listen and dance. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 43751–43761 (June 2026)

2026
[15]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Hu, L.: Animate anyone: Consistent and controllable image-to-video synthesis for character animation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8153–8163 (2024)

2024
[16]

Embedding-perturbed Exploration Preference Optimization for Flow Models

Hu, S., Chen, C., Zhu, J., Wu, J., Chu, X., Li, X.: Embedding-perturbed explo- ration preference optimization for flow models. arXiv preprint arXiv:2605.15803 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[17]

Jia, T., Yang, K., Yang, X., Tang, X., Qiu, K., Wei, S., Zhao, Y.: Bitdiff: Fine- grained3dconductingmotiongenerationviabimamba-transformerdiffusion.arXiv preprint arXiv:2604.04395 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[18]

Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., Wu, K., Lin, Q., Yuan, J., Long, Y., Wang, A., Wang, A., Li, C., Huang, D., Yang, F., Tan, H., Wang, H., Song, J., Bai, J., Wu, J., Xue, J., Wang, OmniDance 17 J., Wang, K., Liu, M., Li, P., Li, S., Wang, W., Yu, W., Deng, X., Li, Y., Chen, Y., Cui, Y., Peng, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

arXiv preprint arXiv:2510.12586 (2025)

Lei, J., Liu, K., Berner, J., Yu, H., Zheng, H., Wu, J., Chu, X.: There is no vae: End-to-end pixel-space generative modeling via self-supervised pre-training. arXiv preprint arXiv:2510.12586 (2025)

work page arXiv 2025
[20]

In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision

Li, H., Wang, Y., Huang, T., Huang, H., Wang, H., Chu, X.: Ld-rps: Zero-shot unified image restoration via latent diffusion recurrent posterior sampling. In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13684–13694 (2025)

2025
[21]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Li, K., Wang, Y., He, Y., Li, Y., Wang, Y., Liu, Y., Wang, Z., Xu, J., Chen, G., Luo, P., et al.: Mvbench: A comprehensive multi-modal video understanding benchmark. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22195–22206 (2024)

2024
[22]

In: Proceed- ings of the IEEE/CVF International Conference on Computer Vision

Li, R., Zhao, J., Zhang, Y., Su, M., Ren, Z., Zhang, H., Tang, Y., Li, X.: Finedance: A fine-grained choreography dataset for 3d full body dance generation. In: Proceed- ings of the IEEE/CVF International Conference on Computer Vision. pp. 10234– 10243 (2023)

2023
[23]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Li, R., Yang, S., Ross, D.A., Kanazawa, A.: Ai choreographer: Music conditioned 3d dance generation with aist++. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13401–13412 (2021)

2021
[24]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Ling, X., Zhu, C., Wu, M., Li, H., Feng, X., Yang, C., Hao, A., Zhu, J., Wu, J., Chu, X.: Vmbench: A benchmark for perception-aligned video motion generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13087–13098 (2025)

2025
[25]

arXiv preprint arXiv:2507.03905 (2025)

Meng, R., Wang, Y., Wu, W., Zheng, R., Li, Y., Ma, C.: Echomimicv3: 1.3 b pa- rameters are all you need for unified multi-modal and multi-task human animation. arXiv preprint arXiv:2507.03905 (2025)

work page arXiv 2025
[26]

In: Proceedings of the 33rd ACM International Conference on Multimedia

Park, N.T.: M2pe-diff: Music-to-pose encoder for dance video generation lever- aging latent diffusion framework. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 2419–2428 (2025)

2025
[27]

arXiv preprint arXiv:2512.19546 (2025)

Peng, Z., Chen, Y., Ma, Y., Zhang, G., Sun, Z., Zhou, Z., Zhang, Y., Zhou, Z., Fan, Z., Liu, H., et al.: Actavatar: Temporally-aware precise action control for talking avatars. arXiv preprint arXiv:2512.19546 (2025)

work page arXiv 2025
[28]

arXiv preprint arXiv:2506.14742 (2025)

Peng, Z., Hu, W., Ma, J., Zhu, X., Zhang, X., Zhao, H., Tian, H., He, J., Liu, H., Fan, Z.: Synctalk++: High-fidelity and efficient synchronized talking heads synthesis using gaussian splatting. arXiv preprint arXiv:2506.14742 (2025)

work page arXiv 2025
[29]

In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion

Peng, Z., Hu, W., Shi, Y., Zhu, X., Zhang, X., Zhao, H., He, J., Liu, H., Fan, Z.: Synctalk: The devil is in the synchronization for talking head synthesis. In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion. pp. 666–676 (2024)

2024
[30]

arXiv preprint arXiv:2505.21448 (2025)

Peng, Z., Liu, J., Zhang, H., Liu, X., Tang, S., Wan, P., Zhang, D., Liu, H., He, J.: Omnisync: Towards universal lip synchronization via diffusion transformers. arXiv preprint arXiv:2505.21448 (2025)

work page arXiv 2025
[31]

In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Siyao, L., Yu, W., Gu, T., Lin, C., Wang, Q., Qian, C., Loy, C.C., Liu, Z.: Bailando: 3d dance generation by actor-critic gpt with choreographic memory. In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11050–11059 (2022) 18 K. Yang et al

2022
[32]

arXiv preprint arXiv:2410.10306 (2024)

Tan, S., Gong, B., Wang, X., Zhang, S., Zheng, D., Zheng, R., Zheng, K., Chen, J., Yang, M.: Animate-x: Universal character image animation with enhanced motion representation. arXiv preprint arXiv:2410.10306 (2024)

work page arXiv 2024
[33]

IEEE transactions on pattern anal- ysis and machine intelligence (2025)

Tang, H., Shao, L., Zhang, Z., Van Gool, L., Sebe, N.: Spatial-temporal graph mamba for music-guided dance video synthesis. IEEE transactions on pattern anal- ysis and machine intelligence (2025)

2025
[34]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Tseng, J., Castellon, R., Liu, K.: Edge: Editable dance generation from music. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 448–458 (2023)

2023
[35]

Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., Wang, J., Zhang, J., Zhou, J., Wang, J., Chen, J., Zhu, K., Zhao, K., Yan, K., Huang, L., Feng, M., Zhang, N., Li, P., Wu, P., Chu, R., Feng, R., Zhang, S., Sun, S., Fang, T., Wang, T., Gui, T., Weng, T., Shen, T., Lin, W., Wang, W., Wang, W., Zhou, W.,...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

In: Proceedings of the 33rd ACM International Conference on Multimedia

Wang, X., Wang, H., Cai, W.: Choreomuse: Robust music-to-dance video gener- ation with style transfer and beat-adherent motion. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 7912–7921 (2025)

2025
[37]

In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

Wang, X., Wang, H., Liu, D., Cai, W.: Dance any beat: Blending beats with visuals in dance video generation. In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 5136–5146. IEEE (2025)

2025
[38]

Advances in neural information processing systems35, 38571–38584 (2022)

Xu, Y., Zhang, J., Zhang, Q., Tao, D.: Vitpose: Simple vision transformer baselines for human pose estimation. Advances in neural information processing systems35, 38571–38584 (2022)

2022
[39]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Xu, Z., Zhang, J., Liew, J.H., Yan, H., Liu, J.W., Zhang, C., Feng, J., Shou, M.Z.: Magicanimate: Temporally consistent human image animation using diffu- sion model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1481–1490 (2024)

2024
[40]

In: Proceedings of the 2024 International Conference on Multimedia Retrieval

Yang, K., Tang, X., Diao, R., Liu, H., He, J., Fan, Z.: Codancers: Music-driven coherent group dance generation with choreographic unit. In: Proceedings of the 2024 International Conference on Multimedia Retrieval. pp. 675–683 (2024)

2024
[41]

arXiv preprint arXiv:2505.14222 (2025)

Yang, K., Tang, X., Hu, Y., Yang, J., Liu, H., Zhang, Q., He, J., Fan, Z.: Match- dance: Collaborative mamba-transformer architecture matching for high-quality 3d dance synthesis. arXiv preprint arXiv:2505.14222 (2025)

work page arXiv 2025
[42]

arXiv preprint arXiv:2505.17543 (2025)

Yang, K., Tang, X., Peng, Z., Hu, Y., He, J., Liu, H.: Megadance: Mixture- of-experts architecture for genre-aware 3d dance generation. arXiv preprint arXiv:2505.17543 (2025)

work page arXiv 2025
[43]

FlowerDance: MeanFlow for Efficient and Refined 3D Dance Generation

Yang, K., Tang, X., Peng, Z., Zhang, X., Wang, P., He, J., Liu, H.: Flower- dance: Meanflow for efficient and refined 3d dance generation. arXiv preprint arXiv:2511.21029 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[44]

arXiv preprint arXiv:2412.19123 (2024)

Yang, K., Tang, X., Wu, H., Xue, Q., Qin, B., Liu, H., Fan, Z.: Cohedancers: Enhancing interactive group dance generation through music-driven coherence de- composition. arXiv preprint arXiv:2412.19123 (2024)

work page arXiv 2024
[45]

In: Proceedings of the 2024 International Conference on Multimedia Retrieval

Yang, K., Zhou, X., Tang, X., Diao, R., Liu, H., He, J., Fan, Z.: Beatdance: A beat- based model-agnostic contrastive learning framework for music-dance retrieval. In: Proceedings of the 2024 International Conference on Multimedia Retrieval. pp. 11–19 (2024) OmniDance 19

2024
[46]

MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation

Yang, K., Zhu, J., Tang, X., Peng, Z., Zhang, X., Wang, P., Wu, J., Chu, X., Liu, H., He, J.: Mace-dance: Motion-appearance cascaded experts for music-driven dance video generation. arXiv preprint arXiv:2512.18181 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[47]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[48]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Yang, Z., Yang, K., Tang, X.: Tokendance: Token-to-token music-to-dance gener- ation with bidirectional mamba. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5520–5529 (2026)

2026
[49]

PersonaGesture: Single-Reference Co-Speech Gesture Personalization for Unseen Speakers

Zhang, X., Cai, Y., Li, K., Yang, K., Zhou, Y., Li, Z., Chu, X., Zhang, J., Liu, H.: Personagesture: Single-reference co-speech gesture personalization for unseen speakers. arXiv preprint arXiv:2605.06064 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[50]

IEEE Transactions on Circuits and Systems for Video Technology (2025)

Zhang, X., Jia, Y., Zhang, J., Yang, Y., Tu, Z.: Robust 2d skeleton action recogni- tion via decoupling and distilling 3d latent features. IEEE Transactions on Circuits and Systems for Video Technology (2025)

2025
[51]

In: Pro- ceedings of the AAAIConference on Artificial Intelligence.vol

Zhang, X., Li, J., Ren, J., Zhang, J.: Mitigating error accumulation in co-speech motion generation via global rotation diffusion and multi-level constraints. In: Pro- ceedings of the AAAIConference on Artificial Intelligence.vol. 40, pp. 12834–12842 (2026)

2026
[52]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Zhang, X., Li, J., Zhang, J., Dang, Z., Ren, J., Bo, L., Tu, Z.: Semtalk: Holistic co-speech motion generation with frame-level semantic emphasis. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13761–13771 (2025)

2025
[53]

In: Pro- ceedings of the 33rd ACM International Conference on Multimedia

Zhang, X., Li, J., Zhang, J., Ren, J., Bo, L., Tu, Z.: Echomask: Speech-queried attention-based mask modeling for holistic co-speech motion generation. In: Pro- ceedings of the 33rd ACM International Conference on Multimedia. pp. 10827– 10836 (2025)

2025
[54]

arXiv preprint arXiv:2603.29655 (2026)

Zhou, P., Zhang, X., Shen, X., Hu, Y.: Not all frames are equal: Complexity- aware masked motion generation via motion spectral descriptors. arXiv preprint arXiv:2603.29655 (2026)

work page arXiv 2026

[1] [1]

Advances in neural information processing systems33, 12449–12460 (2020)

Baevski, A., Zhou, Y., Mohamed, A., Auli, M.: wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems33, 12449–12460 (2020)

2020

[2] [2]

Qwen3-VL Technical Report

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Research in dance education5(1), 45–67 (2004)

Butterworth*, J.: Teaching choreography in higher education: A process continuum model. Research in dance education5(1), 45–67 (2004)

2004

[4] [4]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Chen, C., Hu, S., Zhu, J., Wu, M., Chen, J., Li, Y., Huang, N., Fang, C., Wu, J., Chu, X., et al.: Taming preference mode collapse via directional decoupling alignment in diffusion reinforcement learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12775–12786 (2026)

2026

[5] [5]

arXiv preprint arXiv:2508.12880 (2025)

Chen, C., Zhu, J., Feng, X., Huang, N., Wu, M., Mao, F., Wu, J., Chu, X., Li, X.: S2-Guidance: Stochastic self guidance for training-free enhancement of diffusion models. arXiv preprint arXiv:2508.12880 (2025)

work page arXiv 2025

[6] [6]

In: Proceedings of the 33rd ACM International Conference on Multimedia

Chen, R., Sun, L., Tang, J., Li, G., Chu, X.: Finger: Content aware fine-grained evaluation with reasoning for ai-generated videos. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 3517–3526 (2025)

2025

[7] [7]

arXiv preprint arXiv:2502.17414 (2025)

Chen,Z.,Xu,H.,Song,G.,Xie,Y.,Zhang,C.,Chen,X.,Wang,C.,Chang,D.,Luo, L.: X-dancer: Expressive music to human dance video generation. arXiv preprint arXiv:2502.17414 (2025)

work page arXiv 2025

[8] [8]

arXiv preprint arXiv:2509.14055 (2025)

Cheng, G., Gao, X., Hu, L., Hu, S., Huang, M., Ji, C., Li, J., Meng, D., Qi, J., Qiao, P., et al.: Wan-animate: Unified character animation and replacement with holistic replication. arXiv preprint arXiv:2509.14055 (2025)

work page arXiv 2025

[9] [9]

arXiv preprint arXiv:2410.07718 (2024)

Cui, J., Li, H., Yao, Y., Zhu, H., Shang, H., Cheng, K., Zhou, H., Zhu, S., Wang, J.: Hallo2: Long-duration and high-resolution audio-driven portrait image animation. arXiv preprint arXiv:2410.07718 (2024)

work page arXiv 2024

[10] [10]

arXiv preprint arXiv:2501.18801 (2025)

Dong, Z., Hao, W., Wang, J.C., Zhang, P., Polak, P.: Every image listens, ev- ery image dances: Music-driven image animation. arXiv preprint arXiv:2501.18801 (2025)

work page arXiv 2025

[11] [11]

arXiv e-prints pp

Feng, X., Yu, H., Wu, M., Hu, S., Chen, J., Zhu, C., Wu, J., Chu, X., Huang, K.: Narrlv: Towards a comprehensive narrative-centric evaluation for long video generation models. arXiv e-prints pp. arXiv–2507 (2025)

2025

[12] [12]

arXiv preprint arXiv:2508.18621 (2025)

Gao, X., Hu, L., Hu, S., Huang, M., Ji, C., Meng, D., Qi, J., Qiao, P., Shen, Z., Song, Y., et al.: Wan-s2v: Audio-driven cinematic video generation. arXiv preprint arXiv:2508.18621 (2025)

work page arXiv 2025

[13] [13]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Gong, K., Lian, D., Chang, H., Guo, C., Jiang, Z., Zuo, X., Mi, M.B., Wang, X.: Tm2d: Bimodality driven 3d dance generation via music-text integration. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9942–9952 (2023)

2023

[14] [14]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Hong, S., Kemelmacher-Shlizerman, I., Curless, B., Seitz, S.M.: Musicinfuser: Mak- ing video diffusion listen and dance. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 43751–43761 (June 2026)

2026

[15] [15]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Hu, L.: Animate anyone: Consistent and controllable image-to-video synthesis for character animation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8153–8163 (2024)

2024

[16] [16]

Embedding-perturbed Exploration Preference Optimization for Flow Models

Hu, S., Chen, C., Zhu, J., Wu, J., Chu, X., Li, X.: Embedding-perturbed explo- ration preference optimization for flow models. arXiv preprint arXiv:2605.15803 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[17] [17]

Jia, T., Yang, K., Yang, X., Tang, X., Qiu, K., Wei, S., Zhao, Y.: Bitdiff: Fine- grained3dconductingmotiongenerationviabimamba-transformerdiffusion.arXiv preprint arXiv:2604.04395 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[18] [18]

Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., Wu, K., Lin, Q., Yuan, J., Long, Y., Wang, A., Wang, A., Li, C., Huang, D., Yang, F., Tan, H., Wang, H., Song, J., Bai, J., Wu, J., Xue, J., Wang, OmniDance 17 J., Wang, K., Liu, M., Li, P., Li, S., Wang, W., Yu, W., Deng, X., Li, Y., Chen, Y., Cui, Y., Peng, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[19] [19]

arXiv preprint arXiv:2510.12586 (2025)

Lei, J., Liu, K., Berner, J., Yu, H., Zheng, H., Wu, J., Chu, X.: There is no vae: End-to-end pixel-space generative modeling via self-supervised pre-training. arXiv preprint arXiv:2510.12586 (2025)

work page arXiv 2025

[20] [20]

In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision

Li, H., Wang, Y., Huang, T., Huang, H., Wang, H., Chu, X.: Ld-rps: Zero-shot unified image restoration via latent diffusion recurrent posterior sampling. In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13684–13694 (2025)

2025

[21] [21]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Li, K., Wang, Y., He, Y., Li, Y., Wang, Y., Liu, Y., Wang, Z., Xu, J., Chen, G., Luo, P., et al.: Mvbench: A comprehensive multi-modal video understanding benchmark. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22195–22206 (2024)

2024

[22] [22]

In: Proceed- ings of the IEEE/CVF International Conference on Computer Vision

Li, R., Zhao, J., Zhang, Y., Su, M., Ren, Z., Zhang, H., Tang, Y., Li, X.: Finedance: A fine-grained choreography dataset for 3d full body dance generation. In: Proceed- ings of the IEEE/CVF International Conference on Computer Vision. pp. 10234– 10243 (2023)

2023

[23] [23]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Li, R., Yang, S., Ross, D.A., Kanazawa, A.: Ai choreographer: Music conditioned 3d dance generation with aist++. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13401–13412 (2021)

2021

[24] [24]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Ling, X., Zhu, C., Wu, M., Li, H., Feng, X., Yang, C., Hao, A., Zhu, J., Wu, J., Chu, X.: Vmbench: A benchmark for perception-aligned video motion generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13087–13098 (2025)

2025

[25] [25]

arXiv preprint arXiv:2507.03905 (2025)

Meng, R., Wang, Y., Wu, W., Zheng, R., Li, Y., Ma, C.: Echomimicv3: 1.3 b pa- rameters are all you need for unified multi-modal and multi-task human animation. arXiv preprint arXiv:2507.03905 (2025)

work page arXiv 2025

[26] [26]

In: Proceedings of the 33rd ACM International Conference on Multimedia

Park, N.T.: M2pe-diff: Music-to-pose encoder for dance video generation lever- aging latent diffusion framework. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 2419–2428 (2025)

2025

[27] [27]

arXiv preprint arXiv:2512.19546 (2025)

Peng, Z., Chen, Y., Ma, Y., Zhang, G., Sun, Z., Zhou, Z., Zhang, Y., Zhou, Z., Fan, Z., Liu, H., et al.: Actavatar: Temporally-aware precise action control for talking avatars. arXiv preprint arXiv:2512.19546 (2025)

work page arXiv 2025

[28] [28]

arXiv preprint arXiv:2506.14742 (2025)

Peng, Z., Hu, W., Ma, J., Zhu, X., Zhang, X., Zhao, H., Tian, H., He, J., Liu, H., Fan, Z.: Synctalk++: High-fidelity and efficient synchronized talking heads synthesis using gaussian splatting. arXiv preprint arXiv:2506.14742 (2025)

work page arXiv 2025

[29] [29]

In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion

Peng, Z., Hu, W., Shi, Y., Zhu, X., Zhang, X., Zhao, H., He, J., Liu, H., Fan, Z.: Synctalk: The devil is in the synchronization for talking head synthesis. In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion. pp. 666–676 (2024)

2024

[30] [30]

arXiv preprint arXiv:2505.21448 (2025)

Peng, Z., Liu, J., Zhang, H., Liu, X., Tang, S., Wan, P., Zhang, D., Liu, H., He, J.: Omnisync: Towards universal lip synchronization via diffusion transformers. arXiv preprint arXiv:2505.21448 (2025)

work page arXiv 2025

[31] [31]

In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Siyao, L., Yu, W., Gu, T., Lin, C., Wang, Q., Qian, C., Loy, C.C., Liu, Z.: Bailando: 3d dance generation by actor-critic gpt with choreographic memory. In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11050–11059 (2022) 18 K. Yang et al

2022

[32] [32]

arXiv preprint arXiv:2410.10306 (2024)

Tan, S., Gong, B., Wang, X., Zhang, S., Zheng, D., Zheng, R., Zheng, K., Chen, J., Yang, M.: Animate-x: Universal character image animation with enhanced motion representation. arXiv preprint arXiv:2410.10306 (2024)

work page arXiv 2024

[33] [33]

IEEE transactions on pattern anal- ysis and machine intelligence (2025)

Tang, H., Shao, L., Zhang, Z., Van Gool, L., Sebe, N.: Spatial-temporal graph mamba for music-guided dance video synthesis. IEEE transactions on pattern anal- ysis and machine intelligence (2025)

2025

[34] [34]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Tseng, J., Castellon, R., Liu, K.: Edge: Editable dance generation from music. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 448–458 (2023)

2023

[35] [35]

Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., Wang, J., Zhang, J., Zhou, J., Wang, J., Chen, J., Zhu, K., Zhao, K., Yan, K., Huang, L., Feng, M., Zhang, N., Li, P., Wu, P., Chu, R., Feng, R., Zhang, S., Sun, S., Fang, T., Wang, T., Gui, T., Weng, T., Shen, T., Lin, W., Wang, W., Wang, W., Zhou, W.,...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[36] [36]

In: Proceedings of the 33rd ACM International Conference on Multimedia

Wang, X., Wang, H., Cai, W.: Choreomuse: Robust music-to-dance video gener- ation with style transfer and beat-adherent motion. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 7912–7921 (2025)

2025

[37] [37]

In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

Wang, X., Wang, H., Liu, D., Cai, W.: Dance any beat: Blending beats with visuals in dance video generation. In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 5136–5146. IEEE (2025)

2025

[38] [38]

Advances in neural information processing systems35, 38571–38584 (2022)

Xu, Y., Zhang, J., Zhang, Q., Tao, D.: Vitpose: Simple vision transformer baselines for human pose estimation. Advances in neural information processing systems35, 38571–38584 (2022)

2022

[39] [39]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Xu, Z., Zhang, J., Liew, J.H., Yan, H., Liu, J.W., Zhang, C., Feng, J., Shou, M.Z.: Magicanimate: Temporally consistent human image animation using diffu- sion model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1481–1490 (2024)

2024

[40] [40]

In: Proceedings of the 2024 International Conference on Multimedia Retrieval

Yang, K., Tang, X., Diao, R., Liu, H., He, J., Fan, Z.: Codancers: Music-driven coherent group dance generation with choreographic unit. In: Proceedings of the 2024 International Conference on Multimedia Retrieval. pp. 675–683 (2024)

2024

[41] [41]

arXiv preprint arXiv:2505.14222 (2025)

Yang, K., Tang, X., Hu, Y., Yang, J., Liu, H., Zhang, Q., He, J., Fan, Z.: Match- dance: Collaborative mamba-transformer architecture matching for high-quality 3d dance synthesis. arXiv preprint arXiv:2505.14222 (2025)

work page arXiv 2025

[42] [42]

arXiv preprint arXiv:2505.17543 (2025)

Yang, K., Tang, X., Peng, Z., Hu, Y., He, J., Liu, H.: Megadance: Mixture- of-experts architecture for genre-aware 3d dance generation. arXiv preprint arXiv:2505.17543 (2025)

work page arXiv 2025

[43] [43]

FlowerDance: MeanFlow for Efficient and Refined 3D Dance Generation

Yang, K., Tang, X., Peng, Z., Zhang, X., Wang, P., He, J., Liu, H.: Flower- dance: Meanflow for efficient and refined 3d dance generation. arXiv preprint arXiv:2511.21029 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[44] [44]

arXiv preprint arXiv:2412.19123 (2024)

Yang, K., Tang, X., Wu, H., Xue, Q., Qin, B., Liu, H., Fan, Z.: Cohedancers: Enhancing interactive group dance generation through music-driven coherence de- composition. arXiv preprint arXiv:2412.19123 (2024)

work page arXiv 2024

[45] [45]

In: Proceedings of the 2024 International Conference on Multimedia Retrieval

Yang, K., Zhou, X., Tang, X., Diao, R., Liu, H., He, J., Fan, Z.: Beatdance: A beat- based model-agnostic contrastive learning framework for music-dance retrieval. In: Proceedings of the 2024 International Conference on Multimedia Retrieval. pp. 11–19 (2024) OmniDance 19

2024

[46] [46]

MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation

Yang, K., Zhu, J., Tang, X., Peng, Z., Zhang, X., Wang, P., Wu, J., Chu, X., Liu, H., He, J.: Mace-dance: Motion-appearance cascaded experts for music-driven dance video generation. arXiv preprint arXiv:2512.18181 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[47] [47]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[48] [48]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Yang, Z., Yang, K., Tang, X.: Tokendance: Token-to-token music-to-dance gener- ation with bidirectional mamba. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5520–5529 (2026)

2026

[49] [49]

PersonaGesture: Single-Reference Co-Speech Gesture Personalization for Unseen Speakers

Zhang, X., Cai, Y., Li, K., Yang, K., Zhou, Y., Li, Z., Chu, X., Zhang, J., Liu, H.: Personagesture: Single-reference co-speech gesture personalization for unseen speakers. arXiv preprint arXiv:2605.06064 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[50] [50]

IEEE Transactions on Circuits and Systems for Video Technology (2025)

Zhang, X., Jia, Y., Zhang, J., Yang, Y., Tu, Z.: Robust 2d skeleton action recogni- tion via decoupling and distilling 3d latent features. IEEE Transactions on Circuits and Systems for Video Technology (2025)

2025

[51] [51]

In: Pro- ceedings of the AAAIConference on Artificial Intelligence.vol

Zhang, X., Li, J., Ren, J., Zhang, J.: Mitigating error accumulation in co-speech motion generation via global rotation diffusion and multi-level constraints. In: Pro- ceedings of the AAAIConference on Artificial Intelligence.vol. 40, pp. 12834–12842 (2026)

2026

[52] [52]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Zhang, X., Li, J., Zhang, J., Dang, Z., Ren, J., Bo, L., Tu, Z.: Semtalk: Holistic co-speech motion generation with frame-level semantic emphasis. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13761–13771 (2025)

2025

[53] [53]

In: Pro- ceedings of the 33rd ACM International Conference on Multimedia

Zhang, X., Li, J., Zhang, J., Ren, J., Bo, L., Tu, Z.: Echomask: Speech-queried attention-based mask modeling for holistic co-speech motion generation. In: Pro- ceedings of the 33rd ACM International Conference on Multimedia. pp. 10827– 10836 (2025)

2025

[54] [54]

arXiv preprint arXiv:2603.29655 (2026)

Zhou, P., Zhang, X., Shen, X., Hu, Y.: Not all frames are equal: Complexity- aware masked motion generation via motion spectral descriptors. arXiv preprint arXiv:2603.29655 (2026)

work page arXiv 2026