MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation

Hongyan Liu; Jiahong Wu; Jiashu Zhu; Jun He; Kaixing Yang; Puwei Wang; Xiangxiang Chu; Xiangyue Zhang; Xulong Tang; Ziqiao Peng

arxiv: 2512.18181 · v3 · submitted 2025-12-20 · 💻 cs.CV


Pith reviewed 2026-05-16 20:26 UTC · model grok-4.3

classification 💻 cs.CV

keywords music-driven dance generation · cascaded experts · 3D motion generation · pose-driven video synthesis · diffusion models · dance video generation · mixture of experts


The pith

Cascaded motion and appearance experts generate realistic dance videos directly from music.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MACE-Dance as a two-stage system that first converts music into kinematically plausible 3D dance motions and then renders those motions into video frames while preserving a reference dancer's appearance. This separation lets each stage specialize without the compromises that arise in single-model approaches trying to handle both motion and visuals at once. The Motion Expert relies on a diffusion model with a hybrid BiMamba-Transformer backbone and a guidance-free training trick to reach top results on 3D dance tasks, while the Appearance Expert applies separate kinematic and aesthetic fine-tuning to maintain identity and temporal smoothness. The authors also release a new large-scale dataset and a motion-appearance evaluation protocol, on which the full pipeline sets new state-of-the-art numbers. A reader would care because the method directly targets the practical gap between generating believable movement and producing visually coherent video for dance content.

Core claim

MACE-Dance decomposes music-driven dance video generation into cascaded Mixture-of-Experts stages: the Motion Expert produces 3D motions from audio using a diffusion model equipped with BiMamba-Transformer architecture and Guidance-Free Training to enforce both kinematic validity and artistic quality, while the Appearance Expert synthesizes the final video by conditioning on the generated motions and a reference image through decoupled kinematic-aesthetic fine-tuning that keeps visual identity and spatiotemporal coherence intact.

What carries the argument

The cascaded Motion-Appearance Experts, in which the Motion Expert performs music-to-3D-motion diffusion and the Appearance Expert performs motion- and reference-conditioned video synthesis.
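
The two-stage hand-off described above can be sketched as a pair of composed functions. This is a minimal illustration of the data flow only: the names `run_cascade`, `toy_motion_expert`, and `toy_appearance_expert`, the list-based data layout, and the toy arithmetic are all assumptions made for this sketch and are not taken from the paper or its code release.

```python
from typing import Callable, List, Tuple

Pose = List[float]    # one frame of 3D motion, flattened joint values (layout is assumed)
Frame = List[float]   # one video frame, stood in for by a short list of numbers


def run_cascade(
    music: List[float],
    reference: Frame,
    motion_expert: Callable[[List[float]], List[Pose]],
    appearance_expert: Callable[[List[Pose], Frame], List[Frame]],
) -> Tuple[List[Pose], List[Frame]]:
    """Stage 1 maps music to motion; stage 2 maps motion plus reference to video."""
    motion = motion_expert(music)
    video = appearance_expert(motion, reference)
    if len(video) != len(motion):
        raise ValueError("appearance stage must emit one frame per pose")
    return motion, video


# Toy stand-ins: NOT the paper's models, only placeholders with compatible shapes.
def toy_motion_expert(music: List[float]) -> List[Pose]:
    # One 2-number "pose" per music feature.
    return [[m, -m] for m in music]


def toy_appearance_expert(motion: List[Pose], reference: Frame) -> List[Frame]:
    # Shift the reference "pixels" by the first pose value of each frame.
    return [[p + pose[0] for p in reference] for pose in motion]


motion, video = run_cascade(
    [0.0, 0.5, 1.0], [1.0, 2.0], toy_motion_expert, toy_appearance_expert
)
```

Because the only contract between the stages is the motion sequence, either expert can be swapped or evaluated in isolation, which is the property the referee report later probes.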

If this is right

3D dance generation improves in both physical realism and stylistic variety when diffusion models are trained without classifier guidance.
Pose-driven image animation reaches higher visual fidelity when kinematic and aesthetic objectives are optimized separately.
A dedicated motion-appearance benchmark dataset enables consistent comparison across future music-to-video methods.
The same cascaded structure can be reused for related tasks such as music-driven character animation in virtual environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The split could reduce compute during inference if the Motion Expert runs at lower frame rate than the video renderer.
Extending the Appearance Expert to handle multiple reference images might improve robustness to changing outfits or lighting.
The BiMamba component may offer efficiency gains when applied to other long-sequence motion tasks beyond dance.
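
The first extension above (running the Motion Expert at a lower frame rate) implies an upsampling step before rendering. The sketch below shows one simple option, linear interpolation between consecutive poses. It is an assumption of this note, not something the paper describes, and it treats poses as plain position vectors; joint rotations would usually need spherical interpolation instead.

```python
from typing import List


def upsample_motion(poses: List[List[float]], factor: int) -> List[List[float]]:
    """Linearly interpolate between consecutive poses, adding factor - 1 frames per gap."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    if len(poses) < 2:
        return [list(p) for p in poses]
    out = []
    for a, b in zip(poses, poses[1:]):
        for k in range(factor):
            t = k / factor  # t = 0 reproduces pose a exactly
            out.append([x + t * (y - x) for x, y in zip(a, b)])
    out.append(list(poses[-1]))  # keep the final keyframe
    return out


# Three low-rate poses become five frames at twice the rate.
dense = upsample_motion([[0.0, 0.0], [1.0, 2.0], [2.0, 0.0]], factor=2)
```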

Load-bearing premise

The two experts will combine without introducing noticeable motion-appearance mismatches or coherence loss in the final video.

What would settle it

Side-by-side videos in which the cascaded output shows visible jitter, identity drift, or foot-sliding artifacts that an end-to-end baseline avoids.
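
Two of the artifacts named above, foot sliding and jitter, can also be approximated numerically rather than judged only by eye. The sketch below uses common heuristics: horizontal foot displacement while the foot is near the ground, and the mean absolute second difference of a trajectory. The function names, the `contact_height` threshold, and the convention that `y` is the height axis are assumptions of this sketch, not the paper's evaluation protocol.

```python
from math import hypot
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, z) with y as height; axis convention is assumed


def foot_slide(foot: List[Point], contact_height: float = 0.05) -> float:
    """Mean horizontal displacement per step while the foot stays below contact_height."""
    slides = []
    for (x0, y0, z0), (x1, y1, z1) in zip(foot, foot[1:]):
        if y0 < contact_height and y1 < contact_height:
            slides.append(hypot(x1 - x0, z1 - z0))
    return sum(slides) / len(slides) if slides else 0.0


def jitter(track: List[float]) -> float:
    """Mean absolute second difference of a 1D trajectory; higher means shakier."""
    acc = [abs(c - 2 * b + a) for a, b, c in zip(track, track[1:], track[2:])]
    return sum(acc) / len(acc) if acc else 0.0


# A planted foot that then lifts, versus a foot that moves while on the ground.
planted = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.3, 0.0)]
sliding = [(0.0, 0.0, 0.0), (0.3, 0.0, 0.4), (0.3, 0.3, 0.4)]
```

Computing both scores for the cascaded output and for an end-to-end baseline on the same music would turn the side-by-side comparison into numbers that can be reported.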

Figures

Figures reproduced from arXiv: 2512.18181 by Hongyan Liu, Jiahong Wu, Jiashu Zhu, Jun He, Kaixing Yang, Puwei Wang, Xiangxiang Chu, Xiangyue Zhang, Xulong Tang, Ziqiao Peng.

**Figure 2.** Overview of MACE-Dance. Leveraging the cascaded Mixture-of-Experts (MoE) design, the Motion Expert generates kinematically plausible and artistically expressive 3D motion 𝑋 conditioned on the music 𝑀, and the Appearance Expert then animates the reference image 𝐼 with the 3D motion 𝑋, yielding the dance video 𝐷 that exhibits spatiotemporally coherent appearance. Thanks to the Guidance-Free Training (GFT) str…

**Figure 3.** Qualitative comparison with SOTAs across reference image domains (real-person vs. anime-character) and music genres (Eastern Folk vs. Popping) in …

**Figure 4.** MACE-Dance generates high-quality dance videos across diverse dance genres. Adjacent paper text: "… EDGE [Tseng et al. 2023], Lodge [Li et al. 2024b], and MEGA [Yang et al. 2025b] on the FineDance dataset with metrics following [Li et al. 2024b]. As shown in Tab. 2, the Motion Expert attains overall state-of-the-art (SOTA) performance. Specifically, it achieves the best $\mathrm{FID}_k = 17.83$ and a competitive $\mathrm{FID}_g = 25.09$, indicating hig…"

**Figure 5.** Ablation for the model architecture of Motion Expert.

**Figure 6.** MACE-Dance produces coherent long-sequence dance videos. Labels captured alongside the caption: GT Pose, Wan-Animate, w.o. Aesthetic Stage, w.o. Kinematic Stage, Appearance Expert.

**Figure 8.** Comparison with general-purpose video foundation models.

**Figure 9.** User study results comparing our method with four baselines. The bar charts display the percentage of user preferences across six dimensions: Dance Synchronization (DS), Dance Quality (DQ), Dance Creativity (DC), Perceptual Quality (PQ), Temporal Consistency (TC), and Identity Consistency (IC). Our method (Ours) consistently achieves the highest preference rates across all motion and appearance metrics. …

**Figure 10.** Motion Expert in MACE-Dance can generate high-quality 3D motion with artistic expressiveness and physical plausibility. Adjacent paper text: "• Identity Consistency (IC): Maintenance of the subject’s identity relative to the reference image. 7.2 Result Analysis. The quantitative results of the user study are summarized in …"

**Figure 11.** The Appearance Expert in MACE-Dance drives image-based dancing with spatiotemporally coherent appearance. Adjacent paper text: "8 Qualitative Analysis. 8.1 Music-Driven Dance Video Generation. We further provide additional qualitative comparisons across diverse reference-image domains and music genres to complement the evaluations in the main paper Sec. 4.3.1. As illustrated in …"

**Figure 12.** More cases of qualitative comparison with SOTAs in the music-driven dance video generation task.

**Figure 13.** Visualization for motion editing. From top to bottom, the first row …

**Figure 14.** MACE-Dance produces long-sequence dance videos with artistic expressiveness and physical plausibility.
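
The FID values quoted in the text captured with Figure 4 are Fréchet distances between feature distributions of real and generated motion. Assuming the standard Gaussian form of the Fréchet distance (the feature extractors behind the $k$ and $g$ subscripts are not specified in this page; in prior 3D dance work they usually denote kinetic and geometric features), the score for real statistics $(\mu_r, \Sigma_r)$ and generated statistics $(\mu_g, \Sigma_g)$ is:

```latex
\mathrm{FID} \;=\; \lVert \mu_r - \mu_g \rVert_2^2
\;+\; \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,\bigl(\Sigma_r \Sigma_g\bigr)^{1/2} \right)
```

Lower values mean the generated feature distribution sits closer to the real one, which is why the quoted $\mathrm{FID}_k = 17.83$ is described as the best result.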

Original abstract

With the rise of online dance-video platforms and rapid advances in AI-generated content (AIGC), music-driven dance generation has emerged as a compelling research direction. Despite substantial progress in related domains such as music-driven 3D dance generation, pose-driven image animation, and audio-driven talking-head synthesis, existing methods cannot be directly adapted to this task. Moreover, the limited studies in this area still struggle to jointly achieve high-quality visual appearance and realistic human motion. Accordingly, we present MACE-Dance, a music-driven dance video generation framework with cascaded Mixture-of-Experts (MoE). The Motion Expert performs music-to-3D motion generation while enforcing kinematic plausibility and artistic expressiveness, whereas the Appearance Expert carries out motion- and reference-conditioned video synthesis, preserving visual identity with spatiotemporal coherence. Specifically, the Motion Expert adopts a diffusion model with a BiMamba-Transformer hybrid architecture and a Guidance-Free Training (GFT) strategy, achieving state-of-the-art (SOTA) performance in 3D dance generation. The Appearance Expert employs a decoupled kinematic-aesthetic fine-tuning strategy, achieving state-of-the-art (SOTA) performance in pose-driven image animation. To better benchmark this task, we curate a large-scale and diverse dataset and design a motion-appearance evaluation protocol. Based on this protocol, MACE-Dance also achieves state-of-the-art performance. Code is available at https://github.com/AMAP-ML/MACE-Dance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MACE-Dance, a cascaded Mixture-of-Experts framework for music-driven dance video generation. The Motion Expert generates 3D dance motions from music via a diffusion model with BiMamba-Transformer hybrid architecture and Guidance-Free Training (GFT), claiming SOTA on 3D dance generation. The Appearance Expert performs motion- and reference-conditioned video synthesis via decoupled kinematic-aesthetic fine-tuning, claiming SOTA on pose-driven image animation. A new large-scale diverse dataset and motion-appearance evaluation protocol are introduced, on which the full cascaded model also claims SOTA performance.

Significance. If the integration of independently trained experts succeeds without coherence loss, the modular separation of motion generation and appearance synthesis offers a practical path to higher-quality music-driven dance videos, with direct applications in AIGC platforms. The new dataset and protocol provide independent benchmarking value for the community. Separate SOTA results on the two sub-tasks strengthen the case for the individual components, but the joint pipeline performance is the load-bearing claim.

major comments (2)

[§4 and §3.2] §4 (Experiments) and §3.2 (Appearance Expert): the overall SOTA claim for the full video generation task is not supported by any reported metric quantifying the performance drop when the Appearance Expert receives model-generated motions instead of ground-truth motions. This ablation is required to validate that distributional shift between the two experts does not produce foot-skating, identity drift, or spatiotemporal artifacts.
[Abstract and §3.1] Abstract and §3.1 (Motion Expert): the kinematic plausibility and artistic expressiveness claims rest on the GFT strategy and BiMamba-Transformer, yet no quantitative comparison (e.g., foot-contact error, motion diversity scores) is supplied against the strongest baselines when the full cascade is evaluated end-to-end.
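
The ablation requested in the first major comment amounts to scoring the Appearance Expert twice, once on ground-truth motion and once on Motion Expert output, and reporting the change per metric. A minimal sketch of that bookkeeping follows; the function name `cascade_gap`, the metric names, and the input numbers are placeholders invented for illustration and are not results from the paper.

```python
from typing import Dict


def cascade_gap(with_gt: Dict[str, float], with_generated: Dict[str, float]) -> Dict[str, float]:
    """Relative change of each metric when generated motion replaces ground-truth motion."""
    gap = {}
    for name, base in with_gt.items():
        if base == 0:
            raise ValueError(f"baseline for {name} is zero; relative change is undefined")
        gap[name] = (with_generated[name] - base) / abs(base)
    return gap


# Made-up numbers purely to exercise the function; they are NOT results from the paper.
example = cascade_gap(
    {"metric_a": 10.0, "metric_b": 0.8},
    {"metric_a": 12.0, "metric_b": 0.6},
)
```

Whether a positive change is a degradation depends on the metric's direction (lower-is-better versus higher-is-better), so each reported gap needs that direction stated next to it.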

minor comments (2)

[§4.1] The dataset curation details (size, diversity metrics, train/test split) are referenced but not tabulated; adding a summary table would improve reproducibility.
[§3.3] Notation for the cascaded inference pipeline (how motion latents are passed to the Appearance Expert) is described in prose but would benefit from an explicit diagram or pseudocode block.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We agree that additional ablations are needed to strengthen the end-to-end claims and will revise the manuscript accordingly.

Point-by-point responses

Referee: [§4 and §3.2] §4 (Experiments) and §3.2 (Appearance Expert): the overall SOTA claim for the full video generation task is not supported by any reported metric quantifying the performance drop when the Appearance Expert receives model-generated motions instead of ground-truth motions. This ablation is required to validate that distributional shift between the two experts does not produce foot-skating, identity drift, or spatiotemporal artifacts.

Authors: We agree this ablation is important to validate the cascaded pipeline. In the revised manuscript, we will add a dedicated ablation in §4 comparing Appearance Expert outputs conditioned on ground-truth motions versus Motion Expert-generated motions. Metrics will include foot-contact error for skating, face similarity for identity drift, and temporal consistency scores for artifacts, directly quantifying any performance drop due to distributional shift.

Revision: yes

Referee: [Abstract and §3.1] Abstract and §3.1 (Motion Expert): the kinematic plausibility and artistic expressiveness claims rest on the GFT strategy and BiMamba-Transformer, yet no quantitative comparison (e.g., foot-contact error, motion diversity scores) is supplied against the strongest baselines when the full cascade is evaluated end-to-end.

Authors: The Motion Expert's SOTA results in §3.1 and §4 already include kinematic metrics against baselines for the isolated 3D generation task. To address the full-cascade request, we will extend the motion-appearance evaluation protocol in the revision to report foot-contact error and motion diversity scores for the complete MACE-Dance pipeline versus end-to-end baselines, confirming preservation of kinematic plausibility.

Revision: yes

Circularity Check

0 steps flagged

No circularity: derivation relies on new dataset, protocol, and independent expert training

full rationale

The paper presents a cascaded MoE architecture with separately trained Motion Expert (diffusion + BiMamba-Transformer + GFT) and Appearance Expert (decoupled kinematic-aesthetic fine-tuning), supported by a newly curated large-scale dataset and a custom motion-appearance evaluation protocol. These elements provide external grounding for the SOTA claims rather than reducing any prediction or uniqueness result to fitted inputs or self-citations by construction. No equations or steps in the provided text exhibit self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations that collapse the central claim.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities beyond reliance on standard diffusion models and MoE concepts; no ad-hoc inventions or fitted constants are named.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: "MACE-Dance, a music-driven dance video generation framework with cascaded Mixture-of-Experts (MoE). The Motion Expert performs music-to-3D motion generation while enforcing kinematic plausibility... The Appearance Expert carries out motion- and reference-conditioned video synthesis"

IndisputableMonolith/Cost/FunctionalEquation.lean · J_uniquely_calibrated_via_higher_derivative · unclear

Paper passage: "Motion Expert adopts Diffusion Model with BiMamba-Transformer hybrid architecture and Guidance-Free Training (GFT) strategy"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Interactive Multi-Turn Retrieval for Health Videos
cs.IR · 2026-05 · unverdicted · novelty 6.0

DATR combines coarse CLIP-based retrieval with multi-turn query fusion and cross-encoder re-ranking to improve health video retrieval, supported by the new MHVRC corpus.

CustomDancer: Customized Dance Recommendation by Text-Dance Retrieval
cs.MM · 2026-05 · unverdicted · novelty 6.0

CustomDancer achieves state-of-the-art text-to-dance retrieval with 10.23% Recall@1 on the new TD-Data dataset by aligning text, music, and motion features through a CLIP-based framework.

MG-Former: A Transformer-Based Framework for Music-Driven 3D Conducting Gesture Generation
cs.SD · 2026-05 · unverdicted · novelty 5.0

TransConductor generates 3D conducting gestures from music via a Trans-Temporal Music Encoder and Gesture Decoder, outperforming baselines on retrieval-based alignment metrics with a new ConductorMotion dataset.

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages · cited by 3 Pith papers

[1] Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen

EchoMimicV3: 1.3B parameters are all you need for unified multi-modal and multi-task human animation. arXiv preprint arXiv:2507.03905 (2025). Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. 2023. Finite scalar quantization: VQ-VAE made simple. arXiv preprint arXiv:2309.15505 (2023). Ziqiao Peng, Yi Chen, Yifeng Ma, Guozhen Zhang, Zhiy...

work page · arXiv · 2025

[2] Animate-X: Universal character image animation with enhanced motion representation. arXiv preprint arXiv:2410.10306, 2024

FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Li Siyao, Weijiang Yu, Tianpei Gu, Chunze Lin, Quan Wang, Chen Qian, Chen Change Loy, and Ziwei Liu. 2022. Bailando: 3D dance generation by actor-critic GPT with choreographic memory. In Proceedings of the IEEE/CVF Conference ...

work page · arXiv · 2022
