Pith · machine review for the scientific record

arXiv:2604.18648 · v2 · submitted 2026-04-20 · 💻 cs.CV · cs.AI

Recognition: unknown

DanceCrafter: Fine-Grained Text-Driven Controllable Dance Generation via Choreographic Syntax

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:52 UTC · model grok-4.3

classification: 💻 cs.CV · cs.AI
keywords: text-driven dance generation · choreographic syntax · motion transformer · human motion synthesis · controllable generation · dance dataset · anatomy-aware modeling

The pith

DanceCrafter generates complex dance sequences from text by using a structured Choreographic Syntax framework and a new large-scale dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that text can drive fine-grained, stable, and natural dance generation if choreographies are first organized through a formal syntax drawn from dance studies, anatomy, and biomechanics. It builds DanceFlow, a 41-hour dataset of motion-capture sequences paired with over six million words of detailed textual descriptions, then trains DanceCrafter, a motion transformer equipped with a continuous manifold representation, hybrid normalization, and an anatomy-aware loss. A sympathetic reader would care because this combination removes the need for manual keyframing or broad motion priors and instead lets users specify precise body-part actions through ordinary language while preserving physical plausibility.
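
A note on machinery: the review never says what the "continuous manifold motion representation" actually is. As a concrete point of reference, the sketch below shows the standard continuous 6D rotation parameterization (the first two columns of each joint's rotation matrix, re-orthogonalized on decode), which is the usual device motion models use to avoid the discontinuities of Euler angles and quaternions; the function names are hypothetical and the paper's construction may differ.

```python
import numpy as np

def rotmat_to_6d(R: np.ndarray) -> np.ndarray:
    """Keep the first two columns of a 3x3 rotation matrix as a
    continuous 6D encoding (column 0 first, then column 1)."""
    return R[:, :2].T.reshape(6)

def sixd_to_rotmat(x: np.ndarray) -> np.ndarray:
    """Recover a valid rotation matrix from a (possibly noisy)
    6D vector via Gram-Schmidt orthogonalization."""
    a, b = x[:3], x[3:]
    e1 = a / np.linalg.norm(a)
    b = b - np.dot(e1, b) * e1      # strip the component along e1
    e2 = b / np.linalg.norm(b)
    e3 = np.cross(e1, e2)           # right-handed third axis
    return np.stack([e1, e2, e3], axis=1)

R = np.eye(3)
assert np.allclose(sixd_to_rotmat(rotmat_to_6d(R)), R)
```

Because any 6D vector decodes to a valid rotation, small prediction errors stay on the manifold instead of producing broken poses, which is plausibly what lets the model avoid heavy post-processing.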

Core claim

Grounded in Choreographic Syntax and the DanceFlow dataset, the DanceCrafter model, built on the Momentum Human Rig, produces high-fidelity complex dance sequences from text by employing a continuous manifold motion representation, hybrid normalization to stabilize training, and an anatomy-aware loss that regulates the decoupled movements of individual body parts, achieving state-of-the-art results in motion quality, controllability, and naturalness.
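
The form of the anatomy-aware loss is not given here. A minimal sketch, assuming it decomposes reconstruction error over anatomical joint groups so that a well-fit torso cannot average away a badly reconstructed limb, might look like the following; the joint grouping and weights are invented for illustration.

```python
import torch

# Hypothetical joint groups for a 22-joint skeleton; the actual
# Momentum Human Rig grouping is not specified in this review.
BODY_PARTS = {
    "torso":     [0, 3, 6, 9, 12, 15],
    "left_arm":  [13, 16, 18, 20],
    "right_arm": [14, 17, 19, 21],
    "left_leg":  [1, 4, 7, 10],
    "right_leg": [2, 5, 8, 11],
}
PART_WEIGHTS = {name: 1.0 for name in BODY_PARTS}  # tunable per part

def anatomy_aware_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Sum of per-body-part mean-squared errors over joint positions.

    pred, target: (batch, frames, joints, 3).
    Averaging inside each part, then summing across parts, keeps the
    gradient signal for each limb separate, one plausible reading of
    regulating the decoupled movements of individual body parts.
    """
    loss = pred.new_zeros(())
    for part, idx in BODY_PARTS.items():
        diff = pred[..., idx, :] - target[..., idx, :]
        loss = loss + PART_WEIGHTS[part] * diff.pow(2).mean()
    return loss
```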

What carries the argument

Choreographic Syntax, a theoretical framework and annotation system that decomposes dance into spatial dynamics, directional constraints, and decoupled body-part actions, which then guides both dataset construction and the anatomy-aware components of the DanceCrafter motion transformer.
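
To make that decomposition concrete, one possible shape for a single syntax annotation record is sketched below, using dimensions the framework names (body-part actions, direction, spatial trajectory, dynamics) and vocabularies echoing the paper's annotation standard (camera-perspective directions, center-of-gravity states, four speed categories). Field names are hypothetical, not the paper's schema.

```python
from dataclasses import dataclass, field

# Illustrative vocabularies; the paper's annotation standard is richer.
DIRECTIONS = {"up", "down", "left", "right", "front", "back",
              "diagonal-up", "diagonal-down"}
SPEEDS = {"instantaneous", "brief", "moderate", "prolonged"}

@dataclass
class BodyPartAction:
    part: str        # e.g. "left_arm", "head", "eyes"
    movement: str    # verb phrase, e.g. "extends overhead"
    direction: str   # one of DIRECTIONS, camera perspective

@dataclass
class ChoreoAnnotation:
    """One plausible record for a syntax-guided dance description."""
    actions: list[BodyPartAction] = field(default_factory=list)
    center_of_gravity: str = "maintain"  # e.g. "shift", "airborne"
    trajectory: str = ""                 # spatial path of the movement
    speed: str = "moderate"              # one of SPEEDS
```

Structured records of this kind are what would let the same prompt language drive both dataset annotation and the anatomy-aware components of the model.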

If this is right

  • Users can specify detailed actions for separate body parts through text and receive physically coherent output.
  • Professional dance archives and motion-capture recordings become directly usable for training controllable generators.
  • The same syntax-based annotation approach can scale to longer sequences without the optimization instabilities seen in earlier models.
  • Generated dances maintain consistency across decoupled limbs while following directional cues in the prompt.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The syntax and dataset could be reused to train models for other structured movement domains such as gymnastics routines or sign-language sequences.
  • Real-time interactive systems might combine this generator with live text input to let choreographers iterate on dance phrases without mocap suits.
  • The continuous manifold representation may reduce the need for post-processing cleanup steps that current diffusion-based motion models often require.

Load-bearing premise

That dance can be sufficiently described and controlled by a syntax derived from dance studies, human anatomy, and biomechanics without omitting essential artistic or improvisational elements.

What would settle it

Head-to-head quantitative or user-study results on held-out complex choreographies: the claim would fall if DanceCrafter scored below prior text-to-motion methods on motion quality, text alignment for specific body-part instructions, or perceived naturalness.

Figures

Figures reproduced from arXiv:2604.18648 by Christina Dan Wang, Cong Huang, Fei Xu, Hang Yuan, Kai Chen, Menglin Gao, Qing Li, Wenzhe Yu, Xiaolin Hu, Yan Wan, Zhou Yu.

Figure 1: DanceCrafter enables fine-grained text-driven generation of 3D dance motions and expressive 2D videos. We construct …
Figure 2: Overview of the dance category composition in our dataset. The figure summarizes the major dance categories and …
Figure 3: The Space and Orientation dimensions of our Choreographic Syntax …
Figure 4: Overview of the DanceCrafter framework. (Left) Training Flow: Native MHR parameters are converted to a continuous …
Figure 5: Qualitative comparison against baseline methods.
Figure 6: User study results for the main experiments.
Figure 7: Overview of our professional motion capture recording for the DanceFlow dataset.
Figure 8: Detailed visualization of the motion capture data processing and 3D reconstruction results.
Figure 9: Overview of our specialized annotation interface used by domain experts to audit and refine the DanceFlow dataset.
Figure 10: Representative 3D motion excerpts and fine-grained descriptions of Ballet, Breaking, Contemporary, and Spanish …
Figure 11: Additional dance genre examples including Modern, Dunhuang, Shenyun, and Yangge, showcasing the diversity of …
Figure 12: Animation workflow and generation examples. Given a choreographic description specified with our Choreographic …
Figure 13: Distribution of expert quality scores over 100 ran…
original abstract

Text-driven controllable dance generation remains under-explored, primarily due to the severe scarcity of high-quality datasets and the inherent difficulty of articulating complex choreographies. Characterizing dance is particularly challenging owing to its intricate spatial dynamics, strong directionality, and the highly decoupled movements of distinct body parts. To overcome these bottlenecks, we bridge principles from dance studies, human anatomy, and biomechanics to propose Choreographic Syntax, a novel theoretical framework with a tailored annotation system. Grounded in this syntax, we combine professional dance archives with high-fidelity motion capture data to construct DanceFlow, the most fine-grained dance dataset to date. It encompasses 41 hours of high-quality motions paired with 6.34 million words of detailed descriptions. At the model level, we introduce DanceCrafter, a tailored motion transformer built upon the Momentum Human Rig. To circumvent optimization instabilities, we construct a continuous manifold motion representation paired with a hybrid normalization strategy. Furthermore, we design an anatomy-aware loss to explicitly regulate the decoupled nature of body parts. Together, these adaptations empower DanceCrafter to achieve the high-fidelity and stable generation of complex dance sequences. Extensive evaluations and user studies demonstrate our state-of-the-art performance in motion quality, fine-grained controllability, and generation naturalness.
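
The abstract names a "hybrid normalization strategy" without elaborating. Purely as a hedged illustration of what "hybrid" can mean in this setting, the sketch below z-scores unbounded channels (root translation, velocities) while passing manifold-valued rotation channels through untouched; nothing here is confirmed by the paper.

```python
import numpy as np

def hybrid_normalize(motion: np.ndarray, trans_dims: slice,
                     mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Z-score only the unbounded translation/velocity channels.

    motion: (frames, features). Rotation channels already live on a
    bounded manifold and are left as-is; one guess at a hybrid scheme.
    """
    out = motion.copy()
    out[:, trans_dims] = (motion[:, trans_dims] - mean) / (std + 1e-8)
    return out
```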

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces Choreographic Syntax, a theoretical framework grounded in dance studies, human anatomy, and biomechanics, along with a tailored annotation system. It constructs the DanceFlow dataset (41 hours of high-quality motions paired with 6.34 million words of detailed descriptions) and proposes DanceCrafter, a motion transformer built on the Momentum Human Rig that uses a continuous manifold motion representation, hybrid normalization strategy, and anatomy-aware loss to enable high-fidelity, stable, text-driven controllable generation of complex dance sequences, claiming SOTA performance in motion quality, fine-grained controllability, and naturalness.

Significance. If the results hold, this work would be significant for text-to-motion synthesis in computer vision by addressing dance-specific challenges (decoupled body parts, directionality, spatial dynamics) through an interdisciplinary framework and a large-scale fine-grained dataset. The scale of DanceFlow and the explicit adaptations for manifold representation and anatomy-aware regularization represent concrete strengths that could support more controllable applications in animation and VR.

major comments (2)
  1. [Abstract] The central SOTA claim for motion quality, controllability, and naturalness rests on Choreographic Syntax plus the anatomy-aware loss and manifold representation solving decoupling and instability, yet no ablation isolating the syntax (e.g., syntax-based vs. standard kinematic pose representations) is described to show it drives gains beyond dataset scale or transformer capacity.
  2. [Evaluation] The manuscript asserts that the syntax captures intricate spatial dynamics and highly decoupled movements, but without quantitative isolation experiments or comparisons in the evaluation sections, it remains unclear whether the framework itself, rather than other components, produces the reported stability and fidelity improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address the concerns about isolating the contribution of the Choreographic Syntax framework below.

point-by-point responses
  1. Referee: [Abstract] The central SOTA claim for motion quality, controllability, and naturalness rests on Choreographic Syntax plus the anatomy-aware loss and manifold representation solving decoupling and instability, yet no ablation isolating the syntax (e.g., syntax-based vs. standard kinematic pose representations) is described to show it drives gains beyond dataset scale or transformer capacity.

    Authors: We agree that an explicit ablation isolating the Choreographic Syntax would strengthen the claims. The syntax framework is integral to both the dataset construction (providing the annotation system for fine-grained descriptions) and the model design (guiding the anatomy-aware loss for decoupled movements). A full isolation would require creating a parallel dataset with standard kinematic annotations, which is beyond the scope of this work due to the significant annotation effort involved. However, our evaluations compare against state-of-the-art methods that rely on standard representations, showing superior performance attributable to our approach. In the revised version, we will include additional analysis and a partial ablation study on a subset of the data to better quantify the syntax's impact.
    revision: partial

  2. Referee: [Evaluation] The manuscript asserts that the syntax captures intricate spatial dynamics and highly decoupled movements, but without quantitative isolation experiments or comparisons in the evaluation sections, it remains unclear whether the framework itself, rather than other components, produces the reported stability and fidelity improvements.

    Authors: We appreciate this observation. While the paper presents ablations for the manifold representation, hybrid normalization, and anatomy-aware loss, the Choreographic Syntax underpins these components. To address this, we will expand the evaluation section with a discussion on how the syntax enables these adaptations and add quantitative comparisons where feasible, such as training variants with and without syntax-informed elements on the same data subset. This will help clarify the framework's specific contributions to stability and fidelity.
    revision: yes
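
The promised variants are easy to picture as a small configuration grid. A hypothetical sketch (flag names invented, not taken from the paper):

```python
# Hypothetical ablation grid for isolating the syntax's contribution.
ABLATIONS = {
    "full":            dict(syntax_text=True,  anatomy_loss=True,  hybrid_norm=True),
    "plain_text":      dict(syntax_text=False, anatomy_loss=True,  hybrid_norm=True),
    "no_anatomy_loss": dict(syntax_text=True,  anatomy_loss=False, hybrid_norm=True),
    "no_hybrid_norm":  dict(syntax_text=True,  anatomy_loss=True,  hybrid_norm=False),
}
# Trained on the same data subset, a gap between "full" and
# "plain_text" could be credited to the syntax-based annotations
# rather than to dataset scale or model capacity.
```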

Circularity Check

0 steps flagged

No circularity: new framework, dataset, and model components are independently constructed

full rationale

The paper proposes Choreographic Syntax as a novel theoretical framework explicitly grounded in external sources (dance studies, human anatomy, biomechanics), constructs the DanceFlow dataset from professional archives plus motion capture, and introduces DanceCrafter with new components (continuous manifold representation, hybrid normalization, anatomy-aware loss) built on the pre-existing Momentum Human Rig. Performance claims rest on evaluations and user studies rather than any reduction of outputs to fitted inputs or self-referential definitions. No self-citations appear as load-bearing for uniqueness theorems or ansatzes, and no equations or derivations collapse by construction to the inputs. The chain is self-contained with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the validity of the newly introduced Choreographic Syntax as an accurate characterization of dance and on the representativeness of the constructed DanceFlow dataset.

axioms (1)
  • domain assumption: Dance is characterized by intricate spatial dynamics, strong directionality, and highly decoupled movements of distinct body parts.
    Invoked in the abstract to explain the challenge and motivate the syntax.
invented entities (1)
  • Choreographic Syntax (no independent evidence)
    purpose: Theoretical framework with tailored annotation system for fine-grained dance description.
    Newly proposed in the paper as bridging dance studies, anatomy, and biomechanics.

pith-pipeline@v0.9.0 · 5566 in / 1201 out tokens · 59894 ms · 2026-05-10T05:52:22.663624+00:00 · methodology

discussion (0)

